[gmx-users] using dual CPU's

Sun Dec 9 23:32:10 CET 2018

Dear Users,

I have had good luck using a single GPU with the basic setup. However, in going from one GTX 1060 to a system with two (50,000 atoms), the rate decreased from 10 ns/day to 5 or worse. The system models a ligand, solvent (water) and a lipid membrane.
The CPU is a 6-core Intel i7 970 (12 threads), with a 750 W power supply and 16 GB RAM.
With the basic "mdrun" command I get:
Back Off! I just backed up sys.nvt.log to ./#.sys.nvt.log.10#
Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
Changing nstlist from 10 to 100, rlist from 1 to 1

Using 2 MPI threads
Using 6 OpenMP threads per tMPI thread

On host I7 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:1

Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.10#
Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.10#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'SR-TA'
100000 steps, 100.0 ps.
and ending with ^C

Received the INT signal, stopping within 200 steps

Dynamic load balancing report:
DLB was locked at the end of the run due to unfinished PP-PME balancing.
Average load imbalance: 0.7%.
The balanceable part of the MD step is 46%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.3%.

Core t (s) Wall t (s) (%)
Time: 543.475 45.290 1200.0
(ns/day) (hour/ns)
Performance: 1.719 13.963 (before DLB is turned on)

Very poor performance. I have been following, or trying to follow, "Performance Tuning and Optimization of GROMACS" by M. Abraham and R. Apostolov (2016), but have not yet cracked the code.
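
The trial-and-error described here could be organised by listing the thread layouts that fit this machine. The following is only a sketch: it assumes one thread-MPI rank per GPU (as in the "2 GPU tasks in the 2 ranks" mapping reported by mdrun) and uses only the mdrun options that appear in this post. The function name and the choice to sweep -ntomp from 1 up to the per-rank maximum are illustrative assumptions, not a recommended tuning procedure.

```python
def mdrun_variants(deffnm, logical_cores, n_gpus):
    """Return gmx mdrun command lines, one per OpenMP thread count.

    Assumption: one thread-MPI rank per GPU, so -ntmpi equals n_gpus and
    -ntomp may range from 1 to logical_cores // n_gpus.
    """
    gpu_ids = "".join(str(i) for i in range(n_gpus))  # "01" for two GPUs
    max_ntomp = logical_cores // n_gpus
    commands = []
    for ntomp in range(1, max_ntomp + 1):
        commands.append(
            f"gmx mdrun -deffnm {deffnm} -ntmpi {n_gpus} "
            f"-ntomp {ntomp} -gpu_id {gpu_ids} -pin on"
        )
    return commands


# Values taken from this post: 12 logical cores, 2 GPUs, run name SR.sys.nvt.
commands = mdrun_variants("SR.sys.nvt", logical_cores=12, n_gpus=2)
for cmd in commands:
    print(cmd)
```

The script only prints the command lines; each one would still have to be run and timed by hand.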
----------------
With
gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 3 -gpu_id 01 -pin on
I get:

Back Off! I just backed up SR.sys.nvt.log to ./#SR.sys.nvt.log.13#
Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
Changing nstlist from 10 to 100, rlist from 1 to 1

Using 2 MPI threads
Using 3 OpenMP threads per tMPI thread

On host I7 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
PP:0,PP:1

Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.13#
Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.13#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'SR-TA'
100000 steps, 100.0 ps.

NOTE: DLB can now turn on, when beneficial
^C

Received the INT signal, stopping within 200 steps

Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 0.7%.
The balanceable part of the MD step is 46%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.3%.

Core t (s) Wall t (s) (%)
Time: 953.837 158.973 600.0
(ns/day) (hour/ns)
Performance: 2.935 8.176
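
As a sanity check on the two timing summaries, the short Python sketch below recomputes the derived columns from the raw numbers. It assumes the (%) column is core time divided by wall time and that hour/ns is 24 divided by ns/day. The helper names are made up for illustration and are not part of GROMACS. The 10 ns/day reference is the single-GPU rate quoted at the top of this post.

```python
# Timing rows copied from the two mdrun summaries in this post.
# "ranks" and "omp_threads" are the thread-MPI / OpenMP counts each run reported.
runs = [
    {"label": "default mdrun", "ranks": 2, "omp_threads": 6,
     "core_s": 543.475, "wall_s": 45.290, "ns_per_day": 1.719},
    {"label": "-ntmpi 2 -ntomp 3", "ranks": 2, "omp_threads": 3,
     "core_s": 953.837, "wall_s": 158.973, "ns_per_day": 2.935},
]


def cpu_percent(core_s, wall_s):
    """CPU usage as a percentage: core time divided by wall time."""
    return 100.0 * core_s / wall_s


def hours_per_ns(ns_per_day):
    """Convert a rate in ns/day into hours needed per simulated ns."""
    return 24.0 / ns_per_day


def slowdown(ns_per_day, reference_ns_per_day=10.0):
    """Factor by which a run is slower than the quoted single-GPU rate."""
    return reference_ns_per_day / ns_per_day


for run in runs:
    threads = run["ranks"] * run["omp_threads"]
    print(f"{run['label']}: {threads} threads, "
          f"{cpu_percent(run['core_s'], run['wall_s']):.1f}% CPU, "
          f"{hours_per_ns(run['ns_per_day']):.3f} hour/ns, "
          f"{slowdown(run['ns_per_day']):.1f}x slower than 10 ns/day")
```

The thread counts times 100 (1200 and 600) match the (%) column, and the recomputed hour/ns values agree with the log to within the rounding of ns/day.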

====================
the beginning of the log file is
GROMACS version: 2018.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: SSE4.1
FFT library: fftw-3.3.8-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-10-19 21:26:38
Built by: pb at Q4 [CMAKE]
Build OS/arch: Linux 4.15.0-20-generic x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
Build CPU family: 6 Model: 44 Stepping: 2
Build CPU features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/gcc-6 GNU 6.4.0
C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/g++-6 GNU 6.4.0
C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 9.10
CUDA runtime: 9.10

Running on 1 node with total 12 cores, 12 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
Family: 6 Model: 44 Stepping: 2
Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
Hardware topology: Only logical processor count
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible

There were no errors encountered during the runs. Suggestions would be appreciated.
Regards
Paul