[gmx-users] using dual CPU's

p buscemi pbuscemi at q.com
Mon Dec 10 15:24:51 CET 2018


Thank you, Mark, for the prompt response. I realize the limitations of the system (it's over 8 years old), but I did not expect the speed to drop by 50% with 12 available threads! No combination of -ntomp and -ntmpi could raise ns/day above 4 with two GPUs, versus 6 with one GPU.
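For anyone trying the same kind of search, a sweep over rank/thread splits is easy to script. This is only a sketch: it echoes the candidate mdrun command lines rather than running them (drop the "echo" to execute), the .tpr/deffnm name is the one from this thread, and -resethway/-nsteps are used to keep each trial short.

```shell
#!/bin/sh
# Sketch: enumerate thread-MPI rank / OpenMP thread splits of a 12-thread CPU
# with two GPUs. Each split keeps ntmpi * ntomp = 12. Commands are echoed for
# review rather than executed; remove "echo" to actually run the benchmarks.
for ntmpi in 1 2 4; do
  ntomp=$((12 / ntmpi))                       # threads per rank
  echo gmx mdrun -deffnm SR.sys.nvt -ntmpi "$ntmpi" -ntomp "$ntomp" \
       -gpu_id 01 -pin on -resethway -nsteps 10000
done
```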

This is actually a learning/practice run for a new build: an AMD 4.2 GHz 32-core Threadripper with 64 GB RAM. In this case I am trying to decide between one RTX 2080 Ti and two GTX 1080 Ti. I'd prefer the two 1080s for their ~7000 cores versus the ~4500 of the 2080. The model systems will have ~1 million particles and need the speed. But this is a major expense, so I need to get it right.
I'll do as you suggest and report the results for both systems. I really appreciate the assist.
Paul
UMN, BICB

On Dec 9 2018, at 4:32 pm, paul buscemi <pbuscemi at q.com> wrote:
>
> Dear Users,
> I have had good luck using a single GPU with the basic setup. However, in going from one GTX 1060 to a system with two (50,000 atoms), the rate decreases from 10 ns/day to 5 or worse. The system models a ligand, solvent (water), and a lipid membrane.
> The CPU is a 6-core Intel i7 970 (12 threads), with a 750 W PSU and 16 GB RAM.
> With the basic mdrun command I get:
> Back Off! I just backed up sys.nvt.log to ./#.sys.nvt.log.10#
> Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
> Changing nstlist from 10 to 100, rlist from 1 to 1
>
> Using 2 MPI threads
> Using 6 OpenMP threads per tMPI thread
>
> On host I7 2 GPUs auto-selected for this run.
> Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
> PP:0,PP:1
>
> Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.10#
> Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.10#
> NOTE: DLB will not turn on during the first phase of PME tuning
> starting mdrun 'SR-TA'
> 100000 steps, 100.0 ps.
> and ending with ^C
>
> Received the INT signal, stopping within 200 steps
>
> Dynamic load balancing report:
> DLB was locked at the end of the run due to unfinished PP-PME balancing.
> Average load imbalance: 0.7%.
> The balanceable part of the MD step is 46%, load imbalance is computed from this.
> Part of the total run time spent waiting due to load imbalance: 0.3%.
>
>
> Core t (s) Wall t (s) (%)
> Time: 543.475 45.290 1200.0
> (ns/day) (hour/ns)
> Performance: 1.719 13.963 (before DLB is turned on)
>
> Very poor performance. I have been following - or trying to follow - "Performance Tuning and Optimization of GROMACS" by M. Abraham and R. Apostolov (2016), but have not yet cracked the code.
> ----------------
> gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 3 -gpu_id 01 -pin on
>
>
> Back Off! I just backed up SR.sys.nvt.log to ./#SR.sys.nvt.log.13#
> Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
> Changing nstlist from 10 to 100, rlist from 1 to 1
>
> Using 2 MPI threads
> Using 3 OpenMP threads per tMPI thread
>
> On host I7 2 GPUs auto-selected for this run.
> Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
> PP:0,PP:1
>
> Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.13#
> Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.13#
> NOTE: DLB will not turn on during the first phase of PME tuning
> starting mdrun 'SR-TA'
> 100000 steps, 100.0 ps.
>
> NOTE: DLB can now turn on, when beneficial
> ^C
>
> Received the INT signal, stopping within 200 steps
>
> Dynamic load balancing report:
> DLB was off during the run due to low measured imbalance.
> Average load imbalance: 0.7%.
> The balanceable part of the MD step is 46%, load imbalance is computed from this.
> Part of the total run time spent waiting due to load imbalance: 0.3%.
>
>
> Core t (s) Wall t (s) (%)
> Time: 953.837 158.973 600.0
> (ns/day) (hour/ns)
> Performance: 2.935 8.176
>
> ====================
> the beginning of the log file is
> GROMACS version: 2018.3
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support: CUDA
> SIMD instructions: SSE4.1
> FFT library: fftw-3.3.8-sse2
> RDTSCP usage: enabled
> TNG support: enabled
> Hwloc support: disabled
> Tracing support: disabled
> Built on: 2018-10-19 21:26:38
> Built by: pb at Q4 [CMAKE]
> Build OS/arch: Linux 4.15.0-20-generic x86_64
> Build CPU vendor: Intel
> Build CPU brand: Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
> Build CPU family: 6 Model: 44 Stepping: 2
> Build CPU features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> C compiler: /usr/bin/gcc-6 GNU 6.4.0
> C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> C++ compiler: /usr/bin/g++-6 GNU 6.4.0
> C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
> CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> CUDA driver: 9.10
> CUDA runtime: 9.10
>
>
> Running on 1 node with total 12 cores, 12 logical cores, 2 compatible GPUs
> Hardware detected:
> CPU info:
> Vendor: Intel
> Brand: Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
> Family: 6 Model: 44 Stepping: 2
> Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> Hardware topology: Only logical processor count
> GPU info:
> Number of GPUs detected: 2
> #0: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
> #1: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat: compatible
>
>
> There were no errors encountered during the runs. Suggestions would be appreciated.
> Regards
> Paul
>
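As a side note for readers comparing the two quoted runs: the ns/day and hour/ns figures GROMACS prints are reciprocals of each other via the 24 hours in a day, so either column alone is enough to compare configurations. A minimal check against the numbers above:

```python
# GROMACS reports throughput both as ns/day and as its reciprocal, hour/ns.
# They are related by: ns_per_day = 24 / hours_per_ns.
def ns_per_day(hours_per_ns):
    """Convert an hour/ns figure (as printed by mdrun) to ns/day."""
    return 24.0 / hours_per_ns

print(round(ns_per_day(13.963), 3))  # first run in the thread: 1.719
print(round(ns_per_day(8.176), 3))   # second run in the thread: 2.935
```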


