[gmx-users] performance issue with many short MD runs

Peter Kroon p.c.kroon at rug.nl
Mon Mar 27 16:24:29 CEST 2017


Hi,


On the new machine your CUDA runtime and driver versions are lower than
on the old machine. Maybe that could explain it? (Is the GPU even used
with -rerun?) To pick up a newer CUDA version you would need to
recompile GROMACS.
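
To check both points quickly (topol.tpr and traj.xtc below stand in for
your own files): nvidia-smi shows the driver version and whether mdrun
shows up as a GPU process, and "-nb cpu" forces the nonbonded work onto
the CPU. If the CPU-only timing is about the same as before, the GPU was
not doing much for the rerun anyway:

    # driver version and current GPU processes
    nvidia-smi
    # the same rerun, with nonbonded kernels forced onto the CPU
    time gmx mdrun -nb cpu -s topol.tpr -rerun traj.xtc

If you do rebuild against a newer CUDA toolkit, you can point CMake at
it explicitly, e.g.

    cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda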


Peter


On 27-03-17 15:51, Michael Brunsteiner wrote:
> Hi,
> I have to run a lot (many thousands) of very short MD reruns with gmx.
> Using gmx-2016.3 it works without problems; however, what I see is that
> the overall performance (in terms of REAL execution time, as measured
> with the unix time command) which I get on a relatively new computer is
> poorer than what I get with a much older machine (by a factor of about
> 2 - this in spite of gmx reporting a better performance for the new
> machine in the log file).
>
> Both machines run Linux (Debian); the old one has eight Intel cores,
> the newer one 12. On the newer machine gmx uses a supposedly faster
> SIMD instruction set; otherwise the hardware (including hard drives)
> is comparable.
>
> Below is the output of a typical job (gmx mdrun -rerun with a
> trajectory containing not more than a couple of thousand conformations
> of a single small molecule) on both machines (mdp file content below).
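>
> Each batch is essentially a loop like the following (a sketch;
> topol.tpr and the conf_*.xtc names are placeholders):
>
>   for traj in conf_*.xtc; do
>       time gmx mdrun -s topol.tpr -rerun "$traj"
>   done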
>
> old machine:
> prompt> time gmx mdrun ...
> in the log file:
>                Core t (s)   Wall t (s)        (%)
>        Time:        4.527        0.566      800.0
>                  (ns/day)    (hour/ns)
> Performance:        1.527       15.719
> on the command line:
> real    2m45.562s  <====================================
> user    15m40.901s
> sys     0m33.319s
>
> new machine:
> prompt> time gmx mdrun ...
> in the log file:
>                Core t (s)   Wall t (s)        (%)
>        Time:        6.030        0.502     1200.0
>                  (ns/day)    (hour/ns)
> Performance:        1.719       13.958
>
> on the command line:
> real    5m30.962s  <====================================
> user    20m2.208s
> sys     3m28.676s
>
> The specs of the two gmx installations are given below. I'd be grateful
> if anyone could suggest ways to improve performance on the newer
> machine!
> cheers,
> Michael
>
>
> the older machine (here the jobs run faster):
> gmx --version
>
> GROMACS version:    2016.3
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:        CUDA
> SIMD instructions:  SSE4.1
> FFT library:        fftw-3.3.5-sse2
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      hwloc-1.8.0
> Tracing support:    disabled
> Built on:           Tue Mar 21 11:24:42 CET 2017
> Built by:           root at rcpetemp1 [CMAKE]
> Build OS/arch:      Linux 3.13.0-79-generic x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Intel(R) Core(TM) i7 CPU         960  @ 3.20GHz
> Build CPU family:   6   Model: 26   Stepping: 5
> Build CPU features: apic clfsh cmov cx8 cx16 htt lahf mmx msr nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> C compiler:         /usr/bin/cc GNU 4.8.4
> C compiler flags:    -msse4.1     -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  
> C++ compiler:       /usr/bin/c++ GNU 4.8.4
> C++ compiler flags:  -msse4.1    -std=c++0x   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on Tue_Aug_11_14:27:32_CDT_2015;Cuda compilation tools, release 7.5, V7.5.17
> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-use_fast_math;;;-Xcompiler;,-msse4.1,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,; 
> CUDA driver:        7.50
> CUDA runtime:       7.50
>
>
>
> the newer machine (here execution is slower by a factor of 2):
> gmx --version
>
> GROMACS version:    2016.3
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:        CUDA
> SIMD instructions:  AVX_256
> FFT library:        fftw-3.3.5
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      hwloc-1.10.0
> Tracing support:    disabled
> Built on:           Fri Mar 24 11:18:29 CET 2017
> Built by:           root at rcpe-sbd-node01 [CMAKE]
> Build OS/arch:      Linux 3.14-2-amd64 x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
> Build CPU family:   6   Model: 62   Stepping: 4
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /usr/bin/cc GNU 4.9.2
> C compiler flags:    -mavx     -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  
> C++ compiler:       /usr/bin/c++ GNU 4.9.2
> C++ compiler flags:  -mavx    -std=c++0x   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  
> CUDA compiler:      /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;;;-Xcompiler;,-mavx,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,; 
> CUDA driver:        6.50
> CUDA runtime:       5.50
>
>
>
> mdp-file:
>
> integrator               = md
> dt                       = 0.001
> nsteps                   = 0
> comm-grps                = System
> cutoff-scheme            = verlet
> ;
> nstxout                  = 0
> nstvout                  = 0
> nstfout                  = 0
> nstlog                   = 0
> nstenergy                = 1
> ;
> nstlist                  = 10000
> ns_type                  = grid
> pbc                      = xyz
> rlist                    = 3.9
> ;
> coulombtype              = cut-off
> rcoulomb                 = 3.9
> vdw_type                 = cut-off
> rvdw                     = 3.9
> DispCorr                 = no
> ;
> constraints              = none
> ;
> continuation             = yes



