[gmx-users] performance issue with many short MD runs
Michael Brunsteiner
mbx0009 at yahoo.com
Mon Mar 27 15:52:01 CEST 2017
Hi,

I have to run a large number (many thousands) of very short MD reruns with gmx. Using gmx-2016.3 this works without problems; however, the overall performance (in terms of real execution time, as measured with the unix time command) that I get on a relatively new computer is worse than what I get on a much older machine (by a factor of about 2, even though gmx reports better performance for the new machine in the log file).

Both machines run Linux (Debian); the old one has eight Intel cores, the newer one twelve. On the newer machine gmx uses a supposedly faster SIMD instruction set; otherwise the hardware (including hard drives) is comparable.

Below is the output of a typical job (gmx mdrun -rerun with a trajectory containing not more than a couple of thousand conformations of a single small molecule) on both machines (mdp file content below).
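For reference, each job has roughly the following shape (the exact flags are elided above; topol.tpr and traj.xtc are placeholder names, not the real files):

```shell
# Illustrative only: a single rerun timed with the shell's time builtin.
# topol.tpr / traj.xtc are placeholder filenames; only -s and -rerun
# are implied by the description above.
time gmx mdrun -s topol.tpr -rerun traj.xtc
```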
old machine:
prompt> time gmx mdrun ...
in the log file:
Core t (s) Wall t (s) (%)
Time: 4.527 0.566 800.0
(ns/day) (hour/ns)
Performance: 1.527 15.719
on the command line:
real 2m45.562s <====================================
user 15m40.901s
sys 0m33.319s
new machine:
prompt> time gmx mdrun ...
in the log file:
Core t (s) Wall t (s) (%)
Time: 6.030 0.502 1200.0
(ns/day) (hour/ns)
Performance: 1.719 13.958
on the command line:
real 5m30.962s <====================================
user 20m2.208s
sys 3m28.676s
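A quick sanity check on these numbers (plain arithmetic on the figures quoted above, nothing gmx-specific): the "Wall t" that gmx reports covers only the MD loop, so on both machines nearly all of the real time measured by the shell is spent outside it:

```python
# Gap between gmx's reported wall time (MD loop only) and the shell's
# real time (whole process, incl. startup and I/O), per the output above.
old_wall, old_real = 0.566, 2 * 60 + 45.562   # seconds, old machine
new_wall, new_real = 0.502, 5 * 60 + 30.962   # seconds, new machine

old_overhead = old_real - old_wall
new_overhead = new_real - new_wall

print(f"old machine: {old_overhead:.1f} s outside the MD loop "
      f"({100 * old_overhead / old_real:.1f}% of real time)")
print(f"new machine: {new_overhead:.1f} s outside the MD loop "
      f"({100 * new_overhead / new_real:.1f}% of real time)")
```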
The specs of the two gmx installations are given below. I'd be grateful if anyone could suggest ways to improve performance on the newer machine!

cheers,
Michael
the older machine (here the jobs run faster): gmx --version
GROMACS version: 2016.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: CUDA
SIMD instructions: SSE4.1
FFT library: fftw-3.3.5-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.8.0
Tracing support: disabled
Built on: Tue Mar 21 11:24:42 CET 2017
Built by: root at rcpetemp1 [CMAKE]
Build OS/arch: Linux 3.13.0-79-generic x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
Build CPU family: 6 Model: 26 Stepping: 5
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf mmx msr nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/cc GNU 4.8.4
C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 4.8.4
C++ compiler flags: -msse4.1 -std=c++0x -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on Tue_Aug_11_14:27:32_CDT_2015;Cuda compilation tools, release 7.5, V7.5.17
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-use_fast_math;;;-Xcompiler;,-msse4.1,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
CUDA driver: 7.50
CUDA runtime: 7.50
the newer machine (here execution is slower by a factor of 2): gmx --version
GROMACS version: 2016.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: CUDA
SIMD instructions: AVX_256
FFT library: fftw-3.3.5
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.10.0
Tracing support: disabled
Built on: Fri Mar 24 11:18:29 CET 2017
Built by: root at rcpe-sbd-node01 [CMAKE]
Build OS/arch: Linux 3.14-2-amd64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Build CPU family: 6 Model: 62 Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 4.9.2
C compiler flags: -mavx -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 4.9.2
C++ compiler flags: -mavx -std=c++0x -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;;;-Xcompiler;,-mavx,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
CUDA driver: 6.50
CUDA runtime: 5.50
mdp-file:
integrator = md
dt = 0.001
nsteps = 0
comm-grps = System
cutoff-scheme = verlet
;
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 0
nstenergy = 1
;
nstlist = 10000
ns_type = grid
pbc = xyz
rlist = 3.9
;
coulombtype = cut-off
rcoulomb = 3.9
vdw_type = cut-off
rvdw = 3.9
DispCorr = no
;
constraints = none
;
continuation = yes