[gmx-users] performance issue with many short MD runs

Mark Abraham mark.j.abraham at gmail.com
Mon Mar 27 18:46:23 CEST 2017


Hi,

As Peter notes, there are cases where the GPU won't be used for the rerun
(specifically, when you request more than one energy group, for which it
would likely be prohibitively slow, even if we'd write and run such a
kernel on the GPU; but that is not the case here). The reason things take a
long time is that a rerun has a wildly different execution profile from
normal mdrun. Each "step" has to get the positions from some cold part of
memory/disk, do a fresh neighbor search (since mdrun can't rely on the
usual assumption that you can re-use the last one quite a few times),
launch a GPU kernel, launch CPU OpenMP regions, compute forces that often
won't even be used for output, and write whatever should be output. Most of
that code is run very rarely in a normal production simulation, so it isn't
heavily optimized. But your rerun is spending most of its time there. Since
you note that your compute load is a single small molecule, it would not be
at all surprising for the mdrun performance breakdown in the log file to
show that all the overheads take far more time than the GPU kernel that
computes the energy you want. Those overheads can take wildly different
amounts of time on different machines for all sorts of reasons, including
CUDA API overhead (as Peter noted), Linux kernel configuration, OS version,
hard disk performance, machine load, whether the sysadmin showered lately,
the phase of the moon, etc. :-)

Compare the final sections of the log files to see what I mean. Try gmx
mdrun -rerun -nb cpu, as it might actually be faster to leave the GPU idle.
If you really are doing many machine-hours of such jobs and care about
turn-around time, invest the human time in writing a script that breaks
your trajectory into pieces and gives each piece to its own
gmx mdrun -rerun -nb cpu -ntmpi 1 -ntomp 1, each pinned to a different
single core (e.g. with numactl or taskset).
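Something along these lines might do it (a rough sketch only; the piece
names, the output naming and the 12-core count are assumptions to adapt,
and the pieces themselves could be made beforehand with e.g. gmx trjconv
-b/-e or -split):

#!/bin/bash
# Run one single-threaded CPU-only rerun per trajectory piece, each pinned
# to its own core. Assumes topol.tpr and piece0.xtc ... piece11.xtc already
# exist in the working directory (hypothetical names).
NCORES=12
for i in $(seq 0 $((NCORES-1))); do
    taskset -c $i gmx mdrun -s topol.tpr -rerun piece$i.xtc \
        -nb cpu -ntmpi 1 -ntomp 1 -deffnm piece$i > piece$i.out 2>&1 &
done
wait    # let all background reruns finish

Afterwards the per-piece energy files can be concatenated with gmx eneconv
if you need them as one file.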

Mark

On Mon, Mar 27, 2017 at 4:24 PM Peter Kroon <p.c.kroon at rug.nl> wrote:

> Hi,
>
>
> On the new machine your CUDA runtime and driver versions are lower than
> on the old machine. Maybe that could explain it? (is the GPU even used
> with -rerun?) You would need to recompile gromacs.
>
>
> Peter
>
>
> On 27-03-17 15:51, Michael Brunsteiner wrote:
> > Hi,
> >
> > I have to run a lot (many thousands) of very short MD reruns with gmx.
> > Using gmx-2016.3 it works without problems; however, what I see is that
> > the overall performance (in terms of REAL execution time as measured
> > with the unix time command) which I get on a relatively new computer is
> > poorer than what I get with a much older machine (by a factor of about 2,
> > this in spite of gmx reporting better performance for the new machine in
> > the log file).
> >
> > Both machines run Linux (Debian); the old one has eight Intel cores, the
> > newer one 12. On the newer machine gmx uses a supposedly faster SIMD
> > instruction set; otherwise the hardware (including hard drives) is
> > comparable.
> >
> > Below is the output of a typical job (gmx mdrun -rerun with a trajectory
> > containing not more than a couple of thousand conformations of a single
> > small molecule) on both machines (mdp file content below).
> >
> > old machine:
> > prompt> time gmx mdrun ...
> > in the log file:
> >                Core t (s)   Wall t (s)        (%)
> >        Time:        4.527        0.566      800.0
> >                  (ns/day)    (hour/ns)
> > Performance:        1.527       15.719
> > on the command line:
> > real    2m45.562s  <====================================
> > user    15m40.901s
> > sys     0m33.319s
> >
> > new machine:
> > prompt> time gmx mdrun ...
> > in the log file:
> >                Core t (s)   Wall t (s)        (%)
> >        Time:        6.030        0.502     1200.0
> >                  (ns/day)    (hour/ns)
> > Performance:        1.719       13.958
> >
> > on the command line:
> > real    5m30.962s  <====================================
> > user    20m2.208s
> > sys     3m28.676s
> >
> > The specs of the two gmx installations are given below. I'd be grateful
> > if anyone could suggest ways to improve performance on the newer machine!
> >
> > Cheers,
> > Michael
> >
> >
> > The older machine (here the jobs run faster): gmx --version
> >
> > GROMACS version:    2016.3
> > Precision:          single
> > Memory model:       64 bit
> > MPI library:        thread_mpi
> > OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> > GPU support:        CUDA
> > SIMD instructions:  SSE4.1
> > FFT library:        fftw-3.3.5-sse2
> > RDTSCP usage:       enabled
> > TNG support:        enabled
> > Hwloc support:      hwloc-1.8.0
> > Tracing support:    disabled
> > Built on:           Tue Mar 21 11:24:42 CET 2017
> > Built by:           root at rcpetemp1 [CMAKE]
> > Build OS/arch:      Linux 3.13.0-79-generic x86_64
> > Build CPU vendor:   Intel
> > Build CPU brand:    Intel(R) Core(TM) i7 CPU         960  @ 3.20GHz
> > Build CPU family:   6   Model: 26   Stepping: 5
> > Build CPU features: apic clfsh cmov cx8 cx16 htt lahf mmx msr
> nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> > C compiler:         /usr/bin/cc GNU 4.8.4
> > C compiler flags:    -msse4.1     -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> > C++ compiler:       /usr/bin/c++ GNU 4.8.4
> > C++ compiler flags:  -msse4.1    -std=c++0x   -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> > CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on
> Tue_Aug_11_14:27:32_CDT_2015;Cuda compilation tools, release 7.5, V7.5.17
> > CUDA compiler
> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-use_fast_math;;;-Xcompiler;,-msse4.1,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
> > CUDA driver:        7.50
> > CUDA runtime:       7.50
> >
> >
> >
> > The newer machine (here execution is slower by a factor of 2): gmx --version
> >
> > GROMACS version:    2016.3
> > Precision:          single
> > Memory model:       64 bit
> > MPI library:        thread_mpi
> > OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> > GPU support:        CUDA
> > SIMD instructions:  AVX_256
> > FFT library:        fftw-3.3.5
> > RDTSCP usage:       enabled
> > TNG support:        enabled
> > Hwloc support:      hwloc-1.10.0
> > Tracing support:    disabled
> > Built on:           Fri Mar 24 11:18:29 CET 2017
> > Built by:           root at rcpe-sbd-node01 [CMAKE]
> > Build OS/arch:      Linux 3.14-2-amd64 x86_64
> > Build CPU vendor:   Intel
> > Build CPU brand:    Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
> > Build CPU family:   6   Model: 62   Stepping: 4
> > Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf mmx
> msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2
> sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> > C compiler:         /usr/bin/cc GNU 4.9.2
> > C compiler flags:    -mavx     -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> > C++ compiler:       /usr/bin/c++ GNU 4.9.2
> > C++ compiler flags:  -mavx    -std=c++0x   -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> > CUDA compiler:      /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
> Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
> > CUDA compiler
> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;;;-Xcompiler;,-mavx,,,,,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
> > CUDA driver:        6.50
> > CUDA runtime:       5.50
> >
> >
> >
> > mdp-file:
> >
> > integrator               = md
> > dt                       = 0.001
> > nsteps                   = 0
> > comm-grps                = System
> > cutoff-scheme            = verlet
> > ;
> > nstxout                  = 0
> > nstvout                  = 0
> > nstfout                  = 0
> > nstlog                   = 0
> > nstenergy                = 1
> > ;
> > nstlist                  = 10000
> > ns_type                  = grid
> > pbc                      = xyz
> > rlist                    = 3.9
> > ;
> > coulombtype              = cut-off
> > rcoulomb                 = 3.9
> > vdw_type                 = cut-off
> > rvdw                     = 3.9
> > DispCorr                 = no
> > ;
> > constraints              = none
> > ;
> > continuation             = yes
>

