[gmx-users] 35% Variability in Simulation Performance

Mitchell Dorrell mwd at udel.edu
Thu Apr 30 18:29:09 CEST 2015


Hello again. Using three PME ranks per node did better, but it still didn't
reach the 110 ns/day that the "fast" simulation achieved.
I ran it like this:
aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 00000 -npme 384 -dlb yes
-pin on -resethway -noconfout -v
and got 87.752 ns/day. Here is the full log: http://pastebin.com/mtyrQSXH
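
(For reference, the per-node arithmetic behind those numbers, assuming the
PME ranks are interleaved so that every one of the 128 nodes gets the same
mix:)

NODES=128; RANKS=1024; NPME=384
echo "$((RANKS / NODES)) ranks/node = $((NPME / NODES)) PME + $(( (RANKS - NPME) / NODES )) PP"
# prints "8 ranks/node = 3 PME + 5 PP", hence the five-digit -gpu_id 00000
# (one '0' per PP rank on the node, all mapped to GPU 0)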

I also tried two threads:
OMP_NUM_THREADS=2 aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -gpu_id 000 -npme
128 -dlb yes -pin on -resethway -noconfout -v
which yielded 79.844 ns/day. Log: http://pastebin.com/vQkT9xY0

I also tried four threads, but with only two ranks per node I can't maintain
the ratio of one PME rank to every three PP ranks, so I went half-and-half
instead:
OMP_NUM_THREADS=4 aprun -n 256 -S 4 -j 1 -d 4 mdrun_mpi -npme 128 -dlb yes
-pin on -resethway -noconfout -v
which yielded 89.727 ns/day. Log: http://pastebin.com/Ru0rR3L2

I don't think I can use eight threads per rank, because that implies one
rank per node, which leaves the PME-only nodes' GPUs unused and makes mdrun
complain.

Looking over the logs, I didn't intend for my original commands (in the
simulations in my first email) to use two OpenMP threads per rank. I
actually thought I was instructing it *not* to do that by using the "-d"
parameter to aprun. Were those two threads running on the same core? Or
were they running on consecutive cores, which would then fight over the
shared FPU?
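
(If it would help to see exactly where those threads landed, I could run a
trivial placement probe with the same aprun geometry; a sketch, assuming the
usual Cray ALPS environment where ALPS_APP_PE holds the rank index:)

# Same geometry as the mdrun job, but each rank just reports which CPUs it is
# allowed to run on, so same-core vs. consecutive-core placement is visible.
OMP_NUM_THREADS=2 aprun -n 512 -S 2 -j 1 -d 2 \
    sh -c 'echo "rank $ALPS_APP_PE: $(grep Cpus_allowed_list /proc/self/status)"'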

Hello Mark, thank you for the information... all the details regarding
threads and ranks have been making my head spin. I tried running the same
command (the one that got 110 ns/day) with dlb turned off:
CRAY_CUDA_PROXY=1 aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000
-npme 256 -dlb no -pin on -resethway -noconfout -v
which yielded 97.003 ns/day. Log: http://pastebin.com/73asZA6P

I understand how to manually balance the gpu_id parameter against the npme
parameter, but if I let Gromacs determine the number of PME ranks itself,
then the gpu_id parameter needs to vary from node to node in a way that I
can't predict before running the simulation (since I don't know in advance
how many PME ranks Gromacs will decide is optimal). Do I have a fundamental
misunderstanding?
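
(For what it's worth, this is the helper arithmetic I have been doing by
hand, written out as a sketch; it assumes interleaved PME ranks so that
every node carries the same number of PP ranks:)

# Derive the -gpu_id string for a chosen -npme on 128 single-GPU nodes.
NODES=128; RANKS=1024
NPME=256                                         # whatever value is under test
PP_PER_NODE=$(( (RANKS - NPME) / NODES ))        # e.g. 6
GPU_ID=$(printf '0%.0s' $(seq 1 $PP_PER_NODE))   # "000000", one digit per PP rank
aprun -n $RANKS -S 4 -j 1 -d 1 mdrun_mpi -gpu_id $GPU_ID -npme $NPME \
    -dlb yes -pin on -resethway -noconfout -v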

I understand that -ntomp and the environment variable OMP_NUM_THREADS are
exactly interchangeable (is that correct?). I haven't been touching
-ntomp_pme or GMX_PME_NUM_THREADS; should I be? You mention 1, 2, 3, and
6... where do these numbers come from? One and two make sense to me, but I
do not understand what to expect by using three or six (which means I may
not really understand one or two either).
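
(In case it matters, this is how I understand the fully explicit form would
look; just a sketch of the flags, not something I have benchmarked yet:)

# -ntomp sets OpenMP threads per PP rank, -ntomp_pme per PME rank; spelling
# both out avoids relying on OMP_NUM_THREADS being picked up as intended.
aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -ntomp 2 -ntomp_pme 2 -gpu_id 000 \
    -npme 128 -dlb yes -pin on -resethway -noconfout -v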

My greatest concern is the variability between runs of identical
simulations on identical hardware... It doesn't make a lot of sense to test
a lot of different options if my results can vary by up to 35%. How can I
tell that a 10% improvement is real and not just an "upward variation" of a
configuration that should actually perform worse? 97 ns/day is good, but
shouldn't I be able to reproduce 110 ns/day somehow?
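
(What I will probably do next is repeat the same short benchmark several
times and only compare averages; a rough sketch of the loop I have in mind,
reading the Performance line mdrun writes at the end of each log:)

# Run the identical benchmark five times and collect the ns/day figures,
# so a single lucky or unlucky run does not drive the tuning decisions.
for i in 1 2 3 4 5; do
  CRAY_CUDA_PROXY=1 aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000 \
      -npme 256 -dlb no -pin on -resethway -noconfout -g bench_$i.log -v
done
grep -H "^Performance:" bench_*.log   # first number on that line is ns/day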

Also, the system has around 206k atoms in a rectangular periodic box of
roughly 17 nm by 17 nm by 8 nm. It's a membrane simulation, so there's some
variability in the Z direction, but things are comparatively uniform in the
XY plane. Regarding the hardware, this link explains it pretty well (in
addition, there's one K20X Tesla GPU per node):
https://www.olcf.ornl.gov/kb_articles/xk7-cpu-description/

Thank you so much for all of your help!
Mitchell Dorrell


On Thu, Apr 30, 2015 at 8:42 AM, Mark Abraham <mark.j.abraham at gmail.com>
wrote:

> Hi,
>
> Some clues about the composition of the simulation system and hardware are
> needed to understand the full picture.
>
> On Thu, Apr 30, 2015 at 2:23 PM Mitchell Dorrell <mwd at udel.edu> wrote:
>
> > Hi Carsten, thank you very much for getting back to me. I'm already using a
> > couple of the parameters you suggest. These are the options I was using:
> >
> > mdrun_mpi -gpu_id 000000 -npme 256 -dlb yes -pin on -resethway -noconfout
> > -v
> >
> > which is being called through:
> >
> > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi <mdrun options as shown above>
> >
> > Did you mean "try switching on dynamic load balancing" or "try switching
> > off dynamic load balancing"?
> >
>
> It'll generally turn itself on by default, so try switching it off.
>
> > I'll try three PME ranks per node instead of two and attach the results
> > later today. I would like to let Gromacs decide, but I can't figure out how
> > to do that without getting mismatches between GPU numbers and PP ranks.
> >
>
> There's one GPU per node, so all you have to manage is the length of the
> -gpu_id string of zeroes.
>
>
> > Can you suggest a command line for additional OpenMP threads with fewer MPI
> > ranks? I've attempted it myself, and I always seem to suffer major
> > performance losses. I suspect I'm setting something incorrectly. I'll run
> > it again with my most recent parameters and reply with those results as
> > well.
> >
>
> This is where hardware details become important. GROMACS's use of OpenMP
> works OK so long as the set of cores that runs the threads is as close
> together as possible. On your AMD hardware, you will get a stepwise
> degradation of performance as the number of OpenMP threads grows, because
> the level of memory cache through which they interact gets further away
> from the processor. Cores that do not even share L3 cache are just not
> worth using as part of the same OpenMP set. (This is less severe on Intel
> processors.) There is some background info here:
> http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29. So I would
> expect that -ntomp 1, 2, 3, and 6 are all that are worth trying for you. The
> problem is that the PP (i.e. GPU-supported) side of things would prefer as
> many OpenMP threads as possible, but the other parts of the code need a lot
> more dev work before they can handle that well...
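>
> (A quick sketch for seeing which cores actually share a cache, assuming the
> compute nodes expose the standard Linux sysfs topology or have hwloc
> installed:)
>
> # hwloc's lstopo draws the core/cache hierarchy directly:
> lstopo --no-io
> # or, via sysfs: list the CPUs that share cpu0's last-level cache
> # (index3 is usually L3; the adjacent "level" file confirms which level it is)
> cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list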
>
> Mark
>
> Thank you for your assistance,
> > Mitchell Dorrell
> > On Apr 30, 2015 1:10 AM, "Kutzner, Carsten" <ckutzne at gwdg.de> wrote:
> >
> > > Hi Mitchell,
> > >
> > > > On 30 Apr 2015, at 03:04, Mitchell Dorrell <mwd at udel.edu> wrote:
> > > >
> > > > Hi all, I just ran the same simulation twice (ignore the difference in
> > > > filenames), and got very different results. Obviously, I'd like to
> > > > reproduce the faster simulation. I expect this probably has to do with
> > > > automatically-tuned parameters, but I'm not sure what is really going on.
> > > >
> > > > Links to the two log files:
> > > > http://pastebin.com/Wq3uAyUv
> > > > http://pastebin.com/sU5Qhs5h
> > > >
> > > > Also, any tips for further boosting the performance?
> > > there are several things you can try out here.
> > >
> > > From the log files I see that in one case the tuning of PME grid spacing
> > > versus Coulomb cutoff goes on until after time step 2000. During this
> > > tuning time, performance varies a lot and can be far from optimal. To get
> > > trustworthy performance numbers at the end of your run, you should
> > > therefore exclude the first 3000 steps or so from the performance
> > > measurement using the -resetstep 3000 command line parameter to mdrun
> > > (or, alternatively, -resethway).
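> > >
> > > (For example, with the options from the original command, that would be
> > > something like:)
> > >
> > > # timing counters restart at step 3000, i.e. after the PME tuning is over:
> > > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000 -npme 256 -dlb yes \
> > >     -pin on -resetstep 3000 -noconfout -v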
> > >
> > > Another issue, present in both simulations, is that your 256 PME nodes
> > > have way more work to do than the PP (direct-space) nodes. So you could
> > > try to increase the fraction of PME nodes. Or try switching off dynamic
> > > load balancing, so that the Coulomb cutoff can grow larger (and the PME
> > > grid smaller) during PME/PP load balancing. With your system you seem to
> > > be near the parallelization limit, so it should help to use more than 2
> > > OpenMP threads per MPI rank (and thus a smaller number of MPI ranks). Try
> > > 4 or more; the reduced number of PME MPI ranks will greatly reduce the PME
> > > communication. At the same time, the PP domains will be larger, which is
> > > beneficial for load balancing.
> > >
> > > Carsten
> > >
> > >
> > > >
> > > > Thank you for your help!
> > > > Mitchell Dorrell
> > >
> > > --
> > > Dr. Carsten Kutzner
> > > Max Planck Institute for Biophysical Chemistry
> > > Theoretical and Computational Biophysics
> > > Am Fassberg 11, 37077 Goettingen, Germany
> > > Tel. +49-551-2012313, Fax: +49-551-2012302
> > > http://www.mpibpc.mpg.de/grubmueller/kutzner
> > > http://www.mpibpc.mpg.de/grubmueller/sppexa
> > >

