[gmx-users] 35% Variability in Simulation Performance

Mark Abraham mark.j.abraham at gmail.com
Thu Apr 30 14:42:50 CEST 2015


Hi,

Some clues about the composition of the simulation system and hardware are
needed to understand the full picture.

On Thu, Apr 30, 2015 at 2:23 PM Mitchell Dorrell <mwd at udel.edu> wrote:

> Hi Carsten, thank you very much for getting back to me. I'm already using a
> couple of the parameters you suggest. These are the options I was using:
>
> mdrun_mpi -gpu_id 000000 -npme 256 -dlb yes -pin on -resethway -noconfout
> -v
>
> which is being called through:
>
> aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi <mdrun options as shown above>
>
> Did you mean "try switching on dynamic load balancing" or "try switching
> off dynamic load balancing"?
>

It'll generally turn itself on by default, so try switching it off.
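
For example, keeping everything else the same and just disabling it:

aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000 -npme 256 -dlb no -pin on -resethway -noconfout -v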

I'll try three PME ranks per node instead of two and attach the results
> later today. I would like to let Gromacs decide, but I can't figure out how
> to do that without getting mismatches between GPU numbers and PP ranks.
>

There's one GPU per node, so all you have to manage is the length of the
-gpu_id string of zeroes.
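
For example, if you are running 8 ranks per node (which your current setup
suggests): with 2 PME ranks per node there are 6 PP ranks sharing GPU 0,
hence -gpu_id 000000; with 3 PME ranks per node that drops to 5 PP ranks,
hence -gpu_id 00000.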


> Can you suggest a command line for additional OpenMP threads with fewer MPI
> ranks? I've attempted it myself, and I always seem to suffer major
> performance losses. I suspect I'm setting something incorrectly. I'll run
> it again with my most recent parameters and reply with those results as
> well.
>

This is where hardware details become important. GROMACS' use of OpenMP
works well so long as the cores that run the threads are as close together
as possible. On your AMD hardware, you will get a stepwise degradation of
performance as the number of OpenMP threads grows, because the level of
memory cache through which they interact gets further away from the
processor. Cores that do not even share L3 cache are just not worth using
as part of the same OpenMP set. (This is less severe on Intel processors.)
There's some background info here:
http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29. So I would
expect that -ntomp 1, 2, 3 or 6 are all that are worth trying for you. The
problem is that the PP (i.e. GPU-supported) side of things would prefer as
many OpenMP threads as possible, but the other parts of the code need a lot
more dev work before they can handle that well...
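
A sketch of what fewer MPI ranks with more OpenMP threads could look like.
This is untested; I'm assuming 8 usable cores per node based on your current
-S 4 -j 1 layout, so treat the placement flags and the -npme count as
placeholders to adjust:

aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -ntomp 2 -gpu_id 000 -npme 128 -pin on -resethway -noconfout -v

With 4 ranks per node and one of them doing PME, three PP ranks would share
GPU 0, hence the three zeroes in -gpu_id.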

Mark

> Thank you for your assistance,
> Mitchell Dorrell
> On Apr 30, 2015 1:10 AM, "Kutzner, Carsten" <ckutzne at gwdg.de> wrote:
>
> > Hi Mitchell,
> >
> > > On 30 Apr 2015, at 03:04, Mitchell Dorrell <mwd at udel.edu> wrote:
> > >
> > > Hi all, I just ran the same simulation twice (ignore the difference in
> > > filenames), and got very different results. Obviously, I'd like to
> > > reproduce the faster simulation. I expect this probably has to do with
> > > automatically-tuned parameters, but I'm not sure what is really going on.
> > >
> > > Links to the two log files:
> > > http://pastebin.com/Wq3uAyUv
> > > http://pastebin.com/sU5Qhs5h
> > >
> > > Also, any tips for further boosting the performance?
> > There are several things you can try out here.
> >
> > From the log files I see that in one case the tuning of PME grid spacing
> > versus Coulomb cutoff goes on until after time step 2000. During this
> > tuning time, performance varies a lot and can be far from optimal. To get
> > trustworthy performance numbers at the end of your run, you should
> > therefore exclude the first 3000 steps or so from the performance
> > measurement, using the -resetstep 3000 command line parameter to mdrun
> > (or, alternatively, -resethway).
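> >
> > For example (a sketch; keep whatever other options you normally use):
> >
> > mdrun_mpi -resetstep 3000 -noconfout -v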
> >
> > Another issue, present in both simulations, is that your 256 PME nodes
> > have way more work to do than the PP (direct-space) nodes. So you could
> > try to increase the fraction of PME nodes. Or try switching off dynamic
> > load balancing, so that the Coulomb cutoff can grow larger (and the PME
> > grid smaller) during PME/PP load balancing. With your system you seem to
> > be near the parallelization limit, so it should help to use more than 2
> > OpenMP threads per MPI rank (thus a smaller number of MPI ranks). Try 4
> > or more; the reduced number of PME MPI ranks will greatly reduce the PME
> > communication. At the same time, the PP domains will be larger, which is
> > beneficial for load balancing.
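> >
> > As a hypothetical illustration (the rank counts here are placeholders;
> > keep the total thread count matched to your available cores):
> >
> > aprun -n 256 -d 4 mdrun_mpi -ntomp 4 -npme 64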
> >
> > Carsten
> >
> >
> > >
> > > Thank you for your help!
> > > Mitchell Dorrell
> >
> > --
> > Dr. Carsten Kutzner
> > Max Planck Institute for Biophysical Chemistry
> > Theoretical and Computational Biophysics
> > Am Fassberg 11, 37077 Goettingen, Germany
> > Tel. +49-551-2012313, Fax: +49-551-2012302
> > http://www.mpibpc.mpg.de/grubmueller/kutzner
> > http://www.mpibpc.mpg.de/grubmueller/sppexa

