[gmx-users] 35% Variability in Simulation Performance

Christopher Neale chris.neale at alum.utoronto.ca
Wed May 6 06:06:02 CEST 2015


I didn't read the whole thread, just chiming in based on the title.

I can see up to 3x differences in performance when using clusters that don't pack nodes. If you're using a cluster that is set up this way, then how many cores you get on each node and how many switches are involved can make a huge difference.

On one cluster with 4X QDR non-blocking connections giving a 40 Gbit/s signal rate with a 32 Gbit/s data rate (apparently 1:1 QDR IB, but I'm just copying and pasting specs here), it may not matter too much (probably less than 5% variation in my experience). However, if I use a similar cluster that instead has a 2:1 IB blocking factor, then I see massive swings in performance unless I pack nodes. You might want to talk to your sysadmins about setting up a queue reservation.
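
As a rough sketch only (these SLURM directives are an assumption about your scheduler; the right mechanism is site-specific, so ask your admins what applies), a job script that packs whole, exclusive nodes behind as few switches as possible looks something like this:

#!/bin/bash
#SBATCH --nodes=8              # whole nodes rather than loose cores (example count)
#SBATCH --ntasks-per-node=16   # fill every core on each node, i.e. pack the nodes
#SBATCH --exclusive            # do not share the nodes with other jobs
#SBATCH --switches=1           # if supported, keep all nodes behind a single switch
srun mdrun_mpi -deffnm md      # placeholder run line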

Chris.

________________________________________
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Mark Abraham <mark.j.abraham at gmail.com>
Sent: 05 May 2015 14:38
To: gmx-users at gromacs.org
Subject: Re: [gmx-users] 35% Variability in Simulation Performance

Hi,

Sometimes you can't reproduce performance. The network is a shared resource
on almost all HPC platforms, and if someone else is using it, you're just
going to wait. Even if you fix all the mdrun parameters, your performance
depends on the properties of what else is running, and how close the nodes
you are allocated are to each other. Details vary a lot, but you should ask
your sysadmins to let you try some runs when the machine is quiet (e.g.
before or after scheduled maintenance). You likely want all your nodes
within the same chunk of network, and probably there will be no way to ask
for that from the batch system. Sucks to be us.
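
If you want a quick look at how scattered an allocation is, listing the node names you were handed is often enough of a hint (sketched here with Cray's aprun, since that is what the runs below use; the rank count is only an example):

# one process per node, just to see which nodes the scheduler gave you
aprun -n 128 -N 1 hostname | sort -u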

Mark

On Thu, 30 Apr 2015 20:25 Mitchell Dorrell <mwd at udel.edu> wrote:

> I just tried a number of simulations with -notunepme, after having set the
> grid size in my mdp file and generating a new tpr. These are the lines I
> changed:
> rlist                   = 1.336
> rcoulomb                = 1.336
> rvdw                    = 1.336
> and these are the lines I added:
> fourierspacing           = 0.134
> fourier-nx               = 120
> fourier-ny               = 120
> fourier-nz               = 60
>
> I never broke 100 ns/day, though. The notunepme parameter does seem to
> improve consistency from one simulation to the next, however.
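>
> A sketch of the rebuild and rerun, with placeholder file names (grompp may be
> invoked as gmx grompp depending on the GROMACS version; the mdrun options are
> simply the ones from my earlier command, plus -notunepme):
>
> grompp -f membrane.mdp -c membrane.gro -p topol.top -o membrane.tpr
> aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -s membrane.tpr -notunepme -gpu_id 000000 -npme 256 -dlb yes -pin on -resethway -noconfout -v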
>
> How do I repeat 110 ns/day?
>
> Thanks,
> Mitchell Dorrell
>
>
> On Thu, Apr 30, 2015 at 12:28 PM, Mitchell Dorrell <mwd at udel.edu> wrote:
>
> > Hello again, using three PME ranks per node did better, but still didn't
> > reach the 110 ns/day that the "fast" simulation achieved.
> > I ran it like this:
> > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 00000 -npme 384 -dlb yes
> > -pin on -resethway -noconfout -v
> > and got 87.752 ns/day. Here is the full log: http://pastebin.com/mtyrQSXH
> >
> > I also tried two threads:
> > OMP_NUM_THREADS=2 aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -gpu_id 000 -npme
> > 128 -dlb yes -pin on -resethway -noconfout -v
> > which yielded 79.844 ns/day. Log: http://pastebin.com/vQkT9xY0
> >
> > and four threads, but with only two ranks per node, I can't maintain the
> > ratio of one PME rank to every three PP ranks, so I went half-and-half
> > instead:
> > OMP_NUM_THREADS=4 aprun -n 256 -S 4 -j 1 -d 4 mdrun_mpi -npme 128 -dlb yes -pin on -resethway -noconfout -v
> > which yielded 89.727 ns/day. Log: http://pastebin.com/Ru0rR3L2
> >
> > I don't think I can do eight threads per rank, because that implies one
> > rank per node, which leads to PME nodes not using the GPU and complaining.
> >
> > Looking over the logs, I didn't intend for my original commands (in the
> > simulations in my first email) to use two OpenMP threads per rank. I
> > actually thought I was instructing it *not* to do that by using the "-d"
> > parameter to aprun. Were those two threads running on the same core? Or
> > were they running on consecutive cores, which would then fight for the
> > shared APU?
> >
> > Hello Mark, thank you for the information... all the details regarding
> > threads and ranks have been making my head spin. I tried running the same
> > command (the one that got 110 ns/day) with dlb turned off:
> > CRAY_CUDA_PROXY=1 aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000
> > -npme 256 -dlb no -pin on -resethway -noconfout -v
> > which yielded 97.003 ns/day. Log: http://pastebin.com/73asZA6P
> >
> > I understand how to manually balance the gpu_id parameter with the npme
> > parameter, but if I let Gromacs try to determine the right number of PME
> > ranks, then the gpu_id parameter needs to vary from node-to-node in a way
> > that I can't predict before running the simulation (since I don't know what
> > Gromacs will decide is the optimum number of PME ranks). Do I have a
> > fundamental misunderstanding?
> >
> > I understand that ntomp and the environment variable OMP_NUM_THREADS are
> > exactly interchangeable (is that correct?). I haven't been touching
> > ntomp_pme or GMX_PME_NUM_THREADS, should I be? You mention 1, 2, 3, and
> > 6... where do these numbers come from? One and two make sense to me, but I
> > do not understand what to expect by using three or six (which means I may
> > not really understand one or two either).
> >
> > My greatest concern is the variability between running identical
> > simulations on identical hardware... It doesn't make a lot of sense to me
> > to test a lot of different options if my results can vary up to 35%. How
> > can I tell that a 10% improvement is not actually just an "upward
> > variation" on what should actually be poorer performance? 97 ns/day is
> > good, but shouldn't I be able to reproduce 110 ns/day somehow?
> >
> > Also, the system has around 206k atoms, rectangular box periodic
> > boundaries, roughly 17nm by 17nm by 8nm. It's a membrane simulation, so
> > there's some variability in the Z direction, but things are comparatively
> > uniform in the XY plane. Regarding the hardware, this link explains it
> > pretty well (additionally, there's one K20X Tesla GPU per node):
> > https://www.olcf.ornl.gov/kb_articles/xk7-cpu-description/
> >
> > Thank you so much for all of your help!
> > Mitchell Dorrell
> >
> >
> > On Thu, Apr 30, 2015 at 8:42 AM, Mark Abraham <mark.j.abraham at gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Some clues about the composition of the simulation system and hardware are
> >> needed to understand the full picture.
> >>
> >> On Thu, Apr 30, 2015 at 2:23 PM Mitchell Dorrell <mwd at udel.edu> wrote:
> >>
> >> > Hi Carsten, thank you very much for getting back to me. I'm already using a
> >> > couple of the parameters you suggest. These are the options I was using:
> >> >
> >> > mdrun_mpi -gpu_id 000000 -npme 256 -dlb yes -pin on -resethway -noconfout -v
> >> >
> >> > which is being called through:
> >> >
> >> > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi <mdrun options as shown above>
> >> >
> >> > Did you mean "try switching on dynamic load balancing" or "try switching
> >> > off dynamic load balancing"?
> >> >
> >>
> >> It'll generally turn itself on by default, so try switching it off.
> >>
> >> > I'll try three PME ranks per node instead of two and attach the results
> >> > later today. I would like to let Gromacs decide, but I can't figure out how
> >> > to do that without getting mismatches between GPU numbers and PP ranks.
> >> >
> >>
> >> There's one GPU per node, so all you have to manage is the length of the
> >> -gpu_id string of zeroes.
> >>
> >>
> >> > Can you suggest a command line for additional OpenMP threads with fewer MPI
> >> > ranks? I've attempted it myself, and I always seem to suffer major
> >> > performance losses. I suspect I'm setting something incorrectly. I'll run
> >> > it again with my most recent parameters and reply with those results as
> >> > well.
> >> >
> >>
> >> This is where hardware details become important. GROMACS use of OpenMP
> >> works OK so long as the set of cores that run the threads are as close
> >> together as possible. On your AMD hardware, you will get a stepwise
> >> degradation of performance as the number of OpenMP threads grows, because
> >> the level of memory cache through which they interact gets further away
> >> from the processor. Cores that do not even share L3 cache are just not
> >> worth using as part of the same OpenMP set. (This is less severe on Intel
> >> processors.) Some background info here:
> >> http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29. So I would
> >> expect that -ntomp 1,2,3,6 are all that are worth trying for you. The
> >> problem is that the PP (i.e. GPU-supported) side of things would prefer as
> >> many OpenMP threads as possible, but the other parts of the code need a lot
> >> more dev work before they can handle that well...
> >>
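> >> To make that concrete: OMP_NUM_THREADS and mdrun's -ntomp both set the number
> >> of OpenMP threads per rank, and on a Cray that count should match aprun's -d
> >> (depth) so each rank's threads sit on adjacent cores. Purely illustrative,
> >> with rank counts left as placeholders:
> >>
> >> OMP_NUM_THREADS=2 aprun -n <ranks> -d 2 -j 1 mdrun_mpi <other mdrun options>
> >> aprun -n <ranks> -d 3 -j 1 mdrun_mpi -ntomp 3 <other mdrun options>
> >>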
> >> Mark
> >>
> >> Thank you for your assistance,
> >> > Mitchell Dorrell
> >> > On Apr 30, 2015 1:10 AM, "Kutzner, Carsten" <ckutzne at gwdg.de> wrote:
> >> >
> >> > > Hi Mitchell,
> >> > >
> >> > > > On 30 Apr 2015, at 03:04, Mitchell Dorrell <mwd at udel.edu> wrote:
> >> > > >
> >> > > > Hi all, I just ran the same simulation twice (ignore the difference in
> >> > > > filenames), and got very different results. Obviously, I'd like to
> >> > > > reproduce the faster simulation. I expect this probably has to do with
> >> > > > automatically-tuned parameters, but I'm not sure what is really going on.
> >> > > >
> >> > > > Links to the two log files:
> >> > > > http://pastebin.com/Wq3uAyUv
> >> > > > http://pastebin.com/sU5Qhs5h
> >> > > >
> >> > > > Also, any tips for further boosting the performance?
> >> > > There are several things you can try out here.
> >> > >
> >> > > From the log files I see that in one case the tuning of PME grid spacing
> >> > > versus Coulomb cutoff goes on until after time step 2000. During this
> >> > > tuning time, performance varies a lot and can be far from optimal. To get
> >> > > trustworthy performance numbers at the end of your run, you should
> >> > > therefore exclude the first 3000 steps or so from the performance
> >> > > measurement using the -resetstep 3000 command line parameter to mdrun
> >> > > (or, alternatively, -resethway).
> >> > >
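> >> > > For example, leaving everything else in your launch line unchanged:
> >> > >
> >> > > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi <mdrun options as before> -resetstep 3000
> >> > >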
> >> > > Another issue that is present in both simulations is that your 256 PME
> >> > > nodes have way more work to do than the PP (direct-space) nodes. So you
> >> > > could try to increase the fraction of PME nodes. Or try switching off
> >> > > dynamic load balancing, so that the Coulomb cutoff can grow larger (and
> >> > > the PME grid smaller) during PME/PP load balancing. With your system you
> >> > > seem to be near the parallelization limit, so it should help to use more
> >> > > than 2 OpenMP threads per MPI rank (thus a smaller number of MPI ranks).
> >> > > Try 4 or more; the reduced number of PME MPI ranks will greatly reduce
> >> > > the PME communication. At the same time, the PP domains will be larger,
> >> > > which is beneficial for load balancing.
> >> > >
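> >> > > One illustrative starting point (the rank count assumes the same node
> >> > > allocation as before; the -npme value and the per-node GPU mapping still
> >> > > have to be matched to how many PP ranks end up on each node):
> >> > >
> >> > > OMP_NUM_THREADS=4 aprun -n 256 -d 4 -j 1 mdrun_mpi -npme 64 <other mdrun options>
> >> > >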
> >> > > Carsten
> >> > >
> >> > >
> >> > > >
> >> > > > Thank you for your help!
> >> > > > Mitchell Dorrell
> >> > >
> >> > > --
> >> > > Dr. Carsten Kutzner
> >> > > Max Planck Institute for Biophysical Chemistry
> >> > > Theoretical and Computational Biophysics
> >> > > Am Fassberg 11, 37077 Goettingen, Germany
> >> > > Tel. +49-551-2012313, Fax: +49-551-2012302
> >> > > http://www.mpibpc.mpg.de/grubmueller/kutzner
> >> > > http://www.mpibpc.mpg.de/grubmueller/sppexa
> >> > >