[gmx-users] 35% Variability in Simulation Performance

Szilárd Páll pall.szilard at gmail.com
Fri May 8 20:45:31 CEST 2015


Brief comments/tips, I hope I did not miss essential parts of the
previous discussion :)

First of all, performance consistency on Cray's Gemini network is
known to be affected by job placement - that's why Blue Waters has a
locality-aware scheduler which can place jobs in a reasonably
compact/contiguous fashion. TITAN does not have one, and it is also
huge, so performance consistency, even with small-ish jobs, is a
matter of luck - unless you request specific nodes from the scheduler.

Secondly, you are running at quite high parallelization, ~160
atoms/core. At this scale external factors like job placement and
interfering load affect GROMACS; in particular, PME performance will
vary/suffer greatly. Runs at this scale (esp. on a busy machine with
non-ideal job placement) will most often be PME-bound. Look at the
PP-PME load balance (the "PME mesh/force load" ratio); in your case it
is 1.4-2.2, meaning that your PME load is far higher than the PP load,
so you should work on reducing it.
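
A quick way to pull those numbers out of the log (assuming the md.log
file is at hand) is something like:

  grep -i "pme mesh/force load" md.log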

Tips to get better performance (a rough sketch of the corresponding
.mdp settings follows the list):
- use a 1-1 PP-PME ratio; typically that gives the best performance
on such machines
- consider pme-order=5 instead of 4 (and a coarser FFT grid) - this
should improve PME performance
- try tweaking the LINCS settings (e.g. lincs-order=3 with
lincs-iter=2 should work) - this could improve load balance
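
For instance, the last two points would translate into .mdp lines
roughly like the following (illustrative values only, not tuned for
your system; if you coarsen fourierspacing, drop the explicit
fourier-nx/ny/nz or scale them down accordingly):

  pme-order       = 5      ; interpolation order 5 instead of the default 4
  fourierspacing  = 0.16   ; coarser grid than the 0.134 in the quoted run below
  lincs-order     = 3      ; lower LINCS expansion order...
  lincs-iter      = 2      ; ...compensated by an extra iteration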

To sum it up, if you want to reach the 110 ns/day (perhaps even
more), try reducing the PME load. Monitor the PP-PME load balance
while tweaking settings, as well as its fluctuation (if you run with
-v you'll get it printed to stderr; you can control its frequency
with -stepout).
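
As an illustrative starting point, combining the 1-1 PP-PME split
with the aprun line from the quoted thread (the rank counts and the
-gpu_id string here are assumptions - adapt them to your node count):

  # 1024 ranks, 512 of them PME (1-1 PP-PME); 4 PP ranks/node -> 4 zeroes in -gpu_id
  aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 0000 -npme 512 \
      -dlb yes -pin on -resethway -noconfout -v -stepout 100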

Cheers,
--
Szilárd


On Wed, May 6, 2015 at 6:05 AM, Christopher Neale
<chris.neale at alum.utoronto.ca> wrote:
> I didn't read the whole thread, just chiming in based on the title.
>
> I can see up to 3x differences in performance when using clusters that don't pack nodes. If you're using a cluster that is set up this way, then it can make a huge difference how many cores you get on each node and how many switches are involved.
>
> On one cluster with 4X QDR non-blocking connections giving a 40 Gbit/s signal rate and a 32 Gbit/s data rate (apparently 1:1 QDR IB, but I'm just copying and pasting specs here), it may not matter too much (probably less than 5% variation in my experience). However, if I use a similar cluster that instead has a 2:1 IB blocking factor, then I see massive swings in performance unless I pack nodes. You might want to talk to your sysadmins about setting up a queue reservation.
>
> Chris.
>
> ________________________________________
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se <gromacs.org_gmx-users-bounces at maillist.sys.kth.se> on behalf of Mark Abraham <mark.j.abraham at gmail.com>
> Sent: 05 May 2015 14:38
> To: gmx-users at gromacs.org
> Subject: Re: [gmx-users] 35% Variability in Simulation Performance
>
> Hi,
>
> Sometimes you can't reproduce performance. The network is a shared resource
> on almost all HPC platforms, and if someone else is using it, you're just
> going to wait. Even if you fix all the mdrun parameters, your performance
> depends on the properties of what else is running, and how close the nodes
> you are allocated are to each other. Details vary a lot, but you should ask
> your sysadmins to let you try some runs when the machine is quiet (e.g.
> before or after scheduled maintenance). You likely want all your nodes
> within the same chunk of network, and probably there will be no way to ask
> for that from the batch system. Sucks to be us.
>
> Mark
>
> On Thu, 30 Apr 2015 20:25 Mitchell Dorrell <mwd at udel.edu> wrote:
>
>> I just tried a number of simulations with -notunepme, after having set the
>> grid size in my mdp file and generating a new tpr. These are the lines I
>> changed:
>> rlist                   = 1.336
>> rcoulomb                = 1.336
>> rvdw                    = 1.336
>> and these are the lines I added:
>> fourierspacing           = 0.134
>> fourier-nx               = 120
>> fourier-ny               = 120
>> fourier-nz               = 60
>>
>> I never broke 100 ns/day, though. The -notunepme option does seem to
>> improve consistency from one simulation to the next.
>>
>> How do I repeat 110 ns/day?
>>
>> Thanks,
>> Mitchell Dorrell
>>
>>
>> On Thu, Apr 30, 2015 at 12:28 PM, Mitchell Dorrell <mwd at udel.edu> wrote:
>>
>> > Hello again, using three PME ranks per node did better, but still didn't
>> > reach the 110 ns/day that the "fast" simulation achieved.
>> > I ran it like this:
>> > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 00000 -npme 384 -dlb yes
>> > -pin on -resethway -noconfout -v
>> > and got 87.752 ns/day. Here is the full log:
>> http://pastebin.com/mtyrQSXH
>> >
>> > I also tried two threads:
>> > OMP_NUM_THREADS=2 aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -gpu_id 000 -npme
>> > 128 -dlb yes -pin on -resethway -noconfout -v
>> > which yielded 79.844 ns/day. Log: http://pastebin.com/vQkT9xY0
>> >
>> > and four threads, but with only two ranks per node, I can't maintain the
>> > ratio of one PME rank to every three PP ranks, so I went half-and-half
>> > instead:
>> > OMP_NUM_THREADS=4 aprun -n 256 -S 4 -j 1 -d 4 mdrun_mpi -npme 128 -dlb
>> yes
>> > -pin on -resethway -noconfout -v
>> > which yielded 89.727 ns/day. Log: http://pastebin.com/Ru0rR3L2
>> >
>> > I don't think I can do eight threads per rank, because that implies one
>> > rank per node, which leads to PME nodes not using the GPU and
>> complaining.
>> >
>> > Looking over the logs, I didn't intend for my original commands (in the
>> > simulations in my first email) to use two OpenMP threads per rank. I
>> > actually thought I was instructing it *not* to do that by using the "-d"
>> > parameter to aprun. Were those two threads running on the same core? Or
>> > were they running on consecutive cores, which would then fight for the
>> > shared FPU?
>> >
>> > Hello Mark, thank you for the information... all the details regarding
>> > threads and ranks have been making my head spin. I tried running the same
>> > command (the one that got 110 ns/day) with dlb turned off:
>> > CRAY_CUDA_PROXY=1 aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000
>> > -npme 256 -dlb no -pin on -resethway -noconfout -v
>> > which yielded 97.003 ns/day. Log: http://pastebin.com/73asZA6P
>> >
>> > I understand how to manually balance the gpu_id parameter with the npme
>> > parameter, but if I let Gromacs try to determine the right number of PME
>> > ranks, then the gpu_id parameter needs to vary from node-to-node in a way
>> > that I can't predict before running the simulation (since I don't know
>> what
>> > Gromacs will decide is the optimum number of PME ranks). Do I have a
>> > fundamental misunderstanding?
>> >
>> > I understand that ntomp and the environment variable OMP_NUM_THREADS are
>> > exactly interchangeable (is that correct?). I haven't been touching
>> > ntomp_pme or GMX_PME_NUM_THREADS, should I be? You mention 1, 2, 3, and
>> > 6... where do these numbers come from? One and two make sense to me, but
>> I
>> > do not understand what to expect by using three or six (which means I may
>> > not really understand one or two either).
>> >
>> > My greatest concern is the variability between running identical
>> > simulations on identical hardware... It doesn't make a lot of sense to me
>> > to test a lot of different options if my results can vary up to 35%. How
>> > can I tell that a 10% improvement is not just an "upward
>> > variation" on what should really be poorer performance? 97 ns/day is
>> > good, but shouldn't I be able to reproduce 110 ns/day somehow?
>> >
>> > Also, the system has around 206k atoms, rectangular box periodic
>> > boundaries, roughly 17nm by 17nm by 8nm. It's a membrane simulation, so
>> > there's some variability in the Z direction, but things are comparatively
>> > uniform in the XY plane. Regarding the hardware, this link explains it
>> > pretty well (additionally, there's one K20X Tesla GPU per node):
>> > https://www.olcf.ornl.gov/kb_articles/xk7-cpu-description/
>> >
>> > Thank you so much for all of your help!
>> > Mitchell Dorrell
>> >
>> >
>> > On Thu, Apr 30, 2015 at 8:42 AM, Mark Abraham <mark.j.abraham at gmail.com>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> Some clues about the composition of the simulation system and hardware
>> are
>> >> needed to understand the full picture.
>> >>
>> >> On Thu, Apr 30, 2015 at 2:23 PM Mitchell Dorrell <mwd at udel.edu> wrote:
>> >>
>> >> > Hi Carsten, thank you very much for getting back to me. I'm already
>> >> using a
>> >> > couple of the parameters you suggest. These are the options I was
>> using:
>> >> >
>> >> > mdrun_mpi -gpu_id 000000 -npme 256 -dlb yes -pin on -resethway
>> >> -noconfout
>> >> > -v
>> >> >
>> >> > which is being called through:
>> >> >
>> >> > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi <mdrun options as shown above>
>> >> >
>> >> > Did you mean "try switching on dynamic load balancing" or "try
>> switching
>> >> > off dynamic load balancing"?
>> >> >
>> >>
>> >> It'll generally turn itself on by default, so try switching it off.
>> >>
>> >> > I'll try three PME ranks per node instead of two and attach the results
>> >> > later today. I would like to let Gromacs decide, but I can't figure
>> out
>> >> how
>> >> > to do that without getting mismatches between GPU numbers and PP
>> ranks.
>> >> >
>> >>
>> >> There's one GPU per node, so all you have to manage is the length of the
>> >> -gpu_id string of zeroes.
>> >>
>> >>
>> >> > Can you suggest a command line for additional OpenMP threads with
>> fewer
>> >> MPI
>> >> > ranks? I've attempted it myself, and I always seem to suffer major
>> >> > performance losses. I suspect I'm setting something incorrectly. I'll
>> >> run
>> >> > it again with my most recent parameters and reply with those results
>> as
>> >> > well.
>> >> >
>> >>
>> >> This is where hardware details become important. GROMACS's use of OpenMP
>> >> works OK so long as the set of cores that runs the threads is as close
>> >> together as possible. On your AMD hardware, you will get a stepwise
>> >> degradation of performance as the number of OpenMP threads grows,
>> because
>> >> the level of memory cache through which they interact gets further away
>> >> from the processor. Cores that do not even share L3 cache are just not
>> >> worth using as part of the same OpenMP set. (This is less severe on
>> Intel
>> >> processors.) Some background info here
>> >> http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29. So I
>> >> would
>> >> expect that -ntomp 1,2,3,6 are all that are worth trying for you. The
>> >> problem is that the PP (ie GPU-supported) side of things would prefer as
>> >> many OpenMP threads as possible, but the other parts of the code need a
>> >> lot
>> >> more dev work before they can handle that well...
>> >>
>> >> Mark
>> >>
>> >> Thank you for your assistance,
>> >> > Mitchell Dorrell
>> >> > On Apr 30, 2015 1:10 AM, "Kutzner, Carsten" <ckutzne at gwdg.de> wrote:
>> >> >
>> >> > > Hi Mitchell,
>> >> > >
>> >> > > > On 30 Apr 2015, at 03:04, Mitchell Dorrell <mwd at udel.edu> wrote:
>> >> > > >
>> >> > > > Hi all, I just ran the same simulation twice (ignore the
>> difference
>> >> in
>> >> > > > filenames), and got very different results. Obviously, I'd like to
>> >> > > > reproduce the faster simulation. I expect this probably has to do
>> >> with
>> >> > > > automatically-tuned parameters, but I'm not sure what is really
>> >> going
>> >> > on.
>> >> > > >
>> >> > > > Links to the two log files:
>> >> > > > http://pastebin.com/Wq3uAyUv
>> >> > > > http://pastebin.com/sU5Qhs5h
>> >> > > >
>> >> > > > Also, any tips for further boosting the performance?
>> >> > > there are several things you can try out here.
>> >> > >
>> >> > > From the log files I see that in one case the tuning of PME grid
>> >> spacing
>> >> > > versus Coulomb cutoff goes on until after time step 2000. During
>> this
>> >> > > tuning
>> >> > > time, performance varies a lot and can be far from optimal. To get
>> >> > > trustworthy performance numbers at the end of your run, you should
>> >> > > therefore
>> >> > > exclude the first 3000 steps or so from the performance measurement
>> >> using
>> >> > > the -resetstep 3000 command line parameter to mdrun (or,
>> >> alternatively,
>> >> > > -resethway).
>> >> > >
>> >> > > Another issue that is present in both simulations is that your 256
>> PME
>> >> > > nodes
>> >> > > have way more work to do than the PP (direct-space) nodes. So you
>> >> could
>> >> > > try to increase
>> >> > > the fraction of PME nodes. Or try switching off dynamic load
>> >> balancing, so
>> >> > > that the Coulomb cutoff can grow larger (and the PME grid smaller)
>> >> during
>> >> > > PME/PP load balancing. With your system you seem to be near the
>> >> > > parallelization
>> >> > > limit, so it should help to use more than 2 OpenMP threads per MPI
>> >> rank
>> >> > > (thus a smaller number of MPI ranks). Try 4 or more, the reduced
>> >> number
>> >> > > of PME MPI ranks will greatly reduce the PME communication. At the
>> >> same
>> >> > > time, the
>> >> > > PP domains will be larger, which is beneficial for load balancing.
>> >> > >
>> >> > > Carsten
>> >> > >
>> >> > >
>> >> > > >
>> >> > > > Thank you for your help!
>> >> > > > Mitchell Dorrell
>> >> > >
>> >> > > --
>> >> > > Dr. Carsten Kutzner
>> >> > > Max Planck Institute for Biophysical Chemistry
>> >> > > Theoretical and Computational Biophysics
>> >> > > Am Fassberg 11, 37077 Goettingen, Germany
>> >> > > Tel. +49-551-2012313, Fax: +49-551-2012302
>> >> > > http://www.mpibpc.mpg.de/grubmueller/kutzner
>> >> > > http://www.mpibpc.mpg.de/grubmueller/sppexa
>> >> > >
>> >
>> >

