[gmx-users] 35% Variability in Simulation Performance

Mitchell Dorrell mwd at udel.edu
Thu Apr 30 20:25:20 CEST 2015


I just tried a number of simulations with -notunepme, after having set the
grid size in my mdp file and generating a new tpr. These are the lines I
changed:
rlist                   = 1.336
rcoulomb                = 1.336
rvdw                    = 1.336
and these are the lines I added:
fourierspacing           = 0.134
fourier-nx               = 120
fourier-ny               = 120
fourier-nz               = 60

I never broke 100 ns/day, though. The -notunepme option does, however, seem to
improve consistency from one simulation to the next.
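
For reference, the workflow was roughly this (file names are placeholders, and
grompp may be "gmx grompp" depending on the installed version): regenerate the
tpr from the modified mdp, then rerun the same aprun command as before with
-notunepme added:

grompp -f membrane.mdp -c conf.gro -p topol.top -o membrane.tpr
CRAY_CUDA_PROXY=1 aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000 \
    -npme 256 -dlb yes -pin on -resethway -noconfout -notunepme -v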

How do I reproduce the 110 ns/day run?

Thanks,
Mitchell Dorrell


On Thu, Apr 30, 2015 at 12:28 PM, Mitchell Dorrell <mwd at udel.edu> wrote:

> Hello again. Using three PME ranks per node did better, but still didn't
> reach the 110 ns/day that the "fast" simulation achieved.
> I ran it like this:
> aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 00000 -npme 384 -dlb yes
> -pin on -resethway -noconfout -v
> and got 87.752 ns/day. Here is the full log: http://pastebin.com/mtyrQSXH
>
> I also tried two threads:
> OMP_NUM_THREADS=2 aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -gpu_id 000 -npme
> 128 -dlb yes -pin on -resethway -noconfout -v
> which yielded 79.844 ns/day. Log: http://pastebin.com/vQkT9xY0
>
> I also tried four threads, but with only two ranks per node I can't
> maintain the ratio of one PME rank to every three PP ranks, so I went
> half-and-half instead:
> OMP_NUM_THREADS=4 aprun -n 256 -S 4 -j 1 -d 4 mdrun_mpi -npme 128 -dlb yes
> -pin on -resethway -noconfout -v
> which yielded 89.727 ns/day. Log: http://pastebin.com/Ru0rR3L2
>
> I don't think I can do eight threads per rank, because that implies one
> rank per node, which leaves the PME-only nodes not using their GPUs (and
> mdrun complains about it).
>
> Looking over the logs, I didn't intend for my original commands (in the
> simulations in my first email) to use two OpenMP threads per rank. I
> actually thought I was instructing it *not* to do that by using the "-d"
> parameter to aprun. Were those two threads running on the same core? Or
> were they running on consecutive cores, which would then fight for the
> shared FPU?
>
> Hello Mark, thank you for the information... all the details regarding
> threads and ranks have been making my head spin. I tried running the same
> command (the one that got 110 ns/day) with dlb turned off:
> CRAY_CUDA_PROXY=1 aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi -gpu_id 000000
> -npme 256 -dlb no -pin on -resethway -noconfout -v
> which yielded 97.003 ns/day. Log: http://pastebin.com/73asZA6P
>
> I understand how to manually balance the gpu_id parameter with the npme
> parameter, but if I let Gromacs try to determine the right number of PME
> ranks, then the gpu_id parameter needs to vary from node to node in a way
> that I can't predict before running the simulation (since I don't know what
> Gromacs will decide is the optimum number of PME ranks). Do I have a
> fundamental misunderstanding?
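>
> To spell out what I mean: with 8 ranks per node and -npme 256 (two PME
> ranks per node), the remaining six PP ranks per node all map to the one
> GPU, hence
>
> mdrun_mpi -gpu_id 000000 -npme 256 <other options>
>
> while -npme 384 (three PME ranks per node) leaves five PP ranks, hence
>
> mdrun_mpi -gpu_id 00000 -npme 384 <other options>
>
> If mdrun chose -npme itself, I wouldn't know in advance how many zeroes
> to put in the string.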
>
> I understand that -ntomp and the environment variable OMP_NUM_THREADS are
> exactly interchangeable (is that correct?). I haven't been touching
> ntomp_pme or GMX_PME_NUM_THREADS, should I be? You mention 1, 2, 3, and
> 6... where do these numbers come from? One and two make sense to me, but I
> do not understand what to expect by using three or six (which means I may
> not really understand one or two either).
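>
> As a concrete check of that assumption, I take it these two invocations
> would behave identically (same run as before, just setting the thread
> count differently):
>
> OMP_NUM_THREADS=2 aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi <other options>
> aprun -n 512 -S 2 -j 1 -d 2 mdrun_mpi -ntomp 2 <other options>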
>
> My greatest concern is the variability between running identical
> simulations on identical hardware... It doesn't make a lot of sense to me
> to test a lot of different options if my results can vary up to 35%. How
> can I tell that a 10% improvement is not just an "upward
> variation" on what should actually be poorer performance? 97 ns/day is
> good, but shouldn't I be able to reproduce 110 ns/day somehow?
>
> Also, the system has around 206k atoms in a rectangular periodic box,
> roughly 17 nm by 17 nm by 8 nm. It's a membrane simulation, so
> there's some variability in the Z direction, but things are comparatively
> uniform in the XY plane. Regarding the hardware, this link explains it
> pretty well (additionally, there's one K20X Tesla GPU per node):
> https://www.olcf.ornl.gov/kb_articles/xk7-cpu-description/
>
> Thank you so much for all of your help!
> Mitchell Dorrell
>
>
> On Thu, Apr 30, 2015 at 8:42 AM, Mark Abraham <mark.j.abraham at gmail.com>
> wrote:
>
>> Hi,
>>
>> Some clues about the composition of the simulation system and hardware are
>> needed to understand the full picture.
>>
>> On Thu, Apr 30, 2015 at 2:23 PM Mitchell Dorrell <mwd at udel.edu> wrote:
>>
>> > Hi Carsten, thank you very much for getting back to me. I'm already
>> > using a couple of the parameters you suggest. These are the options I
>> > was using:
>> >
>> > mdrun_mpi -gpu_id 000000 -npme 256 -dlb yes -pin on -resethway -noconfout -v
>> >
>> > which is being called through:
>> >
>> > aprun -n 1024 -S 4 -j 1 -d 1 mdrun_mpi <mdrun options as shown above>
>> >
>> > Did you mean "try switching on dynamic load balancing" or "try switching
>> > off dynamic load balancing"?
>> >
>>
>> It'll generally turn itself on by default, so try switching it off.
>>
>> > I'll try three PME ranks per node instead of two and attach the results
>> > later today. I would like to let Gromacs decide, but I can't figure out
>> > how to do that without getting mismatches between GPU numbers and PP ranks.
>> >
>>
>> There's one GPU per node, so all you have to manage is the length of the
>> -gpu_id string of zeroes.
>>
>>
>> > Can you suggest a command line for additional OpenMP threads with
>> > fewer MPI ranks? I've attempted it myself, and I always seem to suffer
>> > major performance losses. I suspect I'm setting something incorrectly.
>> > I'll run it again with my most recent parameters and reply with those
>> > results as well.
>> >
>>
>> This is where hardware details become important. GROMACS's use of OpenMP
>> works well so long as the set of cores running the threads is as close
>> together as possible. On your AMD hardware, you will get a stepwise
>> degradation of performance as the number of OpenMP threads grows, because
>> the level of memory cache through which they interact gets further away
>> from the processor. Cores that do not even share L3 cache are simply not
>> worth using as part of the same OpenMP set. (This is less severe on Intel
>> processors.) Some background info is here:
>> http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29. So I
>> would expect that -ntomp 1, 2, 3, and 6 are all that are worth trying
>> for you. The problem is that the PP (i.e. GPU-supported) side of things
>> would prefer as many OpenMP threads as possible, but the other parts of
>> the code need a lot more dev work before they can handle that well...
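>>
>> As a sketch of what I mean (the rank counts are placeholders that you
>> would need to adjust so each node's cores stay fully and evenly used):
>>
>> OMP_NUM_THREADS=3 aprun -n <ranks> -d 3 mdrun_mpi -ntomp 3 <other options>
>> OMP_NUM_THREADS=6 aprun -n <ranks> -d 6 mdrun_mpi -ntomp 6 <other options>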
>>
>> Mark
>>
>> > Thank you for your assistance,
>> > Mitchell Dorrell
>> > On Apr 30, 2015 1:10 AM, "Kutzner, Carsten" <ckutzne at gwdg.de> wrote:
>> >
>> > > Hi Mitchell,
>> > >
>> > > > On 30 Apr 2015, at 03:04, Mitchell Dorrell <mwd at udel.edu> wrote:
>> > > >
>> > > > Hi all, I just ran the same simulation twice (ignore the
>> > > > difference in filenames), and got very different results.
>> > > > Obviously, I'd like to reproduce the faster simulation. I expect
>> > > > this probably has to do with automatically-tuned parameters, but
>> > > > I'm not sure what is really going on.
>> > > >
>> > > > Links to the two log files:
>> > > > http://pastebin.com/Wq3uAyUv
>> > > > http://pastebin.com/sU5Qhs5h
>> > > >
>> > > > Also, any tips for further boosting the performance?
>> > >
>> > > There are several things you can try out here.
>> > >
>> > > From the log files I see that in one case the tuning of PME grid
>> > > spacing versus Coulomb cutoff goes on until after time step 2000.
>> > > During this tuning time, performance varies a lot and can be far
>> > > from optimal. To get trustworthy performance numbers at the end of
>> > > your run, you should therefore exclude the first 3000 steps or so
>> > > from the performance measurement using the -resetstep 3000 command
>> > > line parameter to mdrun (or, alternatively, -resethway).
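>> > >
>> > > That is, simply append it to the mdrun options you already pass, e.g.
>> > >
>> > > mdrun_mpi -resetstep 3000 <other options as before>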
>> > >
>> > > Another issue that is present in both simulations is that your 256
>> > > PME nodes have way more work to do than the PP (direct-space) nodes.
>> > > So you could try to increase the fraction of PME nodes. Or try
>> > > switching off dynamic load balancing, so that the Coulomb cutoff can
>> > > grow larger (and the PME grid smaller) during PME/PP load balancing.
>> > > With your system you seem to be near the parallelization limit, so
>> > > it should help to use more than 2 OpenMP threads per MPI rank (thus
>> > > a smaller number of MPI ranks). Try 4 or more; the reduced number of
>> > > PME MPI ranks will greatly reduce the PME communication. At the same
>> > > time, the PP domains will be larger, which is beneficial for load
>> > > balancing.
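>> > >
>> > > As a rough sketch (rank and PME counts are placeholders to adapt to
>> > > your node count):
>> > >
>> > > OMP_NUM_THREADS=4 aprun -n <ranks> -d 4 mdrun_mpi -ntomp 4 \
>> > >     -npme <pme ranks> -dlb no <other options>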
>> > >
>> > > Carsten
>> > >
>> > >
>> > > >
>> > > > Thank you for your help!
>> > > > Mitchell Dorrell
>> > >
>> > > --
>> > > Dr. Carsten Kutzner
>> > > Max Planck Institute for Biophysical Chemistry
>> > > Theoretical and Computational Biophysics
>> > > Am Fassberg 11, 37077 Goettingen, Germany
>> > > Tel. +49-551-2012313, Fax: +49-551-2012302
>> > > http://www.mpibpc.mpg.de/grubmueller/kutzner
>> > > http://www.mpibpc.mpg.de/grubmueller/sppexa
>> > >
>
>

