[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??

David van der Spoel spoel at xray.bmc.uu.se
Wed Oct 1 14:41:17 CEST 2008


Berk Hess wrote:
> 
> Hi,
> 
> Your PME nodes seem to be running one order of magnitude slower than 
> they should.
> This could be explained by a memory usage problem, which is indicated by 
> the out
> of memory error.
> 
> I am running systems on 8 cores for 24 hours and the memory usage stays 
> constant
> after the first few steps.
> I have no clue what the problem could be.
> I am also looking into a dynamic load balancing problem which only seems 
> to happen on the Cray XT4, and for which I, up till now, also have no 
> clue about the cause.

Actually I have experienced similar things. I'm now running a 4000-atom 
system on 8 cores in a single node, and each core has 425 MB allocated 
after 30 minutes. The resident memory (RSS according to top) is only 12 
MB, which looks much more reasonable. Is there an easy way to test for 
memory leaks?
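
A low-tech first check, short of a proper leak checker, is to log the 
memory of the mdrun processes over time and see whether anything grows 
without bound. A minimal sketch (the process name parmdrun is taken from 
the job script quoted below; adjust the name and interval as needed):

-----------------------------------------------
# Log virtual (VSZ) and resident (RSS) memory of the mdrun processes
# once a minute; steady growth in either suggests a leak.
while true; do
    date
    ps -C parmdrun -o pid,vsz,rss,comm    # VSZ/RSS are reported in kB
    sleep 60
done >> mdrun_mem.log
-----------------------------------------------

For an actual leak check, a short serial run under valgrind 
(valgrind --leak-check=full mdrun -s topol.tpr) on a workstation is 
probably easier than trying to do it through aprun on the XT4.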

> 
> What compiler (and version) are you using?
> 
> Berk
> 
> 
>  > Date: Tue, 30 Sep 2008 18:22:44 +0200
>  > From: st01397 at student.uib.no
>  > To: gmx-users at gromacs.org
>  > Subject: RE: [gmx-users] Possible bug in parallelization, PME or
>  > load-balancing on Gromacs 4.0_rc1 ??
>  > CC: gmx3 at hotmail.com
>  >
>  > I have some (hopefully) clarifying comments on my previous post now:
>  >
>  > First, to answer your question regarding pme.c: my compilation was done
>  > from v. 1.125.
>  > ------------
>  > Line 1037-
>  >     if ((kx>0) || (ky>0)) {
>  >         kzstart = 0;
>  >     } else {
>  >         kzstart = 1;
>  >         p0++;
>  >     }
>  > ------
>  > As you can see, the p0++; line is there.
>  >
>  > Now here are some additional points:
>  >
>  > On Mon, 29 Sep 2008, Bjørn Steen Sæthre wrote:
>  >
>  > > The only error message I can find is the rather cryptic:
>  > >
>  > > NOTE: Turning on dynamic load balancing
>  > >
>  > > _pmii_daemon(SIGCHLD): PE 4 exit signal Killed
>  > > [NID 1412]Apid 159787: initiated application termination
>  > >
>  > > There are no errors apart from that.
>  >
>  > > Furthermore, I can now report that this error occurs in all of my
>  > > simulations using harmonic position restraints, in both GROMACS
>  > > 4.0_beta1 and 4.0_rc1.
>  > >
>  > > About core dumps: I will talk to our HPC staff and hopefully get back
>  > > to you with something more substantial.
>  >
>  > OK, I have gotten some info from our HPC staff. They checked another
>  > job of mine which crashed in exactly the same fashion, with the exact
>  > same starting run topology and node configuration, and they found some
>  > more info in the admin's log:
>  >
>  > > Hi,
>  > > this job got an OOM (out of memory), which is only recorded in the
>  > > system logs, not available directly to users:
>  >
>  > > [2008-09-29 17:18:18][c11-0c0s1n0]Out of memory: Killed process 8888
>  > > (parmdrun).
>  >
>  > I can also add that I have been able to stabilize the run by altering
>  > the cut-offs and lowering the total PME load, at the expense of far
>  > greater computational inefficiency.
>  >
>  > That is, I went from the unstable settings ("<") to the stable ones
>  > (">"), as in the following diff of the mdp files:
>  > -----------------------------
>  > 21c21
>  > < rlist = 0.9
>  > ---
>  > > rlist = 1.0
>  > 24c24
>  > < rcoulomb = 0.9
>  > ---
>  > > rcoulomb = 1.0
>  > 26c26
>  > < rvdw = 0.9
>  > ---
>  > > rvdw = 1.0
>  > 28,30c28,31
>  > < fourier_nx = 60
>  > < fourier_ny = 40
>  > < fourier_nz = 40
>  > ---
>  > > fourier_nx = 48
>  > > fourier_ny = 32
>  > > fourier_nz = 32
>  > 35c36
>  > ------------------------------
>  > That is, the PME workload went from 1/2 of the nodes to 1/3 of them,
>  > since I was using exactly the same startup configuration.
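
As a rough cross-check of that diff: the PME mesh cost scales roughly with 
the number of grid points, so ignoring the N log N factor of the 3D FFT, 
going from 60x40x40 to 48x32x32 cuts the mesh work approximately in half, 
while the larger 1.0 nm cut-offs shift the corresponding work onto the PP 
nodes:

-----------------------------------------------
# Rough estimate only, not the exact PME cost model: ratio of grid
# points between the stable (48x32x32) and unstable (60x40x40) grids.
echo "scale=2; (48*32*32)/(60*40*40)" | bc    # prints .51
-----------------------------------------------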
>  >
>  > However, while this enhanced stability, the output rate slowed down
>  > appreciably. As the log output shows, the reason is clear:
>  > ------------------------------------------------------------
>  > Making 2D domain decomposition 8 x 4 x 1
>  > starting mdrun 'Propane-hydrate prism (2x2x3 UC)'
>  > 2000000 steps, 4000.0 ps.
>  > Step 726095: Run time exceeded 3.960 hours, will terminate the run
>  >
>  > Step 726100: Run time exceeded 3.960 hours, will terminate the run
>  >
>  > Average load imbalance: 26.7 %
>  > Part of the total run time spent waiting due to load imbalance: 1.5 %
>  > Average PME mesh/force load: 9.369
>  > Part of the total run time spent waiting due to PP/PME imbalance: 57.5 %
>  >
>  > NOTE: 57.5 % performance was lost because the PME nodes
>  > had more work to do than the PP nodes.
>  > You might want to increase the number of PME nodes
>  > or increase the cut-off and the grid spacing.
>  >
>  >
>  > Parallel run - timing based on wallclock.
>  >
>  >                NODE (s)   Real (s)      (%)
>  >        Time:   5703.000   5703.000    100.0
>  >                        1h35:03
>  >                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
>  > Performance:     29.593      8.566     60.600      0.396
>  >
>  > gcq#0: Thanx for Using GROMACS - Have a Nice Day
>  > -----------------------------------------------
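
Regarding the NOTE about PP/PME imbalance in that log: rather than 
guessing a new split, one option is to probe a few -npme values with 
short test runs of the same tpr, along these lines (the npme values, the 
0.2 hour limit and the output names are just placeholders):

-----------------------------------------------
# Sketch: short test runs with different PME node counts to find a
# better PP/PME split; -deffnm keeps each run's output files separate.
for npme in 12 16 20 24; do
    aprun -n 48 parmdrun -s topol.tpr -maxh 0.2 -npme $npme -deffnm npme_$npme
done
-----------------------------------------------

The run with the smallest reported time lost to PP/PME imbalance would be 
the one to keep.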
>  >
>  >
>  > One more thing is odd here, though: in the startup script I allocated
>  > 4 hours and set -maxh 4:
>  >
>  > -----------------------------------------------
>  > #PBS -l walltime=4:00:00,mppwidth=48,mppnppn=4
>  > cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K_2nd
>  > source $HOME/gmx_latest_290908/bin/GMXRC
>  > aprun -n 48 parmdrun -s topol.tpr -maxh 4 -npme 16
>  > exit $?
>  > -----------------------
>  >
>  > Why the wallclock inconsistency? The reported wallclock is 1h35:03,
>  > which does not correspond to the note about exceeding 3.960 hours
>  > (3.960 h being 0.99 times the -maxh value of 4 h).
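
One way to narrow that down might be to compare mdrun's own report with 
what the batch system recorded for the job, e.g. on PBS/Torque (the exact 
fields, and whether this works after the job has finished, depend on the 
installation):

-----------------------------------------------
# Compare the scheduler's accounting with mdrun's reported wall time.
# PBS/Torque syntax; for a finished job the same information is usually
# found in the scheduler's accounting logs instead.
qstat -f $PBS_JOBID | grep resources_used.walltime
-----------------------------------------------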
>  >
>  >
>  >
>  > I hope this is helpful in resolving the issue brought up originally.
>  > (Might there be a memory leak somewhere?)
>  >
>  > Regards
>  > Bjørn
>  >
>  >
>  > PhD-student
>  > Institute of Physics & Tech. - University of Bergen
>  > Allegt. 55,
>  > 5007 Bergen
>  > Norway
>  >
>  > Tel(office): +47 55582869
>  > Cell: +47 99253386


-- 
David van der Spoel, Ph.D., Professor of Biology
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se


