[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
Berk Hess
gmx3 at hotmail.com
Wed Oct 1 11:07:26 CEST 2008
Hi,
Your PME nodes seem to be running one order of magnitude slower than they should.
This could be explained by a memory usage problem, which is indicated by the out
of memory error.
I am running systems on 8 cores for 24 hours and the memory usage stays constant
after the first few steps.
I have no clue what the problem could be.
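If you can get shell access to a node while the job runs, it might be worth watching
whether mdrun's memory use actually grows over time. A rough sketch (the process name
parmdrun is taken from the admin's log quoted below; adjust it to your binary):
-----------------------------------------------
# Sample the resident and virtual memory of the running mdrun processes
# once a minute; RSS and VSZ are reported in kB.
while true; do
    date
    ps -C parmdrun -o pid=,rss=,vsz=,comm=
    sleep 60
done
-----------------------------------------------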
I am also looking into a dynamic load balancing problem which only seems to happen
on the Cray XT4, and for which, up to now, I also have no clue about the cause.
What compiler (and version) are you using?
Berk
> Date: Tue, 30 Sep 2008 18:22:44 +0200
> From: st01397 at student.uib.no
> To: gmx-users at gromacs.org
> Subject: RE: [gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
> CC: gmx3 at hotmail.com
>
> I have some (hopefully) clarifying comments to my previous post now:
>
> First, to answer your question regarding pme.c: my compilation was done
> from v. 1.125.
> ------------
> Line 1037-
>     if ((kx>0) || (ky>0)) {
>         kzstart = 0;
>     } else {
>         kzstart = 1;
>         p0++;
>     }
> ------
> As you can see the p0++; line is there.
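> (In case it helps anyone re-checking this, the snippet can be pulled straight out of
> the tree being compiled; the src/mdlib/ path below is assumed from the usual 4.0
> source layout:)
> ------------
> # Show the region around line 1037 of the pme.c that gets compiled,
> # and confirm the p0++ fix is present.
> sed -n '1030,1045p' src/mdlib/pme.c
> grep -n 'p0++' src/mdlib/pme.c
> ------------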
>
> Now here are some additional points:
>
> On Mon, 29 Sep 2008, Bjørn Steen Sæthre wrote:
>
> > The only error message I can find is the rather cryptic:
> >
> > NOTE: Turning on dynamic load balancing
> >
> > _pmii_daemon(SIGCHLD): PE 4 exit signal Killed
> > [NID 1412]Apid 159787: initiated application termination
> >
> > There are no errors apart from that.
>
> > Furthermore, I can now report that this error is endemic to all my simulations
> > using harmonic position restraints in GROMACS 4.0_beta1 and GMX
> > 4.0_rc1.
> >
> > About core dumps: I will talk to our HPC staff and get back to you with
> > something more substantial, I hope.
>
> OK, I have gotten some info from our HPC staff; they checked another job of
> mine which crashed in exactly the same fashion, with exactly the same starting
> run topology and node configuration.
> They found some more info in the admin's log:
>
> > Hi,
> > this job got an OOM (out of memory), which is only recorded in the
> > system logs, not available directly to users:
>
> > [2008-09-29 17:18:18][c11-0c0s1n0]Out of memory: Killed process 8888
> > (parmdrun).
>
> I can also add that I have been able to stabilize the runs by altering the
> cut-offs and lowering the total PME load, at the expense of considerably
> lower computational efficiency.
>
> That is, I went from unstable (<) to stable (>), as in the following diff of
> the mdp files:
> -----------------------------
> 21c21
> < rlist = 0.9
> ---
> > rlist = 1.0
> 24c24
> < rcoulomb = 0.9
> ---
> > rcoulomb = 1.0
> 26c26
> < rvdw = 0.9
> ---
> > rvdw = 1.0
> 28,30c28,31
> < fourier_nx = 60
> < fourier_ny = 40
> < fourier_nz = 40
> ---
> > fourier_nx = 48
> > fourier_ny = 32
> > fourier_nz = 32
> 35c36
> ------------------------------
> That is, the PME workload went from 1/2 of the nodes to 1/3 of them, since I
> was using exactly the same startup configuration.
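>
> (As a rough sanity check on how much mesh work that change removes, comparing the
> two grids:)
> ------------------------------
> # Back-of-the-envelope comparison of the two PME grids; this only counts
> # grid points (the FFT cost also carries a log factor), so it is indicative.
> echo "old grid: $((60*40*40)) points"   # 96000
> echo "new grid: $((48*32*32)) points"   # 49152
> # About half the mesh points, while the 0.9 -> 1.0 nm cut-offs move more
> # of the electrostatics into the real-space (PP) part.
> ------------------------------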
>
> This, however, while enhancing stability, slowed the output rate down
> appreciably. As shown in the log output, the reason is clear:
> ------------------------------------------------------------
> Making 2D domain decomposition 8 x 4 x 1
> starting mdrun 'Propane-hydrate prism (2x2x3 UC)'
> 2000000 steps, 4000.0 ps.
> Step 726095: Run time exceeded 3.960 hours, will terminate the run
>
> Step 726100: Run time exceeded 3.960 hours, will terminate the run
>
> Average load imbalance: 26.7 %
> Part of the total run time spent waiting due to load imbalance: 1.5 %
> Average PME mesh/force load: 9.369
> Part of the total run time spent waiting due to PP/PME imbalance: 57.5 %
>
> NOTE: 57.5 % performance was lost because the PME nodes
> had more work to do than the PP nodes.
> You might want to increase the number of PME nodes
> or increase the cut-off and the grid spacing.
>
>
> Parallel run - timing based on wallclock.
>
>                NODE (s)   Real (s)      (%)
>        Time:   5703.000   5703.000    100.0
>                        1h35:03
>                (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
> Performance:     29.593      8.566     60.600       0.396
>
> gcq#0: Thanx for Using GROMACS - Have a Nice Day
> -----------------------------------------------
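>
> (For comparing runs side by side, the relevant accounting lines can be pulled out
> of the log files like this, assuming the default md.log file names:)
> -----------------------------------------------
> # Collect the load-balance accounting from one or more mdrun log files.
> grep -E "Average (load imbalance|PME mesh/force load)|performance was lost" md*.log
> -----------------------------------------------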
>
>
> One more thing is odd here, though.
> In the startup script I allocated 4 hours and set -maxh 4:
>
> -----------------------------------------------
> #PBS -l walltime=4:00:00,mppwidth=48,mppnppn=4
> cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K_2nd
> source $HOME/gmx_latest_290908/bin/GMXRC
> aprun -n 48 parmdrun -s topol.tpr -maxh 4 -npme 16
> exit $?
> -----------------------
>
> Why the wallclock inconsistency? (I.e. the wallclock time is 1:35:03, which does
> not correspond to the note that the run time exceeded 3.960 hours.)
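>
> (I realise the 3.960 figure itself is just -maxh behaving as documented, stopping
> at 99 % of the requested time; it is the short wallclock that does not add up:)
> -----------------------
> # mdrun -maxh N is documented to stop at 0.99*N hours, which matches the
> # "3.960 hours" in the note above; the 1h35:03 of wallclock is what is odd.
> awk 'BEGIN { printf "%.3f hours\n", 0.99 * 4 }'   # prints 3.960 hours
> -----------------------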
>
>
>
> I hope this is helpful in resolving the issue brought up originally. (Might
> there be a memory leak somewhere?)
>
> Regards
> Bjørn
>
>
> PhD-student
> Institute of Physics & Technology, University of Bergen
> Allegt. 55,
> 5007 Bergen
> Norway
>
> Tel(office): +47 55582869
> Cell: +47 99253386