[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??

Berk Hess gmx3 at hotmail.com
Wed Oct 1 13:25:10 CEST 2008


Hi,

The Cray XT4 has a torus network, but you do not get access to it as a torus.
The processors you are assigned can be anywhere in the machine; they are
practically never arranged in a nice cube, since some are always missing.
Therefore software such as Gromacs cannot make use of proper Cartesian
(torus) communication, as it can on, for instance, a Blue Gene.
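For what it is worth, here is a minimal MPI sketch (not Gromacs code, just an
illustration I am writing down here) of the kind of Cartesian communicator setup
that a real torus allocation would let you exploit; on the XT4 the ranks you get
do not map onto such a grid, so this kind of mapping buys you nothing.
------------------------------------------------
/* cart_sketch.c: set up a periodic 3D Cartesian communicator and report
   each rank's neighbours along the x dimension.
   Compile with e.g. the Cray wrapper: cc cart_sketch.c -o cart_sketch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int nranks, rank, coords[3], xminus, xplus;
    int dims[3]    = {0, 0, 0};   /* 0 = let MPI choose this dimension */
    int periods[3] = {1, 1, 1};   /* periodic in all three dimensions  */
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Let MPI pick a balanced 3D factorization of the rank count */
    MPI_Dims_create(nranks, 3, dims);
    /* reorder=1 allows the library to remap ranks onto the network,
       which only pays off if the allocation really is a torus block */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);
    MPI_Cart_shift(cart, 0, 1, &xminus, &xplus);

    printf("rank %d at (%d,%d,%d): x-neighbours %d and %d\n",
           rank, coords[0], coords[1], coords[2], xminus, xplus);

    MPI_Finalize();
    return 0;
}
------------------------------------------------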

I have no clue about the wallclock issue.
Can you find out whether the run actually took 1:35 or 4 hours?
The start time is printed near the beginning of the log file.

Berk

> Date: Wed, 1 Oct 2008 12:27:06 +0200
> From: Bjorn.Sathre at student.uib.no
> To: gmx3 at hotmail.com
> Subject: RE: [gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
> 
> 
> 
> On Wed, 1 Oct 2008, Berk Hess wrote:
> 
> > Hi,
> >
> > Your PME nodes seem to be running one order of magnitude slower than they should.
> > This could be explained by a memory usage problem, which is indicated by the out
> > of memory error.
> >
> > I am running systems on 8 cores for 24 hours and the memory usage stays constant
> > after the first few steps.
> > I have no clue what the problem could be.
> > I am also looking into a dynamic load balancing problem which only seems to happen
> > on the Cray XT4, and so far I also have no clue what could cause it.
> >
> > What compiler (and version) are you using?
> 
> I am using gcc-4.2.0.quadcore (the first gcc optimized for the newest AMD 
> Opteron (Barcelona) processors). This is the same compiler as employed on the 
> "Louhi" system at CSC Finland. See:
> 
> http://developer.amd.com/cpu/gnu/Pages/default.aspx
> http://www.csc.fi/english/pages/louhi_guide/program_development/compilers/gcc/index_html
> 
> We recently got an update to our MPI library, and I am now using 
> Cray's MPT 3.0.3 MPI library (an MPICH2 adaptation).
> 
> Do you have any comments on the wallclock issue I brought up in the 
> previous post??
> 
> One more thing:
> My current run has 7584 atoms, and the greatest common divisor of the numbers of 
> Fourier grid points along the three directions is 16, so it should run stably on 
> 48 CPUs (12 nodes on our Cray XT4).
> I understand we have a 3D torus network on the machine.
> Why is the default domain decomposition the 2D form 4x8x1 PP nodes, instead of 
> the 3D form 4x4x2?
> Does it perhaps have to do with having 4 CPUs per node?
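> 
> To spell out the arithmetic behind the question:
> ------------------------------
> 48 ranks total, -npme 16   =>   48 - 16 = 32 PP ranks
> 32 = 4 x 8 x 1   and   32 = 4 x 4 x 2   (both are valid DD grids for the PP ranks)
> ------------------------------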
> 
> Thanks for your efforts in clearing up the Cray XT4 issues.
> Bjørn
> 
> > Berk
> >
> >
> >> Date: Tue, 30 Sep 2008 18:22:44 +0200
> >> From: st01397 at student.uib.no
> >> To: gmx-users at gromacs.org
> >> Subject: RE: [gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
> >> CC: gmx3 at hotmail.com
> >>
> >> I now have some (hopefully) clarifying comments on my previous post:
> >>
> >> First, to answer your question regarding pme.c: my compilation was done
> >> from v. 1.125
> >> ------------
> >> Line 1037-
> >>             if ((kx>0) || (ky>0)) {
> >>                 kzstart = 0;
> >>             } else {
> >>                 kzstart = 1;
> >>                 p0++;
> >>             }
> >> ------
> >> As you can see, the p0++; line is there.
> >>
> >> Now here are some additional points:
> >>
> >> On Mon, 29 Sep 2008, Bjørn Steen Sæthre wrote:
> >>
> >>> The only error message I can find is the rather cryptic:
> >>>
> >>> NOTE: Turning on dynamic load balancing
> >>>
> >>> _pmii_daemon(SIGCHLD): PE 4 exit signal Killed
> >>> [NID 1412]Apid 159787: initiated application termination
> >>>
> >>> There are no errors apart from that.
> >>
> >>> Furthermore, I can now report that this error is endemic to all my simulations
> >>> using harmonic position restraints, in both GROMACS 4.0_beta1 and GMX
> >>> 4.0_rc1.
> >>>
> >>> About core dumps: I will talk to our HPC staff and hope to get back to you
> >>> with something more substantial.
> >>
> >> OK, I have gotten some info from our HPC staff. They checked another job of
> >> mine which crashed in exactly the same fashion, with exactly the same starting
> >> run topology and node configuration, and found some more info in the admin's log:
> >>
> >>> Hi,
> >>> this job got an OOM (out of memory), which is only recorded in the
> >>> system logs, not available directly to users:
> >>
> >>> [2008-09-29 17:18:18][c11-0c0s1n0]Out of memory: Killed process 8888
> >>> (parmdrun).
> >>
> >> I can also add that I have been able to stabilize the run by altering the
> >> cut-offs and lowering the total PME load, at the expense of far greater
> >> computational inefficiency.
> >>
> >> That is, I went from the unstable settings (marked <) to the stable settings
> >> (marked >), as shown in the following diff of the .mdp files:
> >> -----------------------------
> >> 21c21
> >> < rlist                    = 0.9
> >> ---
> >>> rlist                    = 1.0
> >> 24c24
> >> < rcoulomb                 = 0.9
> >> ---
> >>> rcoulomb                 = 1.0
> >> 26c26
> >> < rvdw                     = 0.9
> >> ---
> >>> rvdw                     = 1.0
> >> 28,30c28,31
> >> < fourier_nx             = 60
> >> < fourier_ny             = 40
> >> < fourier_nz             = 40
> >> ---
> >>> fourier_nx             = 48
> >>> fourier_ny             = 32
> >>> fourier_nz             = 32
> >> 35c36
> >> ------------------------------
> >> That is, the PME workload estimate went from 1/2 to 1/3 of the total, since I
> >> was using exactly the same startup configuration.
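> >>
> >> Roughly, that shift also makes sense from the settings themselves (my own
> >> back-of-the-envelope estimate, not numbers from grompp): the PME grid shrank
> >> from 60 x 40 x 40 = 96000 points to 48 x 32 x 32 = 49152 points, i.e. to about
> >> half, while the direct-space (PP) work grew by roughly (1.0/0.9)^3 = 1.37 with
> >> the larger cut-offs, which together push the PME fraction of the total work
> >> down from about 1/2 towards 1/3.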
> >>
> >> This, however, while enhancing stability, slowed the output rate down
> >> appreciably. As the log output shows, the reason is clear (see also my reading
> >> of the numbers right after the excerpt):
> >> ------------------------------------------------------------
> >> Making 2D domain decomposition 8 x 4 x 1
> >> starting mdrun 'Propane-hydrate prism (2x2x3 UC)'
> >> 2000000 steps,   4000.0 ps.
> >> Step 726095: Run time exceeded 3.960 hours, will terminate the run
> >>
> >> Step 726100: Run time exceeded 3.960 hours, will terminate the run
> >>
> >>   Average load imbalance: 26.7 %
> >>   Part of the total run time spent waiting due to load imbalance: 1.5 %
> >>   Average PME mesh/force load: 9.369
> >>   Part of the total run time spent waiting due to PP/PME imbalance: 57.5 %
> >>
> >> NOTE: 57.5 % performance was lost because the PME nodes
> >>        had more work to do than the PP nodes.
> >>        You might want to increase the number of PME nodes
> >>        or increase the cut-off and the grid spacing.
> >>
> >>
> >>          Parallel run - timing based on wallclock.
> >>
> >>                 NODE (s)   Real (s)      (%)
> >>         Time:   5703.000   5703.000    100.0
> >>                         1h35:03
> >>                 (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> >> Performance:     29.593      8.566     60.600      0.396
> >>
> >> gcq#0: Thanx for Using GROMACS - Have a Nice Day
> >> -----------------------------------------------
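> >>
> >> (If I read the "Average PME mesh/force load: 9.369" line right, the PME ranks
> >> spent roughly 9.4 times as long per step on the mesh part as the PP ranks spent
> >> on the short-range forces, so the 16 PME ranks rather than the 32 PP ranks set
> >> the pace and the PP side sits idle, which is where the 57.5 % goes.)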
> >>
> >>
> >> One more thing is odd here, though.
> >> In the startup script I allocated 4 hours and set -maxh 4:
> >>
> >> -----------------------------------------------
> >> #PBS -l walltime=4:00:00,mppwidth=48,mppnppn=4
> >> cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K_2nd
> >> source $HOME/gmx_latest_290908/bin/GMXRC
> >> aprun -n 48 parmdrun -s topol.tpr -maxh 4 -npme 16
> >> exit $?
> >> -----------------------
> >>
> >> Why the wallclock inconsistency? (i.e. the reported wallclock time is 1:35:03,
> >> which does not correspond to the note that the 3.960-hour run time was exceeded.)
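> >>
> >> For what it is worth, the 3.960 hours itself is just -maxh at work: mdrun
> >> terminates once 99% of the -maxh limit is reached, and 0.99 x 4 h = 3.96 h.
> >> But the reported wallclock, 5703 s = 1h35:03, is only about 1.58 h, nowhere
> >> near 3.96 h, so the two numbers really do not add up.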
> >>
> >>
> >>
> >> I hope this is helpful in resolving the issue brought up originally. (Might
> >> there be a memory leak somewhere?)
> >>
> >> Regards
> >> Bjørn
> >>
> >>
> >> PhD-student
> >> Institute of Physics & Technology, University of Bergen
> >> Allegt. 55,
> >> 5007 Bergen
> >> Norway
> >>
> >> Tel(office): +47 55582869
> >> Cell:        +47 99253386
> >

