[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
gmx3 at hotmail.com
Mon Sep 29 17:03:42 CEST 2008
You really do not have any error or warning messages at the end of your log
file, stdlog or stderr?
Up to now there has been only one report of problems.
This is on a Cray XT4, where some DLB (dynamic load balancing) jobs (with initially empty cells)
stop at step 10 with the error message that some cell dimensions have become 0.
Unfortunately I cannot reproduce this on an x86_64 Linux machine.
So we will have to do some xt4 debugging.
Can you produce core dump files?
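For reference, a minimal sketch of how core dumps might be enabled in the job script before the aprun line. It assumes the batch shell supports `ulimit -c` and that its resource limits are propagated to the processes started by aprun; both are assumptions that the site documentation would have to confirm.

```shell
# Raise the core file size limit for this shell and its children.
# Assumption: the limit set here reaches the processes started by aprun.
if ulimit -c unlimited 2>/dev/null; then
    core_status="core size limit: $(ulimit -c)"
else
    core_status="could not raise core size limit (hard limit: $(ulimit -H -c))"
fi
echo "$core_status"
```

If the limit cannot be raised, the message shows the hard limit imposed by the system instead.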
> From: st01397 at student.uib.no
> To: gmx-users at gromacs.org
> Date: Mon, 29 Sep 2008 15:21:34 +0200
> Subject: [gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
> I am running some annealing trials on a Cray XT4. And although the
> throughput is impressive, I have severe difficulties with stability of
> the code.
> For my relatively small system of ~7500 atoms the engine typically crashes
> after ~500k steps.
> I am using the bleeding-edge CVS version: mdrun.c (1.141) (the newest
> one after Erik L.'s recent patch of the PME code)
> I configure and compile on the compute nodes exclusively (not the
> frontend) and the only compiler warning(s) I get are of the type:
> "warning: Using 'getpwuid' in statically linked applications requires
> at runtime the shared libraries from the glibc version used for linking"
> After compilation, though, the code executes and runs for ~20 min, producing
> sound data before stalling.
> The error logs are very short and quite uninformative.
> PBS .o:
> Application 159316 exit codes: 137
> Application 159316 exit signals: Killed
> Application 159316 resources: utime 0, stime 0
> Begin PBS Epilogue hexagon.bccs.uib.no
> Date: Mon Sep 29 12:32:54 CEST 2008
> Job ID: 65643.nid00003
> Username: bjornss
> Group: bjornss
> Job Name: pmf_hydanneal_heatup_400K
> Session: 10156
> Limits: walltime=05:00:00
> Queue: batch
> Account: fysisk
> Base login-node: login5
> End PBS Epilogue Mon Sep 29 12:32:54 CEST 2008
> PBS .err:
> _pmii_daemon(SIGCHLD): PE 0 exit signal Killed
> [NID 702]Apid 159316: initiated application termination.
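A note on reading the PBS output above: exit code 137 together with "exit signals: Killed" is consistent with the usual shell convention of 128 plus the signal number, i.e. signal 9 (SIGKILL). That suggests the process was killed from outside rather than aborting on its own; what sent the signal cannot be determined from this log alone. A small sketch of the decoding (the helper name `decode_exit` is only illustrative):

```shell
# Decode a shell-style exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
decode_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with code $status"
    fi
}

decode_exit 137   # prints "killed by signal 9"
```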
> As proper electrostatics is crucial to my modeling, I am using PME, which
> comprises a large part of my calculation cost: 35-50%.
> In the most extreme case, I use the following startup script:
> #PBS -A fysisk
> #PBS -N pmf_hydanneal_heatup_400K
> #PBS -o pmf_hydanneal.o
> #PBS -e pmf.hydanneal.err
> #PBS -l walltime=5:00:00,mppwidth=40,mppnppn=4
> cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K
> source $HOME/gmx_latest_290908/bin/GMXRC
> aprun -n 40 parmdrun -s topol.tpr -maxh 5 -npme 20
> exit $?
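To make the rank layout in the script above explicit: with `-n 40` and `-npme 20`, half of the ranks are dedicated to PME, which matches the upper end of the 35-50% PME cost quoted. A short sketch of the arithmetic, using only the numbers from the script (the variable names are illustrative):

```shell
total_ranks=40     # aprun -n 40 (mppwidth=40)
pme_ranks=20       # -npme 20
ranks_per_node=4   # mppnppn=4

pp_ranks=$((total_ranks - pme_ranks))            # ranks left for particle-particle work
pme_percent=$((100 * pme_ranks / total_ranks))   # share of ranks doing PME
nodes=$((total_ranks / ranks_per_node))          # compute nodes requested

echo "PP ranks: $pp_ranks, PME ranks: $pme_ranks ($pme_percent%), nodes: $nodes"
```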
> Now, apart from a significant reduction in the system dipole moment,
> there are no large changes in the system, nor significant translations
> of the molecules in the box.
> I enclose the md.log and my parameter file. The run-topology (topol.tpr)
> can be found at:
> if anyone wants to try and replicate the crash on their local cluster,
> they are welcome.
> If after such trials are attempted the error persists, I am willing to
> post a bug on bugzilla.
> If more information is needed I will try to provide it upon request
> Regards and thanks for bothering
> Bjørn Steen Saethre
> Theoretical and Energy Physics Unit
> Institute of Physics and Technology
> Allegt, 41
> N-5020 Bergen
> Tel(office) +47 55582869