[gmx-users] mpich job hangs on exit

David van der Spoel spoel at xray.bmc.uu.se
Wed Aug 4 11:30:33 CEST 2004


On Wed, 4 Aug 2004, Anton Feenstra wrote:

>Hi All,
>
>
>I've encountered a problem with multi-node jobs (e.g. 6 CPUs divided over
>3 nodes). The symptom is that the accounting info at the end of the logfiles
>is only written partially. The trr, xtc and edr files are fine, they have
>been closed and the final frame was written, and also confout.gro is present.
>However, all mdrun processes are still using 100% CPU and fail to exit. For
>example, my confout.gro was written yesterday at 10.30, but today at 10.21,
>the mdruns are still 'active'. The solution is pretty simple - kill them,
>but this may point to a deeper problem.
>
>I've been able to find reports of similar problems on the mailing list, but no
>followups with a solution... For now, the job is still hanging so today I
>will be able to have a look at some specifics of the jobs if necessary.
>
>Oh, almost forgot. I'm running on a 3GHz dual Xeon cluster, with Gb ethernet
>connect, MPICH 1.2.5.2, Gromacs 3.2.1, Intel cc/fc 7.1 and RedHat Enterprise 3.
>
>It happens with a 6 CPU job on GroEL/ES (14+7 subunits, plus water = 75k atoms),
>but also with a 450-residue protein in water (40k atoms) at 8 CPUs and above.


Maybe time to make a short test problem...
It could also be due to a combination of MPI and the queueing system, but
if you can reproduce it in a ten-step simulation, you can try running with
-debug.
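
A minimal sketch of preparing such a ten-step test input, assuming the run
length is controlled by the nsteps entry of the .mdp file. The helper name
shorten_run is hypothetical and not part of GROMACS; it only rewrites the
text of the parameter file, which would then still have to go through
grompp as usual.

```python
import re


def shorten_run(mdp_text: str, nsteps: int = 10) -> str:
    """Return a copy of an .mdp file's text with nsteps set to `nsteps`.

    Hypothetical helper for building a short reproduction case.
    Commented-out entries (lines starting with ';') are left untouched.
    """
    # Match an active "nsteps = ..." line; [ \t]* avoids swallowing blank lines.
    pattern = re.compile(r"^[ \t]*nsteps[ \t]*=.*$", re.MULTILINE)
    new_line = f"nsteps = {nsteps}"
    if pattern.search(mdp_text):
        # Replace only the first active nsteps entry.
        return pattern.sub(new_line, mdp_text, count=1)
    # No active entry found: append one, adding a newline separator if needed.
    sep = "" if mdp_text.endswith("\n") or not mdp_text else "\n"
    return mdp_text + sep + new_line + "\n"
```

If the hang still shows up with the shortened input, the debug output of
that short run is much easier to inspect than that of a full-length job.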

-- 
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,  	75124 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://xray.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



