[gmx-users] mpich job hangs on exit
David van der Spoel
spoel at xray.bmc.uu.se
Wed Aug 4 11:30:33 CEST 2004
On Wed, 4 Aug 2004, Anton Feenstra wrote:
>I've encountered a problem with multi-node jobs (e.g. 6 CPUs divided over
>3 nodes). The symptom is that the accounting info at the end of the logfiles
>is only written partially. The trr, xtc and edr files are fine, they have
>been closed and the final frame was written, and also confout.gro is present.
>However, all mdrun processes are still using 100% CPU and fail to exit. For
>example, my confout.gro was written yesterday at 10.30, but today at 10.21,
>the mdruns are still 'active'. The solution is pretty simple - kill them,
>but this may point to a deeper problem.
>I've been able to find reports of similar problems on the mailing list, but no
>follow-ups with a solution... For now, the job is still hanging so today I
>will be able to have a look at some specifics of the jobs if necessary.
>Oh, almost forgot. I'm running on a 3GHz dual Xeon cluster, with Gb ethernet
>connect, MPICH 18.104.22.168, Gromacs 3.2.1, Intel cc/fc 7.1 and RedHat Enterprise 3.
>It happens with a 6 CPU job on GroEL/ES (14+7 subunits, plus water = 75k atoms),
>but also with a 450 residue protein in water (40k atoms) at 8 CPUs and above.
Maybe time to make a short test problem...
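One way to get such a short test problem is to reuse the existing run input and only cut the number of steps. The following is a minimal sketch in Python, not part of GROMACS: the helper name shorten_run is hypothetical, and it assumes the .mdp file uses plain "key = value" lines with ";" comments and that the run length is set by the nsteps parameter.

```python
def shorten_run(mdp_text, nsteps=10):
    """Return a copy of an .mdp file's text with nsteps set to a small value.

    Hypothetical helper for building a quick test case; every other
    setting is passed through unchanged.
    """
    out = []
    found = False
    for line in mdp_text.splitlines():
        # Split off a trailing ";" comment so it can be preserved.
        setting, sep, comment = line.partition(";")
        key = setting.split("=")[0].strip().lower()
        if key == "nsteps" and "=" in setting:
            line = f"nsteps = {nsteps}"
            if sep:
                line += f" ;{comment}"
            found = True
        out.append(line)
    if not found:
        # No nsteps line present: append one at the end.
        out.append(f"nsteps = {nsteps}")
    return "\n".join(out) + "\n"
```

The shortened text can be written to a new .mdp file and preprocessed and run in the usual way on the same number of nodes, to see whether the hang at exit still occurs.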
It could also be due to a combination of MPI and queueing system, but if
you can reproduce it in a ten step simulation you can try running with
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596, 75124 Uppsala, Sweden
phone: 46 18 471 4205 fax: 46 18 511 755
spoel at xray.bmc.uu.se spoel at gromacs.org http://xray.bmc.uu.se/~spoel