[gmx-users] mpich job hangs on exit
Anton Feenstra
feenstra at chem.vu.nl
Wed Aug 4 15:34:58 CEST 2004
Hi David (and perhaps others),
A bit more on this, since I'd like to avoid low-level debugging in mdrun
if I can. This is again the same 8-CPU mpich mdrun job, with -debug turned on.
This is what I get:
#> tail -1 md?.log gromacs_8.e1325
==> md0.log <==
Finished mdrun on node 0 Wed Aug 4 14:16:40 2004
==> md1.log <==
Finished mdrun on node 1 Wed Aug 4 14:16:40 2004
==> md2.log <==
Finished mdrun on node 2 Wed Aug 4 14:16:41 2004
==> md3.log <==
Finished mdrun on node 3 Wed Aug 4 14:16:41 2004
==> md4.log <==
Finished mdrun on node 4 Wed Aug 4 14:16:40 2004
==> md5.log <==
Finished mdrun on node 5 Wed Aug 4 14:16:40 2004
==> md6.log <==
Finished mdrun on node 6 Wed Aug 4 14:16:40 2004
==> md7.log <==
Finished mdrun on node 7 Wed Aug 4 14:16:40 2004
==> gromacs_8.e1325 <==
Performance: 13.133 439.943 12.000 83.333
So they have all finished writing the regular log files and stderr (gromacs_8.e1325).
But checking with 'fuser' on the nodes shows that the files are still open. A few
minutes later, the p4 error (timeout) and mpich's broken pipe appear again
for CPUs #6 and #7, and the mdruns on that node die. The others hang, like this:
#> rsh node20 ps -flu $USER
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S feenstra 21287 21286 0 85 0 - 424 wait4 14:16 ? 00:00:00 /usr/local/Cluster-Apps/sge/utilbin/glinux/qrsh_starter /usr/local/Cluster-Apps/sg
0 S feenstra 21311 21287 0 85 0 - 1103 rt_sig 14:16 ? 00:00:00 tcsh -c /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130 \-p4amslave
0 R feenstra 21333 21311 98 99 19 - 24130 - 14:16 ? 00:12:29 /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130 4amslave -p4yourna
1 Z feenstra 21334 21333 0 76 0 - 0 do_exi 14:16 ? 00:00:00 [mdrun <defunct>]
4 S feenstra 21337 21336 0 85 0 - 421 wait4 14:16 ? 00:00:00 /usr/local/Cluster-Apps/sge/utilbin/glinux/qrsh_starter /usr/local/Cluster-Apps/sg
0 S feenstra 21361 21337 0 85 0 - 1104 rt_sig 14:16 ? 00:00:00 tcsh -c /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130 \-p4amslave
0 S feenstra 21383 21361 99 99 19 - 24201 - 14:16 ? 00:12:32 /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130 4amslave -p4yourna
1 Z feenstra 21384 21383 0 75 0 - 0 do_exi 14:16 ? 00:00:00 [mdrun <defunct>]
4 S feenstra 21552 21551 0 85 0 - 1102 rt_sig 14:28 ? 00:00:00 tcsh -c ps -flu feenstra
0 R feenstra 21576 21552 0 85 0 - 795 - 14:28 ? 00:00:00 ps -flu feenstra
Note the two <defunct> mdruns. Any more hints on this?
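For picking the zombies out of a longer process listing, a minimal sketch follows. It assumes a procps-style ps that accepts -o with the pid, ppid, stat and comm columns; list_zombies is a hypothetical helper name, not part of GROMACS or mpich.

```shell
# list_zombies (hypothetical helper): read "PID PPID STAT COMMAND" lines
# from stdin, skip the header row, and print "PID PPID" for every process
# whose state starts with Z (zombie / defunct).
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2 }'
}

# Intended use on a node (assumes procps ps):
#   ps -u "$USER" -o pid,ppid,stat,comm | list_zombies
```

The second column is the parent that has not yet reaped the child. Going by the listing above, on node20 this would be expected to report 21334 with parent 21333 and 21384 with parent 21383, i.e. the two running mdrun processes.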
--
Regards,
Anton
_____________ _______________________________________________________
| | |
| _ _ ___,| K. Anton Feenstra |
| / \ / \'| | | Dept. of Pharmacochem. - Vrije Universiteit Amsterdam |
|( | )| | | De Boelelaan 1083 - 1081 HV Amsterdam - Netherlands |
| \_/ \_/ | | | Tel: +31 20 44 47608 - Fax: +31 20 44 47610 |
| | Feenstra at chem.vu.nl - www.chem.vu.nl/~feenstra/ |
| | "If You See Me Getting High, Knock Me Down" |
| | (Red Hot Chili Peppers) |
|_____________|_______________________________________________________|