[gmx-users] mpich job hangs on exit

Anton Feenstra feenstra at chem.vu.nl
Wed Aug 4 15:34:58 CEST 2004


Hi David (and perhaps others),


A bit more on this, since I'd like to avoid low-level debugging in mdrun
if I can. This is again the same 8-CPU mpich md job, with -debug turned on.
This is what I get:
#> tail -1 md?.log gromacs_8.e1325
==> md0.log <==
Finished mdrun on node 0 Wed Aug  4 14:16:40 2004
==> md1.log <==
Finished mdrun on node 1 Wed Aug  4 14:16:40 2004
==> md2.log <==
Finished mdrun on node 2 Wed Aug  4 14:16:41 2004
==> md3.log <==
Finished mdrun on node 3 Wed Aug  4 14:16:41 2004
==> md4.log <==
Finished mdrun on node 4 Wed Aug  4 14:16:40 2004
==> md5.log <==
Finished mdrun on node 5 Wed Aug  4 14:16:40 2004
==> md6.log <==
Finished mdrun on node 6 Wed Aug  4 14:16:40 2004
==> md7.log <==
Finished mdrun on node 7 Wed Aug  4 14:16:40 2004
==> gromacs_8.e1325 <==
Performance:     13.133    439.943     12.000     83.333

So they have all finished writing the regular log files and stderr (gromacs_8.e1325).
But looking with 'fuser' on the nodes, the files are still open. A few
minutes later, the p4 error (timeout) and mpich's broken pipe come again
for CPUs #6 and #7, and the mdruns on that node die. The others are hanging, like:
#> rsh node20 ps -flu $USER
F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
4 S feenstra 21287 21286  0  85   0    -   424 wait4  14:16 ?        00:00:00 /usr/local/Cluster-Apps/sge/utilbin/glinux/qrsh_starter /usr/local/Cluster-Apps/sg
0 S feenstra 21311 21287  0  85   0    -  1103 rt_sig 14:16 ?        00:00:00 tcsh -c /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130 \-p4amslave
0 R feenstra 21333 21311 98  99  19    - 24130 -      14:16 ?        00:12:29 /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130   4amslave -p4yourna
1 Z feenstra 21334 21333  0  76   0    -     0 do_exi 14:16 ?        00:00:00 [mdrun <defunct>]
4 S feenstra 21337 21336  0  85   0    -   421 wait4  14:16 ?        00:00:00 /usr/local/Cluster-Apps/sge/utilbin/glinux/qrsh_starter /usr/local/Cluster-Apps/sg
0 S feenstra 21361 21337  0  85   0    -  1104 rt_sig 14:16 ?        00:00:00 tcsh -c /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130 \-p4amslave
0 S feenstra 21383 21361 99  99  19    - 24201 -      14:16 ?        00:12:32 /home/sgifar/feenstra/test_hang/test_debug/mdrun node11 47130   4amslave -p4yourna
1 Z feenstra 21384 21383  0  75   0    -     0 do_exi 14:16 ?        00:00:00 [mdrun <defunct>]
4 S feenstra 21552 21551  0  85   0    -  1102 rt_sig 14:28 ?        00:00:00 tcsh -c ps -flu feenstra
0 R feenstra 21576 21552  0  85   0    -   795 -      14:28 ?        00:00:00 ps -flu feenstra

Note the two <defunct> mdruns. Any more hints on this?
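
To pick the defunct processes and their parents out of such a listing on
each node, a rough sketch like the one below might help. It assumes the
'ps -fl' column layout shown above (state in column 2, PID in column 4,
PPID in column 5); the function name list_zombies is made up for
illustration and is not part of any tool mentioned here.

```shell
#!/bin/sh
# Hypothetical helper (not part of GROMACS, mpich or SGE).
# Reads 'ps -fl' style output on stdin and prints "PID PPID" for every
# process whose state column is Z (zombie, shown as <defunct>).
# Assumes the column order: F S UID PID PPID ...
list_zombies() {
    awk '$2 == "Z" { print $4, $5 }'
}

# Possible use on one node (node name taken from the listing above):
#   rsh node20 ps -flu $USER | list_zombies
```

A zombie stays in the process table until its parent collects its exit
status with wait(), so the PPID printed in the second column shows which
of the still-running mdruns has not reaped its child.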




-- 
Regards,

Anton
  _____________ _______________________________________________________
|             |                                                       |
|  _   _  ___,| K. Anton Feenstra                                     |
| / \ / \'| | | Dept. of Pharmacochem. - Vrije Universiteit Amsterdam |
|(   |   )| | | De Boelelaan 1083 - 1081 HV Amsterdam - Netherlands   |
| \_/ \_/ | | | Tel: +31 20 44 47608 - Fax: +31 20 44 47610           |
|             | Feenstra at chem.vu.nl - www.chem.vu.nl/~feenstra/       |
|             | "If You See Me Getting High, Knock Me Down"           |
|             | (Red Hot Chili Peppers)                               |
|_____________|_______________________________________________________|


