[gmx-users] MPI at cluster

Taeho Kim taeho.kim at utoronto.ca
Fri May 30 15:10:01 CEST 2003


First of all, thank you for the answers to my earlier questions about a strategy for a 10-node cluster.

I have run some jobs on the 10-node cluster and noticed that the two CPUs on a node show quite different running times. I suppose that since each CPU has a different calculation task, and there is latency between machines, the times can differ.
If so, can I regard the time gap between the CPUs as an indirect indication of scaling (the smaller the gap, the better the performance)? And can it eventually cause a problem such as a job crash?
In my experience the gap grew larger and larger, one CPU became idle (no further output was written), and finally the job crashed. I don't have the data from that run with me, but related data from the currently running job are below.

If such differences could be related to the cause of the crash, how can I avoid it?
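In case it matters, the job was started in the standard LAM/MPI + GROMACS 3.x way, roughly as sketched here (the host names, file names and process counts are placeholders, not the exact ones from this run):

  # LAM boot schema (lamhosts), two CPUs per node:
  #   node01 cpu=2
  #   node02 cpu=2
  #   ...
  lamboot -v lamhosts

  # GROMACS 3.1.x needs the node count both at grompp and at mdrun time
  grompp -np 6 -f md.mdp -c conf.gro -p topol.top -o topol.tpr
  mpirun -np 6 mdrun_mpi -np 6 -s topol.tpr -o traj.trr -e ener.edr -g md.log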

------------
The following job is running right now (GROMACS 3.1.4, LAM/MPI 5.6.8, FFTW 2.1.3, AMD cluster).
% top output from 3 of the nodes
Node 1:
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 2864 yipg2     19  19 26660  25M  2832 R N  99.0  5.1  6282m mdrun_mpi
 2865 yipg2     19  19 22592  21M  2688 S N  78.2  4.3  5339m mdrun_mpi

Node 2:
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 2070 yipg2     19  19 19844  19M  2632 R N  98.3  3.7  6223m mdrun_mpi
 2071 yipg2     19  19 19664  18M  2620 R N  89.4  3.7  6040m mdrun_mpi

Node 3:
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
14192 yipg2     19  19 19420  18M  2620 S N  80.6  3.7  4674m mdrun_mpi
14191 yipg2     19  19 19424  18M  2632 S N  36.3  3.7  3322m mdrun_mpi
---------------
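Reading the TIME column as a rough measure of the work each process has done so far, the gap varies a lot between nodes: on node 2 the slower process has accumulated about 6040/6223 ≈ 97% of the time of the faster one, on node 1 about 5339/6282 ≈ 85%, but on node 3 only about 3322/4674 ≈ 71%.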

Thanks,
Taeho

