[gmx-users] Continue run in Gromacs-4 with check point file

Mark Abraham Mark.Abraham at anu.edu.au
Fri Mar 20 04:44:14 CET 2009


xuji wrote:
>     Hi all:
>
>     I wrote an e-mail several days ago about continuing a run in GROMACS 4.0 from a checkpoint file, but I still haven't solved the problem.
>     I ran a simulation with
>       mpiexec -machinefile ./mf_24 -np 24 mdrun -v -append -cpt 5 -cpi dppc_md_prev.cpt -cpo dppc_md.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr 
>     on 4 nodes. But when I try to continue the simulation with
>       mpiexec -machinefile ./mf_24 -np 24 mdrun -v -append -cpt 5 -cpi dppc_md.cpt -cpo dppc_md_2.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr 
>     or with
>       mpiexec -machinefile ./mf_24 -np 24 mdrun -v -append -cpt 5 -cpi dppc_md_prev.cpt -cpo dppc_md_2.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr 
>     (there are two checkpoint files in the simulation directory, so I tried both of them),
>     I always get the following errors:
>  
>     Reading checkpoint file dppc_md_prev.cpt generated: Fri Mar 20 08:53:47 2009
>     or
>     Reading checkpoint file dppc_md.cpt generated: Fri Mar 20 08:58:08 2009 
>  
>     Loaded with Money
>     Fatal error in MPI_Bcast:
>     Message truncated, error stack:
>     MPI_Bcast(1145)...................: MPI_Bcast(buf=0x7fffc33242dc, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
>     MPIR_Bcast(229)...................: 
>     MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 12 bytes received but buffer size is 4
>     Fatal error in MPI_Bcast:
>     Message truncated, error stack:
>     MPI_Bcast(1145)...................: MPI_Bcast(buf=0x7fff6c0da09c, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
>     MPIR_Bcast(229)...................: 
>     MPIDI_CH3U_Receive_data_found(254): Message from rank 4 and tag 2 truncated; 12 bytes received but buffer size is 4
>     Fatal error in MPI_Bcast:
>     Message truncated, error stack:
>     MPI_Bcast(1145)...................: MPI_Bcast(buf=0x7fff9ac2ebec, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
>     MPIR_Bcast(229)...................: 
>     MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 12 bytes received but buffer size is 4
>     rank 16 in job 5  Node115_33001   caused collective abort of all ranks
>       exit status of rank 16: killed by signal 9 
>     rank 8 in job 5  Node115_33001   caused collective abort of all ranks
>       exit status of rank 8: killed by signal 9 
>     rank 6 in job 5  Node115_33001   caused collective abort of all ranks
>       exit status of rank 6: killed by signal 9 
>     
>     Can someone help me with this problem? Thanks in advance for any help!

This probably isn't intrinsically related to GROMACS. More likely,
something has changed in the way your MPI is configured between your
earlier and subsequent runs. To help diagnose it, try reducing the
problem to fewer than 24(?) processors, check whether you can still
re-run the earlier calculation, and/or test that other simple MPI
programs work.
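
One way to approach the "fewer processors" step is sketched below. It is
only a dry run: it prints the restart command from the original post for
a few smaller rank counts without executing anything. The helper name
restart_cmd, the variable MDRUN_ARGS, and the chosen rank counts are
hypothetical, and it assumes the checkpoint can be restarted on a
different number of ranks, which should be checked for your setup.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the restart command for several
# rank counts. The file names and flags are copied from the original post;
# restart_cmd and MDRUN_ARGS are hypothetical names used for illustration.

MDRUN_ARGS="-v -append -cpt 5 -cpi dppc_md.cpt -cpo dppc_md_2.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr"

restart_cmd() {
    # $1 = number of MPI ranks to request
    printf 'mpiexec -machinefile ./mf_24 -np %s mdrun %s\n' "$1" "$MDRUN_ARGS"
}

# Assumed rank counts to try, smallest first; adjust to your cluster.
for np in 1 2 4 8; do
    restart_cmd "$np"
done
```

Since -append writes to the existing output files, it seems safer to run
any of the printed commands in a copy of the simulation directory. If a
small rank count works and a larger one fails, that would point toward
the MPI setup rather than the checkpoint itself.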

Mark


