[gmx-users] Continue run in Gromacs-4 with check point file
Mark Abraham
Mark.Abraham at anu.edu.au
Fri Mar 20 04:44:14 CET 2009
xuji wrote:
> Hi all:
>
> I wrote an e-mail many days ago about continuing a run in GROMACS 4.0 from a checkpoint file, but I still haven't been able to solve the problem.
> I ran a simulation with
> mpiexec -machinefile ./mf_24 -np 24 mdrun -v -append -cpt 5 -cpi dppc_md_prev.cpt -cpo dppc_md.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr
> on 4 nodes. But when I try to continue the simulation with
> mpiexec -machinefile ./mf_24 -np 24 mdrun -v -append -cpt 5 -cpi dppc_md.cpt -cpo dppc_md_2.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr
> or with
> mpiexec -machinefile ./mf_24 -np 24 mdrun -v -append -cpt 5 -cpi dppc_md_prev.cpt -cpo dppc_md_2.cpt -s dppc_md.tpr -o dppc_md.trr -c dppc_md.gro -g dppc_md.log -e dppc_md.edr
> (there are two checkpoint files in the simulation directory, so I tried both),
> I always get the following errors:
>
> Reading checkpoint file dppc_md_prev.cpt generated: Fri Mar 20 08:53:47 2009
> or
> Reading checkpoint file dppc_md.cpt generated: Fri Mar 20 08:58:08 2009
>
> Loaded with Money
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)...................: MPI_Bcast(buf=0x7fffc33242dc, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast(229)...................:
> MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 12 bytes received but buffer size is 4
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)...................: MPI_Bcast(buf=0x7fff6c0da09c, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast(229)...................:
> MPIDI_CH3U_Receive_data_found(254): Message from rank 4 and tag 2 truncated; 12 bytes received but buffer size is 4
> Fatal error in MPI_Bcast:
> Message truncated, error stack:
> MPI_Bcast(1145)...................: MPI_Bcast(buf=0x7fff9ac2ebec, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast(229)...................:
> MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 12 bytes received but buffer size is 4
> rank 16 in job 5 Node115_33001 caused collective abort of all ranks
> exit status of rank 16: killed by signal 9
> rank 8 in job 5 Node115_33001 caused collective abort of all ranks
> exit status of rank 8: killed by signal 9
> rank 6 in job 5 Node115_33001 caused collective abort of all ranks
> exit status of rank 6: killed by signal 9
>
> Can someone help me with this problem? I'd appreciate any help in advance!
This probably isn't intrinsically related to GROMACS. More likely, something
has changed in the way your MPI is configured between the earlier run and the
restart. To help diagnose it, simplify the problem to fewer than 24(?)
processors, see whether you can re-run the earlier calculation now, and/or
test that other simple MPI programs still work.
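
For that last test, a minimal sketch along these lines (the file names, the
reduced process count, and the use of mpicc are only illustrative) exercises
the same MPI_Bcast collective that is failing above, and then retries the
checkpoint restart on a few processors with fresh output names:

# Sanity-check the MPI stack on its own, independent of GROMACS
# (bcast_test.c is just an example name)
cat > bcast_test.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 42;                        /* root sets the value */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD); /* same collective that fails above */
    printf("rank %d got %d\n", rank, value);          /* every rank should report 42 */
    MPI_Finalize();
    return 0;
}
EOF
mpicc bcast_test.c -o bcast_test
mpiexec -machinefile ./mf_24 -np 24 ./bcast_test

# Then retry the checkpoint restart on just a few processors, writing to new
# output names so nothing existing is appended to or overwritten
mpiexec -machinefile ./mf_24 -np 4 mdrun -v -cpi dppc_md.cpt -cpo dppc_md_test.cpt -s dppc_md.tpr -o dppc_md_test.trr -g dppc_md_test.log -e dppc_md_test.edr

If the broadcast test fails in the same way, the problem is in the MPI setup
rather than in the checkpoint files.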
Mark