[gmx-users] Fatal error in MPI_Allreduce upon REMD restart
Ben Reynwar
ben at reynwar.net
Tue Oct 25 20:04:02 CEST 2011
Hi all,
I'm getting errors in MPI_Allreduce when I restart an REMD simulation.
It has occurred every time I have attempted an REMD restart.
I'm posting here to check that there isn't something obviously wrong with
the way I'm doing the restart that is causing it.
I restart an REMD run using:
-----------------------------------------------------------------------------------------------------------------------------------------
basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
status=${basedir}/pshsp_andva_run1_.status
deffnm=${basedir}/pshsp_andva_run1_
cpt=${basedir}/pshsp_andva_run0_.cpt
tpr=${basedir}/pshsp_andva_run0_.tpr
log=${basedir}/pshsp_andva_run1_0.log
n_procs=32
echo "about to check if log file exists"
if [ ! -e $log ]; then
    echo "RUNNING" > $status
    source /usr/share/modules/init/bash
    module load intel-mpi
    module load intel-mkl
    module load gromacs
    echo "Calling mdrun"
    mpirun -np 32 mdrun-mpi -maxh 24 -multi 16 -replex 1000 -s $tpr \
        -cpi $cpt -deffnm $deffnm
    retval=$?
    if [ $retval != 0 ]; then
        echo "ERROR" > $status
        exit 1
    fi
    echo "FINISHED" > $status
fi
exit 0
------------------------------------------------------------------------------------------------------------------------------------------
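For what it's worth, my understanding is that with -multi 16 mdrun appends the replica index to the -s and -cpi base names (the stderr below shows it reading pshsp_andva_run0_0.tpr through pshsp_andva_run0_15.tpr). A quick sanity check along these lines (just a sketch; the .cpt naming is my assumption about how the previous run wrote its checkpoints) confirms the per-replica files are there:
---------------------------------------------------------------------------
# Sketch: check that each replica's .tpr and .cpt exist before launching.
# Assumes the checkpoints follow the same <base><index>.cpt naming as the .tpr files.
for i in $(seq 0 15); do
    for f in "${tpr%.tpr}${i}.tpr" "${cpt%.cpt}${i}.cpt"; do
        [ -e "$f" ] || echo "missing: $f"
    done
done
---------------------------------------------------------------------------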
mdrun then gets stuck and doesn't output anything until it is
terminated by the queuing system.
Upon termination the following output is written to stderr.
[cli_5]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_31]: [cli_11]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f489806bf60, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fd960002fc0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_7]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754400, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_9]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x757230, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_27]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fb3cc02a450, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_23]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x750970, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_21]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7007b0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_3]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754360, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_29]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x756460, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_19]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f60a0066850, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_17]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f4bdc07b690, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_1]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_15]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fc31407c830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_25]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6e1830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_13]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6c2430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_0.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_7.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_6.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_1.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_4.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_5.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_2.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_11.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_9.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_8.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_10.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_15.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_13.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_12.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_14.tpr, VERSION 4.5.4 (single precision)
Terminated
Cheers,
Ben