[gmx-users] Fatal error in MPI_Allreduce upon REMD restart

Ben Reynwar ben at reynwar.net
Tue Oct 25 20:04:02 CEST 2011


Hi all,

I'm getting errors in MPI_Allreduce when I restart an REMD simulation;
it has happened every time I have attempted an REMD restart.
I'm posting here to check that there isn't something obviously wrong with
the way I'm doing the restart that could be causing it.

I restart an REMD run using:

-----------------------------------------------------------------------------------------------------------------------------------------
basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
status=${basedir}/pshsp_andva_run1_.status
deffnm=${basedir}/pshsp_andva_run1_
cpt=${basedir}/pshsp_andva_run0_.cpt
tpr=${basedir}/pshsp_andva_run0_.tpr
log=${basedir}/pshsp_andva_run1_0.log
n_procs=32

echo "about to check if log file exists"
if [ ! -e $log ]; then
    echo "RUNNING" > $status
    source /usr/share/modules/init/bash
    module load intel-mpi
    module load intel-mkl
    module load gromacs
    echo "Calling mdrun"
    mpirun -np 32 mdrun-mpi -maxh 24 -multi 16 -replex 1000 \
        -s $tpr -cpi $cpt -deffnm $deffnm
    retval=$?
    if [ $retval != 0 ]; then
        echo "ERROR" > $status
        exit 1
    fi
    echo "FINISHED" > $status
fi
exit 0
------------------------------------------------------------------------------------------------------------------------------------------
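In case it's relevant: before launching I can check that all the per-replica
input files are in place with something like the sketch below. It assumes that
with -multi 16 the checkpoints are named pshsp_andva_run0_0.cpt through
pshsp_andva_run0_15.cpt, mirroring the per-replica .tpr names that show up in
the mdrun output further down; the .cpt naming is my assumption, not something
I've verified.

-----------------------------------------------------------------------------------------------------------------------------------------
# Sketch only: check that every per-replica .tpr and .cpt exists before calling mdrun.
# Assumes -multi appends the replica index before the extension for the checkpoints,
# as it visibly does for the .tpr files (the .cpt pattern is an assumption).
basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
missing=0
for i in $(seq 0 15); do
    for f in ${basedir}/pshsp_andva_run0_${i}.tpr ${basedir}/pshsp_andva_run0_${i}.cpt; do
        if [ ! -e $f ]; then
            echo "missing: $f"
            missing=1
        fi
    done
done
if [ $missing -eq 0 ]; then
    echo "all 16 per-replica .tpr/.cpt files present"
fi
-----------------------------------------------------------------------------------------------------------------------------------------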

mdrun then hangs and produces no further output until it is terminated
by the queuing system. Upon termination, the following is written to stderr:

[cli_5]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_31]: [cli_11]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f489806bf60, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fd960002fc0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_7]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754400, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_9]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x757230, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_27]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fb3cc02a450, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_23]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x750970, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_21]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7007b0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_3]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754360, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_29]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x756460, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_19]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f60a0066850, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_17]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f4bdc07b690, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_1]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_15]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fc31407c830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_25]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6e1830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
[cli_13]: aborting job:
Fatal error in MPI_Allreduce: Invalid communicator, error stack:
MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6c2430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
MPI_Allreduce(1051): Null communicator
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_0.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_7.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_6.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_1.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_4.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_5.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_2.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_11.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_9.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_8.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_10.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_15.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_13.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_12.tpr, VERSION 4.5.4 (single precision)
Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_14.tpr, VERSION 4.5.4 (single precision)
Terminated

Cheers,
Ben
