[gmx-users] Fatal error in MPI_Allreduce upon REMD restart
Szilárd Páll
szilard.pall at cbr.su.se
Tue Oct 25 21:06:17 CEST 2011
Hi,
Firstly, you're not using the latest version, and there might have been
a fix for your issue in the 4.5.5 patch release.
Secondly, you should check the http://redmine.gromacs.org bug tracker
to see which bugs have been fixed in 4.5.5 (ideally, the target version
of each issue should tell you). You can also just search for REMD and see
which matching bugs (open or closed) are in the database:
http://redmine.gromacs.org/search/index/gromacs?issues=1&q=REMD
Cheers,
--
Szilárd
On Tue, Oct 25, 2011 at 8:04 PM, Ben Reynwar <ben at reynwar.net> wrote:
> Hi all,
>
> I'm getting errors in MPI_Allreduce when I restart an REMD simulation.
> It has occurred every time I have attempted an REMD restart.
> I'm posting here to check there's not something obviously wrong with
> the way I'm doing the restart which is causing it.
>
> I restart an REMD run using:
>
> -----------------------------------------------------------------------------------------------------------------------------------------
> basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
> status=${basedir}/pshsp_andva_run1_.status
> deffnm=${basedir}/pshsp_andva_run1_
> cpt=${basedir}/pshsp_andva_run0_.cpt
> tpr=${basedir}/pshsp_andva_run0_.tpr
> log=${basedir}/pshsp_andva_run1_0.log
> n_procs=32
>
> echo "about to check if log file exists"
> if [ ! -e $log ]; then
> echo "RUNNING" > $status
> source /usr/share/modules/init/bash
> module load intel-mpi
> module load intel-mkl
> module load gromacs
> echo "Calling mdrun"
> mpirun -np 32 mdrun-mpi -maxh 24 -multi 16 -replex 1000 -s $tpr -cpi $cpt -deffnm $deffnm
> retval=$?
> if [ $retval != 0 ]; then
> echo "ERROR" > $status
> exit 1
> fi
> echo "FINISHED" > $status
> fi
> exit 0
> ------------------------------------------------------------------------------------------------------------------------------------------
>
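One thing that can be ruled out cheaply before launching is a missing per-replica input file. The "Reading file" lines further down show that with -multi 16 mdrun opens pshsp_andva_run0_0.tpr through pshsp_andva_run0_15.tpr, so the replica index is inserted before the extension. The sketch below assumes the checkpoint files follow the same naming; the function name check_replica_files is made up for illustration, and this check does not address the MPI error itself.

```sh
# Hypothetical pre-flight check: verify that every replica has its input files.
# Assumption: -multi inserts the replica index before the extension
# (run0_.tpr -> run0_0.tpr ... run0_15.tpr), and .cpt files are named the same way.
check_replica_files() {
    base=$1      # path without index and extension, e.g. /path/pshsp_andva_run0_
    n_rep=$2     # number of replicas passed to -multi
    missing=0
    i=0
    while [ "$i" -lt "$n_rep" ]; do
        for ext in tpr cpt; do
            f="${base}${i}.${ext}"
            if [ ! -e "$f" ]; then
                echo "missing: $f" >&2
                missing=$((missing + 1))
            fi
        done
        i=$((i + 1))
    done
    # Succeed only if no file is missing
    [ "$missing" -eq 0 ]
}
```

In the script above it could be called just before the mpirun line, for example as: check_replica_files "${basedir}/pshsp_andva_run0_" 16 || exit 1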
> mdrun then gets stuck and doesn't output anything until it is
> terminated by the queuing system.
> Upon termination the following output is written to stderr.
>
> [cli_5]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_31]: [cli_11]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f489806bf60, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fd960002fc0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_7]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754400, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_9]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x757230, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_27]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fb3cc02a450, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_23]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x750970, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_21]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7007b0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_3]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754360, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_29]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x756460, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_19]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f60a0066850, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_17]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f4bdc07b690, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_1]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_15]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fc31407c830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_25]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6e1830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> [cli_13]: aborting job:
> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6c2430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
> MPI_Allreduce(1051): Null communicator
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_0.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_7.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_6.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_1.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_4.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_5.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_2.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_11.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_9.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_8.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_10.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_15.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_13.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_12.tpr, VERSION 4.5.4 (single precision)
> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_14.tpr, VERSION 4.5.4 (single precision)
> Terminated
>
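To see at a glance which ranks hit the error, the rank numbers can be pulled out of the [cli_N] prefixes. A minimal sketch, assuming the stderr above has been saved to a file; the function name aborting_ranks is made up for illustration, and grep -o is assumed to be available (GNU and BSD grep both provide it).

```sh
# Hypothetical helper: print the distinct rank numbers found in "[cli_N]"
# prefixes of a saved stderr file, one per line, sorted numerically.
aborting_ranks() {
    grep -o '\[cli_[0-9][0-9]*\]' "$1" | tr -cd '0-9\n' | sort -n -u
}
```

Applied to the output quoted above, this lists the 16 odd ranks from 1 to 31 and no even ones. If ranks are assigned to the 16 replicas in consecutive pairs, that would mean one of the two ranks in each replica makes the MPI_Allreduce call with a null communicator. That is only an observation from the log, not a diagnosis.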
> Cheers,
> Ben