[gmx-users] Fatal error in MPI_Allreduce upon REMD restart

Mark Abraham Mark.Abraham at anu.edu.au
Wed Oct 26 01:33:33 CEST 2011


On 26/10/2011 6:06 AM, Szilárd Páll wrote:
> Hi,
>
> Firstly, you're not using the latest version and there might have been
> a fix for your issue in the 4.5.5 patch release.

There was a bug in 4.5.5, not present in 4.5.4, that could have produced 
such symptoms; it was fixed without a Redmine issue being created for it.
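
Since the job script below only does a bare "module load gromacs", it is 
worth confirming which GROMACS release that actually resolves to before 
drawing conclusions. A minimal check might look like the sketch below; 
the module names are site-specific, and this assumes your 4.5.x build 
accepts -version (the same information is printed in the banner at the 
top of each mdrun .log file):

source /usr/share/modules/init/bash
module avail gromacs   # list the GROMACS builds installed on the cluster
module show gromacs    # see which build the default "gromacs" module points at
module load gromacs
mdrun-mpi -version     # what the loaded mdrun binary reports about itself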

> Secondly, you should check the http://redmine.gromacs.org bugtracker
> to see which bugs have been fixed in 4.5.5 (ideally the target-version
> field of each issue should tell you). You can also just search for REMD
> and see what matching bugs (open or closed) are in the database:
> http://redmine.gromacs.org/search/index/gromacs?issues=1&q=REMD

If the OP is right that this was with 4.5.4, and it can be reproduced with 
4.5.5, please do some testing (e.g. do different parallel regimes produce 
the same symptoms? can the individual replicas run as non-REMD 
simulations?) and file a Redmine issue with your observations and a small 
sample case.
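
For concreteness, those tests might look something like the sketch below. 
This is only a sketch: the per-replica file names are inferred from the 
script and stderr quoted further down, and -noappend is used so the test 
runs do not try to append to your existing output files.

cd /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_   # basedir from the script below

# (a) Can an individual replica continue on its own, outside -multi? (replica 3 shown)
mpirun -np 2 mdrun-mpi -maxh 1 -noappend -s pshsp_andva_run0_3.tpr \
    -cpi pshsp_andva_run0_3.cpt -deffnm test_single_3

# (b) Does a different parallel regime change the symptoms, e.g. one MPI rank per replica?
mpirun -np 16 mdrun-mpi -maxh 1 -noappend -multi 16 -replex 1000 \
    -s pshsp_andva_run0_.tpr -cpi pshsp_andva_run0_.cpt -deffnm test_np16_

# (c) Is the problem specific to the checkpoint restart, i.e. does the same
#     -multi run start cleanly when -cpi is left out?
mpirun -np 32 mdrun-mpi -maxh 1 -multi 16 -replex 1000 \
    -s pshsp_andva_run0_.tpr -deffnm test_fresh_

If the individual replicas and the fresh -multi run behave, but every 
-cpi restart under -multi hangs like this, that is exactly the kind of 
observation worth putting in the Redmine report.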

Mark

>
> Cheers,
> --
> Szilárd
>
>
>
> On Tue, Oct 25, 2011 at 8:04 PM, Ben Reynwar <ben at reynwar.net> wrote:
>> Hi all,
>>
>> I'm getting errors in MPI_Allreduce when I restart an REMD simulation.
>> It has occurred every time I have attempted an REMD restart.
>> I'm posting here to check there's not something obviously wrong with
>> the way I'm doing the restart which is causing it.
>>
>> I restart an REMD run using:
>>
>> -----------------------------------------------------------------------------------------------------------------------------------------
>> basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
>> status=${basedir}/pshsp_andva_run1_.status
>> deffnm=${basedir}/pshsp_andva_run1_
>> cpt=${basedir}/pshsp_andva_run0_.cpt
>> tpr=${basedir}/pshsp_andva_run0_.tpr
>> log=${basedir}/pshsp_andva_run1_0.log
>> n_procs=32
>>
>> echo "about to check if log file exists"
>> if [ ! -e $log ]; then
>>     echo "RUNNING">  $status
>>     source /usr/share/modules/init/bash
>>     module load intel-mpi
>>     module load intel-mkl
>>     module load gromacs
>>     echo "Calling mdrun"
>>     mpirun -np 32 mdrun-mpi -maxh 24 -multi 16 -replex 1000 -s $tpr -cpi $cpt -deffnm $deffnm
>>     retval=$?
>>     if [ $retval != 0 ]; then
>>         echo "ERROR">  $status
>>         exit 1
>>     fi
>>     echo "FINISHED">  $status
>> fi
>> exit 0
>> ------------------------------------------------------------------------------------------------------------------------------------------
>>
>> mdrun then gets stuck and doesn't output anything until it is
>> terminated by the queuing system.
>> Upon termination the following output is written to stderr.
>>
>> [cli_5]: aborting job:
>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE,
>> rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>> MPI_Allreduce(1051): Null communicator
>> [The same "Invalid communicator ... MPI_COMM_NULL ... Null communicator"
>> error was also printed by cli_1, cli_3, cli_7, cli_9, cli_11, cli_13,
>> cli_15, cli_17, cli_19, cli_21, cli_23, cli_25, cli_27, cli_29 and
>> cli_31; only the rbuf addresses differ.]
>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr,
>> VERSION 4.5.4 (single precision)
>> [Equivalent "Reading file ... VERSION 4.5.4 (single precision)" lines
>> followed for the other 15 replica input files, pshsp_andva_run0_0.tpr
>> through pshsp_andva_run0_15.tpr.]
>> Terminated
>>
>> Cheers,
>> Ben



