[gmx-users] Fatal error in MPI_Allreduce upon REMD restart
Ben Reynwar
ben at reynwar.net
Fri Oct 28 01:21:29 CEST 2011
On Thu, Oct 27, 2011 at 4:09 PM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
> On 28/10/2011 9:29 AM, Ben Reynwar wrote:
>>
>> I found a post on the devel list from a couple of weeks ago where a
>> fix is given and it appears to work for me.
>> The link is
>> http://lists.gromacs.org/pipermail/gmx-developers/2011-October/005405.html
>>
>> The fix does not appear to have been incorporated into the 4.5.5 release.
>
> Yes, this is the bug in 4.5.5 to which I referred. Your original post didn't
> state a GROMACS version, but your output suggested 4.5.4, else I'd have told
> you this was the bug... instead everyone wasted some time :)
>
> Mark
>
I thought I was using 4.5.4 too. It turns out I wasn't.
Cheers,
Ben
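
A minimal sketch for catching this kind of version mix-up. It assumes that
the "VERSION x.y.z" string on the "Reading file" lines reflects the .tpr file
rather than the mdrun binary, which is what this thread suggests.
versions_in_text is a hypothetical helper (not part of GROMACS), and the
commented-out check assumes the mdrun-mpi binary name used in this thread
and a -version option on it.

```bash
#!/usr/bin/env bash
# Print each distinct version number (e.g. 4.5.4) that follows the word
# VERSION in the text read from stdin. Hypothetical helper for illustration.
versions_in_text() {
    grep -oE 'VERSION [0-9]+(\.[0-9]+)+' | awk '{print $2}' | sort -u
}

# Versions mentioned in a captured stderr/log file (placeholder file name):
#   versions_in_text < job.stderr
# Version reported by the binary that is actually first on PATH
# (assumption: the binary prints a "VERSION x.y.z" string for -version):
#   command -v mdrun-mpi && mdrun-mpi -version 2>&1 | versions_in_text
```

If the two commands print different numbers, the loaded module is not the
version the .tpr files were written with.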
>>
>> On Tue, Oct 25, 2011 at 4:33 PM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>>>
>>> On 26/10/2011 6:06 AM, Szilárd Páll wrote:
>>>>
>>>> Hi,
>>>>
>>>> Firstly, you're not using the latest version and there might have been
>>>> a fix for your issue in the 4.5.5 patch release.
>>>
>>> There was a bug in 4.5.5 that was not present in 4.5.4 that could have
>>> produced such symptoms, but it was fixed without creating a Redmine
>>> issue.
>>>
>>>> Secondly, you should check the http://redmine.gromacs.org bugtracker
>>>> to see what bugs have been fixed in 4.5.5 (ideally the target version
>>>> should tell). You can also just do a search for REMD and see what
>>>> matching bugs (open or closed) are in the database:
>>>> http://redmine.gromacs.org/search/index/gromacs?issues=1&q=REMD
>>>
>>> If the OP is right and this was with 4.5.4 and can be reproduced with
>>> 4.5.5, please do some testing (e.g. Do different parallel regimes produce
>>> the same symptoms? Can the individual replicas run in a non-REMD
>>> simulation?) and file a Redmine issue with your observations and a small
>>> sample case.
>>>
>>> Mark
>>>
>>>> Cheers,
>>>> --
>>>> Szilárd
>>>>
>>>>
>>>>
>>>> On Tue, Oct 25, 2011 at 8:04 PM, Ben Reynwar<ben at reynwar.net> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm getting errors in MPI_Allreduce when I restart an REMD simulation.
>>>>> It has occurred every time I have attempted an REMD restart.
>>>>> I'm posting here to check there's not something obviously wrong with
>>>>> the way I'm doing the restart which is causing it.
>>>>>
>>>>> I restart an REMD run using:
>>>>>
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------------------------------------------------------------------------
>>>>> basedir=/scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_
>>>>> status=${basedir}/pshsp_andva_run1_.status
>>>>> deffnm=${basedir}/pshsp_andva_run1_
>>>>> cpt=${basedir}/pshsp_andva_run0_.cpt
>>>>> tpr=${basedir}/pshsp_andva_run0_.tpr
>>>>> log=${basedir}/pshsp_andva_run1_0.log
>>>>> n_procs=32
>>>>>
>>>>> echo "about to check if log file exists"
>>>>> if [ ! -e $log ]; then
>>>>>     echo "RUNNING" > $status
>>>>>     source /usr/share/modules/init/bash
>>>>>     module load intel-mpi
>>>>>     module load intel-mkl
>>>>>     module load gromacs
>>>>>     echo "Calling mdrun"
>>>>>     mpirun -np 32 mdrun-mpi -maxh 24 -multi 16 -replex 1000 -s $tpr \
>>>>>         -cpi $cpt -deffnm $deffnm
>>>>>     retval=$?
>>>>>     if [ $retval != 0 ]; then
>>>>>         echo "ERROR" > $status
>>>>>         exit 1
>>>>>     fi
>>>>>     echo "FINISHED" > $status
>>>>> fi
>>>>> exit 0
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------------------------
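
The launch step in the script above could also be written with the command
held in a bash array, so that all arguments stay part of one command however
the text is wrapped. This is only a sketch: build_mdrun_cmd and MDRUN_CMD are
hypothetical names, the file names in the dry run are placeholders, and the
flags are simply the ones used in the script above.

```bash
#!/usr/bin/env bash
# Build the mdrun launch command as an array (hypothetical helper).
# Arguments: process count, replica count, .tpr prefix, .cpt prefix, deffnm.
build_mdrun_cmd() {
    local n_procs=$1 n_replicas=$2 tpr=$3 cpt=$4 deffnm=$5
    MDRUN_CMD=(mpirun -np "$n_procs" mdrun-mpi -maxh 24
               -multi "$n_replicas" -replex 1000
               -s "$tpr" -cpi "$cpt" -deffnm "$deffnm")
}

# Dry run: print the command instead of executing it.
build_mdrun_cmd 32 16 run0_.tpr run0_.cpt run1_
printf '%s ' "${MDRUN_CMD[@]}"; echo

# To launch for real, expand the array as a command:
#   "${MDRUN_CMD[@]}"
```

Printing the command first makes it easy to confirm that -cpi and -deffnm
are really being passed to mdrun before a long job is queued.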
>>>>>
>>>>> mdrun then gets stuck and doesn't output anything until it is
>>>>> terminated by the queuing system.
>>>>> Upon termination the following output is written to stderr.
>>>>>
>>>>> [cli_5]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2b379c00b770, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_31]: [cli_11]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f489806bf60, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fd960002fc0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_7]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754400, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_9]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x757230, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_27]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fb3cc02a450, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_23]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x750970, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_21]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7007b0, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_3]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754360, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_29]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x756460, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_19]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f60a0066850, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_17]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7f4bdc07b690, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_1]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x754430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_15]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fc31407c830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_25]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6e1830, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> [cli_13]: aborting job:
>>>>> Fatal error in MPI_Allreduce: Invalid communicator, error stack:
>>>>> MPI_Allreduce(1175): MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x6c2430, count=16, MPI_INT, MPI_SUM, MPI_COMM_NULL) failed
>>>>> MPI_Allreduce(1051): Null communicator
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_3.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_0.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_7.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_6.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_1.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_4.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_5.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_2.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_11.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_9.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_8.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_10.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_15.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_13.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_12.tpr, VERSION 4.5.4 (single precision)
>>>>> Reading file /scr2/benreynwar/home-ben-sHSP-REMD-pshsp_andva_run1_/pshsp_andva_run0_14.tpr, VERSION 4.5.4 (single precision)
>>>>> Terminated
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>> --
>>>>> gmx-users mailing list gmx-users at gromacs.org
>>>>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>>>>> Please search the archive at
>>>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>>>> Please don't post (un)subscribe requests to the list. Use the
>>>>> www interface or send it to gmx-users-request at gromacs.org.
>>>>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>
>