[gmx-developers] remd crash with git head

David van der Spoel spoel at xray.bmc.uu.se
Wed Jul 7 08:05:04 CEST 2010


On 7/6/10 7:36 PM, David van der Spoel wrote:
> On 7/6/10 4:57 PM, David van der Spoel wrote:
>> On 2010-07-06 16.19, hess at sbc.su.se wrote:
>>>> On 2010-07-06 15.23, hess at sbc.su.se wrote:
>>>>> Hi,
>>>>>
>>>>> I introduced new mdp parameters today.
>>>>> If this is with code of today, please check that you recompiled
>>>>> everything.
>>>>
>>>> Doesn't help unfortunately. Which mdp parameters do you mean? Do they
>>>> influence REMD? I found nothing obvious in mdout.mdp.
>>>
>>> No, unused nst/pcouple and fe parameters.
>>>
>>> I just did git pull and make.
>>> remd works for me both with old/new tpr files and the new mdrun.
>>
>>
>> static void repl_quantity(FILE *fplog,const gmx_multisim_t *ms,
>> struct gmx_repl_ex *re,int ere,real q)
>>
>>
>> It goes wrong in gmx_sumf_sim:
>> only the first half of the array gets summed, and with weird values.
>> It looks like it is being treated somewhere as a half-length array of doubles.
> It has been resolved. Any code using this function should have crashed.
>
> Fixed bug in gmx_sumf_comm where a double array was passed to an MPI
> function expecting a float array.
>
> Now after the next crash:
>
> Fatal error:
> Can not find an appropriate interval for inter-simulation communication,
> since nstlist (5), nstcalcenergy (5) and -replex (2500) are all <= 0
>
>
This seems to be due to a bug in MPI_Allreduce when called from the 
following routine (note that I added debug statements). Each processor 
fills one array element, corresponding to its CPU id, and then the whole 
integer array is summed, so the result should be the same array on all nodes.

void gmx_sumi_sim(int nr,int r[], const gmx_multisim_t *ms)
{
#ifndef GMX_MPI
    gmx_call("gmx_sumi");
#else
#if defined(MPI_IN_PLACE_EXISTS) || defined(GMX_THREADS)
    /* sum in place across the master nodes of all simulations */
    MPI_Allreduce(MPI_IN_PLACE,r,nr,MPI_INT,MPI_SUM,ms->mpi_comm_masters);
#else
    /* this is thread-unsafe, but it will do for now: */
    int i;

    /* make sure the shared int summation buffer is large enough */
    if (nr > ms->mpb->ibuf_alloc) {
        ms->mpb->ibuf_alloc = nr;
        srenew(ms->mpb->ibuf,ms->mpb->ibuf_alloc);
    }
    for(i=0; (i<nr); i++)
    {
        ms->mpb->ibuf[i] = 0;
        printf("b4 allreduce node %d r[%d] = %d\n",ms->sim,i,r[i]);
    }
    /* sum into the buffer, then copy the result back into r */
    MPI_Allreduce(r,ms->mpb->ibuf,nr,MPI_INT,MPI_SUM,ms->mpi_comm_masters);
    for(i=0; i<nr; i++)
    {
        r[i] = ms->mpb->ibuf[i];
        printf("after allreduce node %d r[%d] = %d\n",ms->sim,i,r[i]);
    }
#endif
#endif
}

Result is:
b4 allreduce node 0 r[0] = 26280
b4 allreduce node 0 r[1] = 0
b4 allreduce node 0 r[2] = 0
b4 allreduce node 0 r[3] = 0
b4 allreduce node 0 r[4] = 0
b4 allreduce node 0 r[5] = 0
b4 allreduce node 0 r[6] = 0
b4 allreduce node 0 r[7] = 0
<snip>
after allreduce node 0 r[0] = 0
after allreduce node 0 r[1] = 0
after allreduce node 0 r[2] = 0
after allreduce node 0 r[3] = 0
after allreduce node 0 r[4] = 0
after allreduce node 0 r[5] = 0
after allreduce node 0 r[6] = 0
after allreduce node 0 r[7] = 0

In other words, the buffer does not get summed. What is wrong here?

OK, formulating an email at least makes you think again. This was the 
same problem as yesterday: the int buffer was declared as float *ibuf in 
commrec.h (yesterday it was a float buffer declared as double in network.c).
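
For the record, the failure mode is easy to reproduce outside GROMACS. Below 
is a minimal standalone sketch (the struct and field names, sum_buf_t and 
ibuf, are made up for illustration and are not the actual commrec.h 
declarations): with the buffer declared as float *ibuf, MPI_Allreduce still 
writes the correct int bytes into it, but reading the result back through a 
float lvalue reinterprets each int bit pattern as a tiny denormal float, 
which converts back to 0 on assignment, matching the all-zero output above. 
Declaring the buffer as int * fixes it.

/* Minimal illustration of the buffer-type bug; names are made up,
 * this is not the actual GROMACS code. Compile with an MPI C compiler
 * (e.g. mpicc) and run with a few ranks. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct {
    int *ibuf;        /* correct; the bug was declaring this as float *ibuf */
    int  ibuf_alloc;
} sum_buf_t;

int main(int argc, char *argv[])
{
    int        rank, nranks, i;
    int       *r;
    sum_buf_t  buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    buf.ibuf_alloc = nranks;
    buf.ibuf       = calloc(nranks, sizeof(*buf.ibuf));
    r              = calloc(nranks, sizeof(*r));

    /* each rank fills only its own slot, as in the REMD setup code */
    r[rank] = 100 + rank;

    MPI_Allreduce(r, buf.ibuf, nranks, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    for (i = 0; i < nranks; i++)
    {
        /* with a float *ibuf this read converts a denormal float to 0 */
        r[i] = buf.ibuf[i];
    }
    if (rank == 0)
    {
        for (i = 0; i < nranks; i++)
        {
            printf("summed r[%d] = %d\n", i, r[i]);
        }
    }
    free(r);
    free(buf.ibuf);
    MPI_Finalize();
    return 0;
}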

I guess most MPI installations have MPI_IN_PLACE, so that the buffered 
code path (and with it this bug) has been overlooked so far?
-- 
David.
________________________________________________________________________
David van der Spoel, PhD, Professor of Biology
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,  	75124 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


