[gmx-users] Simulation time losses with REMD
Mark Abraham
Mark.Abraham at anu.edu.au
Sat Jan 29 09:23:30 CET 2011
On 28/01/2011 4:46 PM, Mark Abraham wrote:
> Hi,
>
> I compared the .log file time accounting for the same .tpr file run
> alone in serial or as part of an REMD simulation (with each replica on
> a single processor). It ran about 5-10% slower in the latter. The
> effect was a bit larger when comparing the same .tpr on 8 processors
> against REMD with 8 processors per replica. The effect seems fairly
> independent of whether I compare the lowest or highest replica.
OK, I found the issue by binary-searching the code for the offending
line. It's in compute_globals() in src/kernel/md.c: the call to
gmx_sum_sim consumes all the extra time. This code takes care of
synchronizing signals between the simulations, for possibly doing
checkpointing.
    if (MULTISIM(cr) && bInterSimGS)
    {
        if (MASTER(cr))
        {
            /* Communicate the signals between the simulations */
            gmx_sum_sim(eglsNR, gs_buf, cr->ms);
        }
        /* Communicate the signals from the master to the others */
        gmx_bcast(eglsNR*sizeof(gs_buf[0]), gs_buf, cr);
    }
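The key point is that gmx_sum_sim is a collective over the replica
masters, so it acts as an implicit barrier: every replica stalls until
the slowest one reaches the signalling step. A minimal standalone
sketch (not GROMACS code; compile with mpicc and run with, say,
mpirun -np 8) that demonstrates the effect:

/* Each rank does an unequal amount of "MD work" per step, then all
 * ranks meet in an Allreduce. Every rank ends up reporting roughly
 * the wall time of the slowest rank. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int   rank, nranks, step;
    float signal = 0.0f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double t0 = MPI_Wtime();
    for (step = 0; step < 100; step++)
    {
        /* unequal work: rank i takes (i+1) ms per step */
        usleep((rank + 1)*1000);
        /* the collective: nobody proceeds until everyone arrives */
        MPI_Allreduce(MPI_IN_PLACE, &signal, 1, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);
    }
    printf("rank %d of %d: %.3f s\n", rank, nranks, MPI_Wtime() - t0);
    MPI_Finalize();
    return 0;
}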
This eventually calls
void gmx_sumf_comm(int nr, float r[], MPI_Comm mpi_comm)
{
#if defined(MPI_IN_PLACE_EXISTS) || defined(GMX_THREADS)
    MPI_Allreduce(MPI_IN_PLACE, r, nr, MPI_FLOAT, MPI_SUM, mpi_comm);
#else
    /* this function is only used in code that is not performance
       critical (during setup, when comm_rec is not the appropriate
       communication structure), so this isn't as bad as it looks. */
    float *buf;
    int    i;

    snew(buf, nr);
    MPI_Allreduce(r, buf, nr, MPI_FLOAT, MPI_SUM, mpi_comm);
    for (i = 0; i < nr; i++)
    {
        r[i] = buf[i];
    }
    sfree(buf);
#endif
}
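As an aside, on builds that take the fallback branch, every call also
pays a heap allocation and a copy. A hypothetical variant (not in the
tree, error handling omitted) that caches the scratch buffer would
remove that part of the cost, though not the synchronization, which I
suspect is what dominates here:

#include <stdlib.h>
#include <mpi.h>

static float *scratch    = NULL;
static int    scratch_nr = 0;

void gmx_sumf_comm_cached(int nr, float r[], MPI_Comm mpi_comm)
{
    int i;

    if (nr > scratch_nr)
    {
        /* grow the scratch buffer once, reuse it thereafter */
        scratch    = realloc(scratch, nr*sizeof(*scratch));
        scratch_nr = nr;
    }
    MPI_Allreduce(r, scratch, nr, MPI_FLOAT, MPI_SUM, mpi_comm);
    for (i = 0; i < nr; i++)
    {
        r[i] = scratch[i];
    }
}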
Clearly that comment is out of date. With my settings of nstlist=5,
repl_ex_nst=2500 and nstcalcenergy=-1, gs.nstms gets set to 5, and so
bInterSimGS is TRUE every 5 steps. I'm not sure whether the problem
lies with nstlist, or with the multi-simulation checkpointing
engineering, or something else.
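For scale, a quick count with my numbers (a sketch; whether all of
these collectives are really needed between exchange attempts is
exactly my question):

#include <stdio.h>

int main(void)
{
    int nstms       = 5;    /* inter-sim signalling interval (steps) */
    int repl_ex_nst = 2500; /* replica-exchange attempt interval     */

    /* 2500/5 = 500 inter-simulation collectives per exchange
     * attempt, each an implicit barrier across all replicas */
    printf("inter-sim collectives per exchange attempt: %d\n",
           repl_ex_nst/nstms);
    return 0;
}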
Mark