[gmx-users] Why REMD simulation becomes so slow when the number of replicas becomes large?

Mark Abraham Mark.Abraham at anu.edu.au
Mon Feb 7 13:01:16 CET 2011

On 7/02/2011 9:52 PM, Qiong Zhang wrote:
> Dear all gmx-users,
> I have recently been testing the REMD simulations. I was running 
> simulations on a supercomputer system based on AMD Opteron 12-core 
> (2.1 GHz) processors, using Gromacs version 4.5.3.
> I have a system of 5172 atoms, of which 138 belong to the solute and 
> the rest are water molecules. An exponential distribution of 
> temperatures was generated ranging from 276 to 515 K in total of 42 
> replicas or from 298 to 420 K in total of 24 replicas, ensuring that 
> the exchange ratio between all adjacent replicas is about 0.25. The 
> replica exchange was attempted every 0.5 ps. The integration time step 
> was 2 fs.
> For the above system, when REMD is simulated over 24 replicas, the 
> simulation speed is reasonably fast. However, when REMD is simulated 
> over 42 replicas, the simulation speed is awfully slow. Please see the 
> following table for the speeds.
> ----------------------------------------------------------------------------
> Replicas    CPUs    Speed
>    24        96     58015 steps / 15 minutes
>    42        42       865 steps / 15 minutes
>    42        84      1175 steps / 15 minutes
>    42       168      1875 steps / 15 minutes
>    42       336      2855 steps / 15 minutes
> The command line for the mdrun is:
> aprun -n (CPU number here) mdrun_d -s md.tpr -multi (replica number 
> here) -replex 250
> My questions are:
> 1) Why is REMD with 42 replicas so slow for the same system?
> 2) In what ways can I improve the efficiency?
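
An exponential (geometric) temperature ladder like the one described above can be sketched in a few lines. This is a generic illustration using the endpoint temperatures from the post, not the poster's actual generator; whether the resulting spacing really yields a near-constant ~0.25 exchange ratio also depends on the system's heat capacity.

```python
# Geometric ("exponential") REMD temperature ladder:
#   T_i = T_min * (T_max / T_min) ** (i / (n - 1)),  i = 0 .. n-1
# Endpoints (276 K to 515 K, 42 replicas) are taken from the post above.

def temperature_ladder(t_min, t_max, n):
    """Return n temperatures spaced geometrically from t_min to t_max."""
    ratio = t_max / t_min
    return [t_min * ratio ** (i / (n - 1)) for i in range(n)]

ladder = temperature_ladder(276.0, 515.0, 42)
print(" ".join("%.1f" % t for t in ladder))
```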

What's the network hardware? Can other machine load influence your 
network performance?

Are the systems in the NVT ensemble? Use diff to check that the .mdp 
files differ only in the ways you think they do.

What are the values of nstlist and nstcalcenergy?
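A quick way to compare those settings across replicas is to scan the .mdp files. A minimal sketch; the md*.mdp filename pattern is an assumption, so adapt the glob to your own naming scheme:

```python
# Report nstlist and nstcalcenergy from each replica's .mdp file.
import glob
import re

PARAMS = ("nstlist", "nstcalcenergy")

def read_params(path):
    """Parse 'key = value' lines from an .mdp file, ignoring ; comments."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.split(";", 1)[0]  # strip trailing comments
            m = re.match(r"\s*([\w-]+)\s*=\s*(\S+)", line)
            if m and m.group(1) in PARAMS:
                values[m.group(1)] = m.group(2)
    return values

for path in sorted(glob.glob("md*.mdp")):
    print(path, read_params(path))
```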

Take a look at the execution time breakdown at the end of the .log 
files, and do so for more than one replica. With the current 
implementation, every simulation has to synchronize and communicate 
every handful of steps, which means that large scale parallelism won't 
work efficiently unless you have fast network hardware that is dedicated 
to your job. This effect shows up in the "Rest" row of the time 
breakdown. With Infiniband, I'd expect you should only be losing about 
10% of the run time in total. The roughly 30-fold slowdown you see going 
from 24 to 42 replicas while keeping 4 CPUs/replica suggests some other 
contribution.
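To compare that overhead across replicas, the "Rest" lines can be pulled out of each log. The exact column layout of the cycle-accounting table varies between GROMACS versions, so this sketch only echoes the matching lines rather than parsing the percentages; the md*.log filename pattern is likewise an assumption.

```python
# Echo the "Rest" row of the timing breakdown from each replica's log.
import glob

def rest_lines(path):
    """Return lines from the log whose first word is 'Rest'."""
    with open(path) as fh:
        return [ln.rstrip() for ln in fh if ln.split()[:1] == ["Rest"]]

for path in sorted(glob.glob("md*.log")):
    for line in rest_lines(path):
        print(path, line)
```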
