[gmx-users] Why REMD simulation becomes so slow when the number of replicas becomes large?

Mark Abraham Mark.Abraham at anu.edu.au
Mon Feb 7 16:36:04 CET 2011


On 8/02/2011 1:48 AM, Qiong Zhang wrote:
>
> Hi Mark,
>
> Many thanks for your fast response!
>
> /What's the network hardware? Can other machine load influence your 
> network performance?/
>
> The supercomputer system is based on the Cray Gemini interconnect 
> technology. I suppose this is a fast network hardware...
>
>
> /Are the systems in the NVT ensemble? Use diff to check the .mdp files 
> differ only how you think they do./
>
> The systems are in the NPT ensemble. I saw some discussions on the 
> mailing list that the NPT ensemble is superior to the NVT ensemble for 
> REMD. And the .mdp files differ only in the temperature.
>

Maybe so, but under NPT the density varies with T, and so with replica. 
This means the size of neighbour lists varies, and the cost of the 
computation (PME or not) varies. The generalized ensemble is limited by 
the progress of the slowest replica. If using PME, in theory, you can 
juggle the contribution of the various terms to balance the computation 
load across the replicas, but this is not easy to do.
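
One reading of that juggling, sketched here purely as an illustration: with 
PME, the real-space cutoff and the mesh spacing can be scaled by the same 
factor without changing the overall accuracy, which shifts work between the 
pair interactions and the PME mesh. A denser (low-temperature) replica could, 
in principle, use something like

  rcoulomb       = 0.9    ; shorter cutoff: fewer pair interactions
  fourierspacing = 0.108  ; finer grid, scaled by the same 0.9 factor

while a high-temperature replica keeps, say, rcoulomb = 1.0 and 
fourierspacing = 0.12 (values made up for the example). Getting the 
per-replica balance right without perturbing the effective interactions is 
what makes this hard.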
>
> /What are the values of nstlist and nstcalcenergy?/
>
> Previously, nstlist=5, nstcalcenergy=1
>
> Thank you for pointing this out. I checked the manual again: this 
> option affects the performance of parallel simulations because 
> calculating energies requires global communication between all 
> processes. So I have set this option to -1 this time. This should be 
> one reason for the low parallel efficiency.
>
> And after I changed to nstcalcenergy=-1, I found a 3% improvement in 
> efficiency compared with nstcalcenergy=1.
>

Yep. nstpcouple and nsttcouple also influence this.
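
For reference, the relevant .mdp settings in 4.5.x look something like the 
following (a sketch of what was described above, not a recommendation):

  nstlist       = 5    ; neighbour-list update interval
  nstcalcenergy = -1   ; let mdrun choose; avoids a global energy
                       ; reduction (and synchronization) every step
  nsttcouple    = -1   ; T-coupling interval, also needs global communication
  nstpcouple    = -1   ; P-coupling interval, likewise

The fewer steps that require a global reduction across all processes, the 
less often the processes (and, with -multi, the replicas) have to wait on 
each other.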

> Take a look at the execution time breakdown at the end of the .log 
> files, and do so for more than one replica. With the current 
> implementation, every simulation has to synchronize and communicate 
> every handful of steps, which means that large scale parallelism won't 
> work efficiently unless you have fast network hardware that is 
> dedicated to your job. This effect shows up in the "Rest" row of the 
> time breakdown. With Infiniband, I'd expect you should only be losing 
> about 10% of the run time total. The 30-fold loss you have upon going 
> from 24->42 replicas keeping 4 CPUs/replica suggests some other 
> contribution, however.
>
> I checked the time breakdown in the log files for short REMD 
> simulations. For the REMD simulation with 168 cores for 42 replicas, 
> as you see below, the “Rest” accounts for a surprisingly high 96.6% of 
> the time for one of the replicas, and it is at almost the same level 
> for the other replicas. For the REMD simulation with 96 cores for 24 
> replicas, the “Rest” takes up about 24%. I was also aware of your post:
>
> http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html
>
> As you suggested, such a big loss should be ascribed to other factors. 
> Do you think the network hardware is to blame, or are there other 
> reasons? Any suggestion would be greatly appreciated.
>

I expect the load imbalance across replicas is partly to blame. Look at 
the sum of Force + PME mesh (in seconds) across the generalized 
ensemble. That's where the simulation work is all done, and I expect 
your low-temperature replicas are doing much more work than your 
high-temperature replicas. Unfortunately 4.5.3 doesn't allow the user to 
know enough detail here. Future versions of GROMACS will - work in progress.
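
A quick, rough way to collect that sum, assuming the per-replica logs follow 
the mdrun -multi default naming (md0.log, md1.log, ...):

  # sum the Seconds column of the Force and PME mesh rows in each log
  for f in md*.log; do
    printf '%s: ' "$f"
    grep -E '^ *(Force|PME mesh) +[0-9]' "$f" | \
      awk '{ sum += $(NF-1) } END { print sum, "s" }'
  done

If the low-temperature replicas show markedly larger totals than the 
high-temperature ones, that imbalance alone will stall every exchange 
attempt on the slowest replica.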

Strictly, though, your rate-limiting lowest-temperature replica in the 
24-replica regime should take an amount of time comparable to that of 
the lowest in the 42-replica regime (a 22 K difference is not that 
significant), and similar to the same run performed outside a 
replica-exchange simulation. Your reported data is not consistent with 
that, so I think your jobs are also experiencing differing degrees of 
network or filesystem contention at different times. Your sysadmins can 
comment on that.

Mark

> Computing:         Nodes  Number   G-Cycles  Seconds      %
> -----------------------------------------------------------------------
> Domain decomp.         4     442      2.604      1.2    0.0
> DD comm. load          4       6      0.001      0.0    0.0
> Comm. coord.           4    2201      1.145      0.5    0.0
> Neighbor search        4     442     14.964      7.1    0.2
> Force                  4    2201    175.303     83.5    2.0
> Wait + Comm. F         4    2201      1.245      0.6    0.0
> PME mesh               4    2201     30.314     14.4    0.3
> Write traj.            4      11     17.346      8.3    0.2
> Update                 4    2201      2.004      1.0    0.0
> Constraints            4    2201     26.593     12.7    0.3
> Comm. energies         4     442     28.722     13.7    0.3
> Rest                   4            8426.029   4012.4   96.6
> -----------------------------------------------------------------------
> Total                  4            8726.270   4155.4  100.0
>
>
>
> Qiong
>
> On 7/02/2011 9:52 PM, Qiong Zhang wrote:
>>
>> Dear all gmx-users,
>>
>> I have recently been testing REMD simulations. I was running 
>> simulations on a supercomputer system based on AMD Opteron 12-core 
>> (2.1 GHz) processors. GROMACS version 4.5.3 was used.
>>
>> I have a system of 5172 atoms, of which 138 atoms belong to the 
>> solute and the rest are water molecules. An exponential distribution 
>> of temperatures was generated, ranging from 276 to 515 K over a total 
>> of 42 replicas, or from 298 to 420 K over a total of 24 replicas, 
>> ensuring that the exchange ratio between all adjacent replicas is 
>> about 0.25. Replica exchanges were attempted every 0.5 ps. The 
>> integration time step was 2 fs.
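
An exponential (geometric) ladder of this kind is typically generated as 
T_i = T_min * (T_max / T_min)^(i/(N-1)) for i = 0, ..., N-1, so that the 
ratio between neighbouring temperatures is constant; for a roughly 
temperature-independent heat capacity this gives approximately uniform 
exchange probabilities, consistent with the ~0.25 ratio quoted above. (The 
actual script used to build this ladder is not shown in the thread.)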
>>
>> For the above system, when REMD is run over 24 replicas, the 
>> simulation speed is reasonably fast. However, when REMD is run over 
>> 42 replicas, the simulation speed is awfully slow. Please see the 
>> following table for the speed.
>>
>> ----------------------------------------------------------------------------
>> Replica number   CPU number   Speed
>> 24                96          58015 steps / 15 minutes
>> 42                42            865 steps / 15 minutes
>> 42                84           1175 steps / 15 minutes
>> 42               168           1875 steps / 15 minutes
>> 42               336           2855 steps / 15 minutes
>>
>> The command line for the mdrun is:
>>
>> aprun -n (CPU number here) mdrun_d -s md.tpr -multi (replica number 
>> here) -replex 250
>>
>> My questions are:
>>
>> 1) Why is REMD over 42 replicas so slow for the same system?
>>
>> 2) In which areas can I improve the efficiency?
>>
>
> What's the network hardware? Can other machine load influence your 
> network performance?
>
> Are the systems in the NVT ensemble? Use diff to check the .mdp files 
> differ only how you think they do.
>
> What are the values of nstlist and nstcalcenergy?
>
> Take a look at the execution time breakdown at the end of the .log 
> files, and do so for more than one replica. With the current 
> implementation, every simulation has to synchronize and communicate 
> every handful of steps, which means that large scale parallelism won't 
> work efficiently unless you have fast network hardware that is 
> dedicated to your job. This effect shows up in the "Rest" row of the 
> time breakdown. With Infiniband, I'd expect you should only be losing 
> about 10% of the run time total. The 30-fold loss you have upon going 
> from 24->42 replicas keeping 4 CPUs/replica suggests some other 
> contribution, however.
>
> Mark
>
>


