[gmx-users] Why REMD simulation becomes so slow when the number of replicas becomes large?
Qiong Zhang
qiongzhang928 at yahoo.com
Mon Feb 7 15:48:25 CET 2011
Hi Mark,
Many thanks for your fast response!
What's the network hardware? Can other machine load influence your network performance?

The supercomputer system is based on the Cray Gemini interconnect technology. I suppose this is fast network hardware...
Are the systems in the NVT ensemble? Use diff to check the .mdp files differ only how you think they do.

The systems are in the NPT ensemble. I saw some discussions on the mailing list that the NPT ensemble is superior to the NVT ensemble for REMD. The .mdp files differ only in the temperature.
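For what it's worth, I verified this with a plain diff between the lowest- and highest-temperature inputs; the file names below are only placeholders for my actual ones:

  # only the temperature-related lines (e.g. ref_t) should differ
  diff replica_276K.mdp replica_515K.mdp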
What are the values of nstlist and nstcalcenergy?

Previously, nstlist=5, nstcalcenergy=1.
Thank you for pointing this out. I checked the manual again: this option affects performance in parallel simulations because calculating energies requires global communication between all processes. So I have set this option to -1 this time. This is probably one reason for the low parallel efficiency.

After changing to nstcalcenergy=-1, I saw about a 3% improvement in efficiency compared with nstcalcenergy=1.
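For reference, that change amounts to a single .mdp line (shown here in isolation, not my full input file):

  nstcalcenergy = -1    ; compute energies only when actually needed,
                        ; instead of triggering a global sum every step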
Take a look at the execution time breakdown
at the end of the .log files, and do so for more than one replica. With the
current implementation, every simulation has to synchronize and communicate
every handful of steps, which means that large scale parallelism won't work
efficiently unless you have fast network hardware that
is dedicated to your job. This effect shows up in the "Rest" row of
the time breakdown. With Infiniband, I'd expect you should
only be losing about 10% of the run time total. The 30-fold loss you have upon
going from 24->42 replicas keeping 4 CPUs/replica suggests some other
contribution, however.
I checked the time breakdown in the log files for short REMD simulations. For the REMD simulation with 168 cores for 42 replicas, as you see below, the "Rest" makes up a surprisingly high 96.6% of the time for one of the replicas, and it is at about the same level for the other replicas. For the REMD simulation with 96 cores for 24 replicas, the "Rest" takes up about 24%. I was also aware of your post:
http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html
As you suggested, such a big loss should be ascribed to other factors. Do you think the network hardware is to blame, or could there be other reasons? Any suggestion would be greatly appreciated.
 Computing:        Nodes   Number    G-Cycles   Seconds      %
-----------------------------------------------------------------------
 Domain decomp.        4      442       2.604       1.2    0.0
 DD comm. load         4        6       0.001       0.0    0.0
 Comm. coord.          4     2201       1.145       0.5    0.0
 Neighbor search       4      442      14.964       7.1    0.2
 Force                 4     2201     175.303      83.5    2.0
 Wait + Comm. F        4     2201       1.245       0.6    0.0
 PME mesh              4     2201      30.314      14.4    0.3
 Write traj.           4       11      17.346       8.3    0.2
 Update                4     2201       2.004       1.0    0.0
 Constraints           4     2201      26.593      12.7    0.3
 Comm. energies        4      442      28.722      13.7    0.3
 Rest                  4             8426.029    4012.4   96.6
-----------------------------------------------------------------------
 Total                 4             8726.270    4155.4  100.0
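To compare this across replicas quickly, something along these lines works (a rough sketch; it assumes the per-replica logs are named md0.log, md1.log, ..., which is what -multi gives me in these runs):

  # print the "Rest" line from every replica's log
  for f in md*.log; do
      printf '%s: ' "$f"
      grep ' Rest ' "$f" | tail -n 1
  done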
Qiong
On 7/02/2011 9:52 PM, Qiong Zhang wrote:
Dear gmx-users,
I have recently been testing REMD simulations. I was running on a supercomputer system based on AMD Opteron 12-core (2.1 GHz) processors, using Gromacs version 4.5.3.
I have a system of 5172 atoms, of which 138 atoms belong to the solute and the rest are water molecules. An exponential distribution of temperatures was generated, ranging from 276 to 515 K over a total of 42 replicas, or from 298 to 420 K over a total of 24 replicas, ensuring that the exchange ratio between all adjacent replicas is about 0.25. Replica exchanges were attempted every 0.5 ps. The integration step size was 2 fs.
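(For reference, an exponential ladder of this kind typically follows T_i = T_min * (T_max / T_min)^(i / (N-1)) for i = 0 ... N-1, which for 276-515 K and N = 42 corresponds to a constant spacing of roughly 1.5% between neighbouring temperatures.)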
For the above system, when REMD is simulated over 24 replicas, the simulation speed is reasonably fast. However, when REMD is simulated over 42 replicas, the simulation speed is awfully slow. Please see the following table for the speeds.
----------------------------------------------------------------------------
Replica number   CPU number   Speed
----------------------------------------------------------------------------
            24           96   58015 steps / 15 minutes
            42           42     865 steps / 15 minutes
            42           84    1175 steps / 15 minutes
            42          168    1875 steps / 15 minutes
            42          336    2855 steps / 15 minutes
----------------------------------------------------------------------------
The command line for mdrun is:

aprun -n (CPU number here) mdrun_d -s md.tpr -multi (replica number here) -replex 250
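For example, the 42-replica run on 168 cores in the table above corresponds to:

aprun -n 168 mdrun_d -s md.tpr -multi 42 -replex 250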
My questions are:
1) Why is the REMD run with 42 replicas so much slower for the same system?
2) What can I do to improve the efficiency?
What's the network hardware? Can other machine load influence your
network performance?
Are the systems in the NVT ensemble? Use diff to check the .mdp
files differ only how you think they do.
What are the values of nstlist and nstcalcenergy?
Take a look at the execution time breakdown at the end of the .log
files, and do so for more than one replica. With the current
implementation, every simulation has to synchronize and communicate
every handful of steps, which means that large scale parallelism
won't work efficiently unless you have fast network hardware that is
dedicated to your job. This effect shows up in the "Rest" row of the
time breakdown. With Infiniband, I'd expect you should only be
losing about 10% of the run time total. The 30-fold loss you have
upon going from 24->42 replicas keeping 4 CPUs/replica suggests
some other contribution, however.
Mark