[gmx-users] Re:Re:Why REMD simulation becomes so slow when the number of replicas becomes large?
Qiong Zhang
qiongzhang928 at yahoo.com
Tue Feb 8 10:04:39 CET 2011
Hi Mark,
Your analyses are
quite reasonable. The low-temperature replicas are indeed doing much more work
than the high-temperature replicas. As you said, the lowest temperature replica
in the 24-replica should take an amount of time comparable to that of the
lowest in the 42-replica. So for my case, the load imbalance across replicas is
only partly to blame. Now I can exclude factors from the REMD parameters
themselves. I will ask the system admin for possible explanations.
May I ask, when you do REMD in the NVT ensemble, is it right that all
your replicas are running with the same volume as the
lowest-temperature replica? Or do you equilibrate each replica in the
NPT ensemble, then in the NVT ensemble, and then feed the
equilibrated structures into the NVT REMD simulations?
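For reference, the preparation I have in mind would look roughly like
this for one replica (a sketch only; the .mdp, structure and topology
file names are assumptions):

  # NPT equilibration of replica 0 at its own target temperature
  grompp -f npt_equil_0.mdp -c start.gro -p topol.top -o npt_equil_0.tpr
  mdrun -deffnm npt_equil_0
  # short NVT equilibration starting from the NPT-equilibrated structure
  grompp -f nvt_equil_0.mdp -c npt_equil_0.gro -p topol.top -o nvt_equil_0.tpr
  mdrun -deffnm nvt_equil_0
  # nvt_equil_0.gro would then be the starting structure for replica 0
  # in the NVT REMD run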
Thank you for all your helpful suggestions!
Qiong
Hi Mark,
Many thanks for your fast response!
What's the network hardware? Can other machine load influence your
network performance?
The supercomputer system is based on the Cray Gemini interconnect
technology. I suppose this is fast network hardware...
Are the systems in the NVT ensemble? Use diff to check that the .mdp
files differ only in the ways you think they do.
The systems are in the NPT ensemble. I saw some discussions on the
mailing list saying that the NPT ensemble is superior to the NVT
ensemble for REMD. And the .mdp files differ only in the temperature.
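For example, I compared adjacent replicas with something like this (a
sketch; the file names are only an example), and only the temperature
settings (ref_t, and gen_temp where used) differ:

  diff remd_0.mdp remd_1.mdp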
Maybe so, but under NPT the density varies with T, and so with
replica. This means the size of neighbour lists varies, and the cost
of the computation (PME or not) varies. The generalized ensemble is
limited by the progress of the slowest replica. If using PME, in
theory, you can juggle the contribution of the various terms to
balance the computation load across the replicas, but this is not
easy to do.
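One way to see the size of this effect (a sketch, assuming each
replica writes its own .edr file and 4.5-era tool names) is to
compare the average density that g_energy reports for a cold and a
hot replica:

  # average density of the coldest and the hottest replica
  echo Density | g_energy -f remd_0.edr  -o density_0.xvg
  echo Density | g_energy -f remd_41.edr -o density_41.xvg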
What are the values of nstlist and nstcalcenergy?
Previously, nstlist=5, nstcalcenergy=1
Thank you for pointing this out. I checked the manual again: this
option affects performance in parallel simulations because
calculating energies requires global communication between all
processes. So I have set this option to -1 this time. This should be
one reason for the low parallel efficiency.
And after I changed to nstcalcenergy=-1, I found there was a 3%
improvement in efficiency compared with nstcalcenergy=1.
Yep. nstpcouple and nsttcouple also influence this.
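For example, an .mdp fragment along these lines keeps all of the
intervals that trigger global communication in one place (the values
are only illustrative):

  nstlist        = 5     ; neighbour-list update interval
  nstcalcenergy  = -1    ; compute energies only when needed, as above
  nsttcouple     = -1    ; temperature-coupling interval (-1 = default)
  nstpcouple     = -1    ; pressure-coupling interval (-1 = default)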
Take a look at the execution time breakdown at the end of the .log
files, and do so for more than one replica. With the current
implementation, every simulation has to synchronize and communicate
every handful of steps, which means that large-scale parallelism
won't work efficiently unless you have fast network hardware that is
dedicated to your job. This effect shows up in the "Rest" row of the
time breakdown. With Infiniband, I'd expect you should only be losing
about 10% of the run time total. The 30-fold loss you have upon going
from 24->42 replicas keeping 4 CPUs/replica suggests some other
contribution, however.
I checked the time breakdown in the log files for short REMD
simulations. For the REMD simulation with 168 cores for 42 replicas,
as you see below, the "Rest" makes up a surprisingly high 96.6% of
the time for one of the replicas, and it is at almost the same level
for the other replicas. For the REMD simulation with 96 cores for 24
replicas, the "Rest" takes up about 24%.
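For reference, I collected the "Rest" rows from the replica logs with
something like this (the log file names are just an example):

  grep -H " Rest " md*.log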
24%. I was also aware of your post:
http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html
As you suggested, such a big loss should be ascribed to other
factors. Do you think the network hardware is to blame, or are there
other reasons? Any suggestion would be greatly appreciated.
I expect the load imbalance across replicas is partly to blame. Look
at the sum of Force + PME mesh (in seconds) across the generalized
ensemble. That's where the simulation work is all done, and I expect
your low-temperature replicas are doing much more work than your
high-temperature replicas. Unfortunately 4.5.3 doesn't allow the
user to know enough detail here. Future versions of GROMACS will -
work in progress.
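A quick way to tabulate that sum per replica is something like the
following (a sketch; it assumes one log file per replica and the
breakdown layout shown below):

  # sum the Seconds column of the Force and PME mesh rows in each log
  for log in md*.log; do
      grep -E "^ (Force|PME mesh)" "$log" | \
          awk -v f="$log" '{s += $(NF-1)} END {print f, s}'
  done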
Strictly, though, your rate-limiting lowest temperature replica in
the 24-replica regime should take an amount of time comparable to
that of the lowest in the 42-replica regime (22K difference is not
that significant) - and similar to a run other than as part of a
replica-exchange simulation. Your reported data is not consistent
with that, so I think your jobs are also experiencing differing
degrees of network or filesystem contention at different times. Your
sysadmins can comment on that.
Mark
 Computing:          Nodes   Number     G-Cycles    Seconds      %
-----------------------------------------------------------------------
 Domain decomp.          4      442        2.604        1.2     0.0
 DD comm. load           4        6        0.001        0.0     0.0
 Comm. coord.            4     2201        1.145        0.5     0.0
 Neighbor search         4      442       14.964        7.1     0.2
 Force                   4     2201      175.303       83.5     2.0
 Wait + Comm. F          4     2201        1.245        0.6     0.0
 PME mesh                4     2201       30.314       14.4     0.3
 Write traj.             4       11       17.346        8.3     0.2
 Update                  4     2201        2.004        1.0     0.0
 Constraints             4     2201       26.593       12.7     0.3
 Comm. energies          4      442       28.722       13.7     0.3
 Rest                    4             8426.029     4012.4     96.6
-----------------------------------------------------------------------
 Total                   4             8726.270     4155.4    100.0
Qiong