[gmx-users] Why REMD simulation becomes so slow when the number of replicas becomes large?

Qiong Zhang qiongzhang928 at yahoo.com
Mon Feb 7 15:48:25 CET 2011


Hi Mark,

Many thanks for your fast response!



    What's the network hardware? Can other machine load influence your network performance?

The supercomputer system is based on the Cray Gemini interconnect. I suppose this is fast network hardware...


    Are the systems in the NVT ensemble? Use diff to check the .mdp files differ only how you think they do.



The systems are in the NPT ensemble. I saw some discussions on the mailing list suggesting that the NPT ensemble is superior to the NVT ensemble for REMD. And the .mdp files differ only in the temperature.
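For example, a pairwise diff of two neighbouring replicas shows only the temperature line. (This is a toy illustration with made-up file names and contents, not my real inputs:)

```shell
# Two minimal .mdp fragments standing in for neighbouring replicas;
# only the reference temperature (ref_t) should differ between them.
printf 'nsteps = 500000\nref_t  = 276\n' > remd0.mdp
printf 'nsteps = 500000\nref_t  = 281\n' > remd1.mdp
# diff exits with status 1 when the files differ, hence the "|| true"
diff remd0.mdp remd1.mdp || true
```

Repeating this over every adjacent pair confirms that nothing besides the temperature drifts between replicas.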

 

    What are the values of nstlist and nstcalcenergy?

Previously, nstlist=5, nstcalcenergy=1



Thank you for pointing this out. I checked the manual again: this option affects performance in parallel simulations because calculating energies requires global communication between all processes. So I have set this option to -1 this time. This should be one reason for the low parallel efficiency.

And after I changed nstcalcenergy to -1, I found a 3% improvement in efficiency compared with nstcalcenergy=1.
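For reference, the relevant .mdp lines now read like this (a sketch; with nstcalcenergy = -1, mdrun computes energies only on steps where they are actually needed, e.g. for energy output and replica-exchange attempts):

```
nstlist       = 5     ; neighbour list updated every 5 steps, as before
nstcalcenergy = -1    ; compute energies only when required
```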

 

    Take a look at the execution time breakdown at the end of the .log files, and do so for more than one replica. With the current implementation, every simulation has to synchronize and communicate every handful of steps, which means that large scale parallelism won't work efficiently unless you have fast network hardware that is dedicated to your job. This effect shows up in the "Rest" row of the time breakdown. With Infiniband, I'd expect you should only be losing about 10% of the run time total. The 30-fold loss you have upon going from 24->42 replicas keeping 4 CPUs/replica suggests some other contribution, however.

 

I checked the time breakdown in the log files for short REMD simulations. For the REMD simulation with 168 cores for 42 replicas, as you see below, "Rest" makes up a surprisingly high 96.6% of the time for one of the replicas, and it is at about the same level for the other replicas. For the REMD simulation with 96 cores for 24 replicas, "Rest" takes up about 24%. I was also aware of your post:

http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html

As you suggested, such a big loss should be ascribed to other factors. Do you think the network hardware is to blame, or could there be other reasons? Any suggestion would be greatly appreciated.


 

 Computing:         Nodes     Number     G-Cycles    Seconds       %
-----------------------------------------------------------------------
 Domain decomp.         4        442        2.604        1.2     0.0
 DD comm. load          4          6        0.001        0.0     0.0
 Comm. coord.           4       2201        1.145        0.5     0.0
 Neighbor search        4        442       14.964        7.1     0.2
 Force                  4       2201      175.303       83.5     2.0
 Wait + Comm. F         4       2201        1.245        0.6     0.0
 PME mesh               4       2201       30.314       14.4     0.3
 Write traj.            4         11       17.346        8.3     0.2
 Update                 4       2201        2.004        1.0     0.0
 Constraints            4       2201       26.593       12.7     0.3
 Comm. energies         4        442       28.722       13.7     0.3
 Rest                   4                 8426.029     4012.4    96.6
-----------------------------------------------------------------------
 Total                  4                 8726.270     4155.4   100.0



Qiong

On 7/02/2011 9:52 PM, Qiong Zhang wrote:

    Dear gmx-users,

    I have recently been testing REMD simulations. I was running on a supercomputer system based on AMD Opteron 12-core (2.1 GHz) processors, with Gromacs version 4.5.3.

    I have a system of 5172 atoms, of which 138 atoms belong to the solute and the rest are water molecules. An exponential distribution of temperatures was generated, ranging from 276 to 515 K for a total of 42 replicas, or from 298 to 420 K for a total of 24 replicas, ensuring that the exchange ratio between all adjacent replicas is about 0.25. Replica exchange was attempted every 0.5 ps. The integration step size was 2 fs.
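    (For reference, such an exponential ladder follows the geometric rule T_i = Tmin * (Tmax/Tmin)^(i/(n-1)). The sketch below only illustrates that spacing; it is not the script actually used:)

```shell
# Geometric (exponential) temperature ladder from 276 K to 515 K over
# 42 replicas: T_i = Tmin * (Tmax/Tmin)^(i/(n-1)).
awk 'BEGIN {
    tmin = 276; tmax = 515; n = 42
    for (i = 0; i < n; i++)
        printf "%.2f\n", tmin * (tmax / tmin) ^ (i / (n - 1))
}'
```

    (Adjacent temperatures end up about 1.5% apart, which is what keeps the exchange ratio roughly constant along the ladder.)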
               
    For the above system, when REMD is simulated over 24 replicas, the simulation speed is reasonably fast. However, when REMD is simulated over 42 replicas, the simulation speed is awfully slow. Please see the following table for the speed.

    ----------------------------------------------------------------------------
    Replica number    CPU number    Speed
    24                 96           58015 steps/15 minutes
    42                 42             865 steps/15 minutes
    42                 84            1175 steps/15 minutes
    42                168            1875 steps/15 minutes
    42                336            2855 steps/15 minutes
              
               
    The command line for mdrun is:

    aprun -n (CPU number here) mdrun_d -s md.tpr -multi (replica number here) -replex 250

    My questions are:

    1) Why is REMD for the 42 replicas so slow for the same system?

    2) In what respects can I improve the efficiency, please?

    What's the network hardware? Can other machine load influence your
    network performance?

    

    Are the systems in the NVT ensemble? Use diff to check the .mdp
    files differ only how you think they do.

    

    What are the values of nstlist and nstcalcenergy?

    

    Take a look at the execution time breakdown at the end of the .log
    files, and do so for more than one replica. With the current
    implementation, every simulation has to synchronize and communicate
    every handful of steps, which means that large scale parallelism
    won't work efficiently unless you have fast network hardware that is
    dedicated to your job. This effect shows up in the "Rest" row of the
    time breakdown. With Infiniband, I'd expect you should only be
    losing about 10% of the run time total. The 30-fold loss you have
    upon going from 24->42 replicas keeping 4 CPUs/replica suggests
    some other contribution, however.

    

    Mark


      

