[gmx-users] Simulation time losses with REMD
Martyn Winn
martyn.winn at stfc.ac.uk
Fri Jan 28 14:13:37 CET 2011
On Fri, 2011-01-28 at 16:46 +1100, Mark Abraham wrote:
> Hi,
>
> I compared the .log file time accounting for same .tpr file run alone in
> serial or as part of an REMD simulation (with each replica on a single
> processor). It ran about 5-10% slower in the latter. The effect was a bit
> larger when comparing the same .tpr run on 8 processors against REMD with 8
> processors per replica. The effect seems fairly independent of whether I
> compare the lowest or highest replica.
>
> The system is 1 ns of Ace-(Ala)_10-NME in CHARMM27 with GROMACS 4.5.3
> using NVT, PME, virtual sites, 4 fs timesteps, rlist=rvdw=rcoulomb=1.0 nm,
> with REMD ranging over 20 replicas distributed exponentially from 298 K
> to 431.57 K using v-rescale T-coupling. The machine has two quad-core
> processors per node with an Infiniband connection. The Infiniband switch
> is shared with other users' calculations, so some load-based variability
> can and does occur, but that should have shown up in a named part of the
> time accounting.
>
> My first thought was that REMD exchange latency was to blame, so I
> quickly hacked in a change to report the time spent in the REMD
> initialization routine and in each call to the REMD exchange-attempt
> routine.
>
> Comparing the log-file time accounting for the lowest replica between
> the REMD and serial runs on a single processor, diff shows:
> Computing: Nodes Number G-Cycles Seconds %
> 7394,7403c6910,6918
> < Vsite constr. 1 250001 40.271 13.8 0.7
> < Neighbor search 1 25011 434.982 148.7 7.1
> < Force 1 250001 3607.375 1232.8 59.1
> < PME mesh 1 250001 1270.407 434.1 20.8
> < Vsite spread 1 500002 41.671 14.2 0.7
> < Write traj. 1 3 7.873 2.7 0.1
> < Update 1 250001 82.822 28.3 1.4
> < Constraints 1 250001 154.231 52.7 2.5
> < REMD 1 100 59.070 20.2 1.0
> < Rest 1 409.862 140.1 6.7
> ---
> > Vsite constr. 1 250001 40.526 13.8 0.7
> > Neighbor search 1 25001 434.871 148.6 7.5
> > Force 1 250001 3601.463 1230.8 62.2
> > PME mesh 1 250001 1292.675 441.8 22.3
> > Vsite spread 1 500002 41.479 14.2 0.7
> > Write traj. 1 3 17.153 5.9 0.3
> > Update 1 250001 82.114 28.1 1.4
> > Constraints 1 250001 154.426 52.8 2.7
> > Rest 1 122.023 41.7 2.1
> 7405c6920
> < Total 1 6108.562 2087.5 100.0
> ---
> > Total 1 5786.731 1977.5 100.0
>
> So "Rest" goes up from 122 s to 409 s under REMD, even after factoring
> out the 59 s actually spent in REMD. With the highest replica:
>
> Computing: Nodes Number G-Cycles Seconds %
> 7394,7403c6910,6918
> < Vsite constr. 1 250001 40.261 13.8 0.7
> < Neighbor search 1 25016 434.878 148.6 7.1
> < Force 1 250001 3606.913 1232.6 59.0
> < PME mesh 1 250001 1264.716 432.2 20.7
> < Vsite spread 1 500002 41.268 14.1 0.7
> < Write traj. 1 3 7.113 2.4 0.1
> < Update 1 250001 82.491 28.2 1.4
> < Constraints 1 250001 153.207 52.4 2.5
> < REMD 1 100 60.272 20.6 1.0
> < Rest 1 417.399 142.6 6.8
> ---
> > Vsite constr. 1 250001 40.518 13.8 0.7
> > Neighbor search 1 25001 435.069 148.7 7.6
> > Force 1 250001 3609.196 1233.4 62.6
> > PME mesh 1 250001 1283.082 438.5 22.3
> > Vsite spread 1 500002 41.825 14.3 0.7
> > Write traj. 1 3 13.063 4.5 0.2
> > Update 1 250001 82.011 28.0 1.4
> > Constraints 1 250001 154.350 52.7 2.7
> > Rest 1 102.249 34.9 1.8
> 7405c6920
> < Total 1 6108.520 2087.5 100.0
> ---
> > Total 1 5761.363 1968.8 100.0
>
> Here "Rest" grows from 102 to 417 G-Cycles (35 s to 143 s) despite
> factoring out the 60 G-Cycles (21 s) for REMD. So the time spent doing
> the exchange itself is just noticeable, but quite a bit less than the
> observed increase in total time.
>
> For the lowest replica in parallel:
>
> 8481,8496c7971,7985
> < Domain decomp. 8 25010 152.338 52.1 1.8
> < DD comm. load 8 24226 1.085 0.4 0.0
> < DD comm. bounds 8 24219 4.167 1.4 0.0
> < Vsite constr. 8 250001 62.857 21.5 0.8
> < Comm. coord. 8 250001 132.068 45.1 1.6
> < Neighbor search 8 25010 367.001 125.4 4.4
> < Force 8 250001 3446.528 1177.8 41.2
> < Wait + Comm. F 8 250001 252.245 86.2 3.0
> < PME mesh 8 250001 2113.009 722.1 25.3
> < Vsite spread 8 500002 102.749 35.1 1.2
> < Write traj. 8 1 1.206 0.4 0.0
> < Update 8 250001 85.793 29.3 1.0
> < Constraints 8 250001 464.294 158.7 5.5
> < Comm. energies 8 250002 73.343 25.1 0.9
> < REMD 8 100 162.661 55.6 1.9
> < Rest 8 945.642 323.2 11.3
> ---
> > Domain decomp. 8 25001 146.561 50.1 2.0
> > DD comm. load 8 22943 0.989 0.3 0.0
> > DD comm. bounds 8 22901 3.768 1.3 0.1
> > Vsite constr. 8 250001 64.035 21.9 0.9
> > Comm. coord. 8 250001 124.487 42.5 1.7
> > Neighbor search 8 25001 367.342 125.5 5.0
> > Force 8 250001 3443.161 1176.7 46.9
> > Wait + Comm. F 8 250001 237.697 81.2 3.2
> > PME mesh 8 250001 2119.205 724.2 28.9
> > Vsite spread 8 500002 95.092 32.5 1.3
> > Write traj. 8 1 0.920 0.3 0.0
> > Update 8 250001 85.529 29.2 1.2
> > Constraints 8 250001 391.469 133.8 5.3
> > Comm. energies 8 250002 120.291 41.1 1.6
> > Rest 8 139.127 47.5 1.9
> 8498c7987
> < Total 8 8366.984 2859.3 100.0
> ---
> > Total 8 7339.674 2508.3 100.0
>
> Again the REMD exchanges are only a small fraction of the increase
> ("Rest" goes from 139 to 946 G-Cycles, i.e. 48 s to 323 s, despite the
> 163 G-Cycles, 56 s, accounted for under REMD).
>
> Does anyone have a theory on what could be causing this?
>
> Mark
>
No theory, but some more data.
I've been running REMD on a fairly large system, with 48 replicas
between 300 K and 400 K. I have runs using Gromacs 4.5.3 with 2, 4 or 16
processors per replica. As a general statement, it all seems to scale
fine, with no great delays from the RE.
However, I did do some quick timing checks for the 2-processors-per-replica
case. I simply hacked in a few timing statements, so nothing as polished
as your hack :)
An average MD step takes about 0.3 s. The time spent in the replica
exchange attempt (which I took to be the time in the call to
replica_exchange() from md.c) was around 0.003 s, i.e. about 1% of a
step on an RE cycle. Given that I only attempt an exchange every 1000
steps, I took this to be negligible.
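
In case it's useful, the sort of thing I mean is sketched below (an
illustration only, not the actual code I dropped into md.c): just a
wall-clock timer such as MPI_Wtime() around the call. try_exchange()
here is a made-up stand-in for the real replica_exchange() call, whose
argument list I've left out.

#include <mpi.h>
#include <stdio.h>

/* Stand-in for replica_exchange(); the real function called from md.c
 * takes many more arguments. */
static int try_exchange(int step)
{
    return (step % 1000) == 0;
}

int main(int argc, char *argv[])
{
    int rank, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 1; step <= 5000; step++)
    {
        double t0         = MPI_Wtime();        /* start of the hacked timer  */
        int    bExchanged = try_exchange(step); /* the exchange attempt       */
        double dt         = MPI_Wtime() - t0;   /* elapsed wall-clock seconds */

        if (rank == 0 && bExchanged)
        {
            printf("exchange attempt at step %d took %.6f s\n", step, dt);
        }
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with mpirun, each "exchange attempt" gets
its own wall-clock figure; that is all my numbers above are.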
The only odd thing I saw was that on an RE cycle it appears to spend
0.6 s in do_force(), which is twice the average MD step time. I didn't
print this out for non-RE cycles, so no sanity check, I'm afraid.
For time lost in REMD, I guess the issue is the point at which the
replicas get synchronised. There seems to be an MPI_Allreduce called as
part of get_replica_exchange() (when it collects the potential
energies), which is within my timings, but I am not sure if there is
anything else.
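
To make the synchronisation point concrete, here is a toy example of
the effect (illustrative only: it reduces a single double over
MPI_COMM_WORLD, whereas the real code gathers the energies of all the
replicas over GROMACS's own communicators). Ranks doing uneven amounts
of work all meet in an MPI_Allreduce, and the fast ranks simply sit in
the collective until the slowest one arrives, so that is where I would
expect the lost time to pile up.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int    rank, nranks;
    double epot, epot_sum, t0, t_wait;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Fake an uneven per-step cost: higher "replicas" take longer. */
    usleep(100000 * (unsigned)rank);

    /* This replica's potential energy (made-up number). */
    epot = -1000.0 - rank;

    /* Every rank must enter the collective before any rank can leave
     * it, so faster replicas wait here for the slowest one. */
    t0 = MPI_Wtime();
    MPI_Allreduce(&epot, &epot_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    t_wait = MPI_Wtime() - t0;

    printf("rank %d of %d waited %.3f s in MPI_Allreduce\n",
           rank, nranks, t_wait);

    MPI_Finalize();
    return 0;
}

Run over a few ranks, rank 0 reports roughly (nranks - 1) * 0.1 s of
waiting while the last rank reports next to nothing, even though the
reduction itself is trivial.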
Sorry that these figures are a bit rough and ready, but they do seem to
support your finding that the calls to REMD aren't to blame.
Cheers
Martyn
--
***********************************************************************
* *
* Dr. Martyn Winn *
* *
* STFC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, U.K. *
* Tel: +44 1925 603455 E-mail: martyn.winn at stfc.ac.uk *
* Fax: +44 1925 603634 Skype name: martyn.winn *
* URL: http://www.ccp4.ac.uk/martyn/ *
***********************************************************************