[gmx-users] Simulation time losses with REMD
Martyn Winn
martyn.winn at stfc.ac.uk
Fri Jan 28 14:13:37 CET 2011
On Fri, 2011-01-28 at 16:46 +1100, Mark Abraham wrote:
> Hi,
>
> I compared the .log file time accounting for same .tpr file run alone in
> serial or as part of an REMD simulation (with each replica on a single
> processor). It ran about 5-10% slower in the latter. The effect was a bit
> larger when comparing the same .tpr run on 8 processors against REMD with 8
> processors per replica. The effect seems fairly independent of whether I
> compare the lowest or highest replica.
>
> The system is 1 ns of Ace-(Ala)_10-NME in CHARMM27 with GROMACS 4.5.3
> using NVT, PME, virtual sites, 4 fs timesteps, rlist=rvdw=rcoulomb=1.0 nm,
> with REMD ranging over 20 replicas distributed exponentially from 298 K
> to 431.57 K using v-rescale T-coupling. The machine has two quad-core
> processors per node with an Infiniband connection. The Infiniband switch
> is shared with other users' calculations, so some load-based variability
> can and does occur, but that should have shown up in a named part of the
> time accounting.
>
> My first thought was that REMD exchange latency was to blame, so I
> quickly hacked in a change to report the time spent in the REMD
> initialization routine and in each call to the REMD exchange-attempt
> routine.
>
> Comparing the log-file time accounting for the lowest replica between
> the REMD and serial runs on a single processor, diff shows:
> Computing: Nodes Number G-Cycles Seconds %
> 7394,7403c6910,6918
> < Vsite constr. 1 250001 40.271 13.8 0.7
> < Neighbor search 1 25011 434.982 148.7 7.1
> < Force 1 250001 3607.375 1232.8 59.1
> < PME mesh 1 250001 1270.407 434.1 20.8
> < Vsite spread 1 500002 41.671 14.2 0.7
> < Write traj. 1 3 7.873 2.7 0.1
> < Update 1 250001 82.822 28.3 1.4
> < Constraints 1 250001 154.231 52.7 2.5
> < REMD 1 100 59.070 20.2 1.0
> < Rest 1 409.862 140.1 6.7
> ---
> > Vsite constr. 1 250001 40.526 13.8 0.7
> > Neighbor search 1 25001 434.871 148.6 7.5
> > Force 1 250001 3601.463 1230.8 62.2
> > PME mesh 1 250001 1292.675 441.8 22.3
> > Vsite spread 1 500002 41.479 14.2 0.7
> > Write traj. 1 3 17.153 5.9 0.3
> > Update 1 250001 82.114 28.1 1.4
> > Constraints 1 250001 154.426 52.8 2.7
> > Rest 1 122.023 41.7 2.1
> 7405c6920
> < Total 1 6108.562 2087.5 100.0
> ---
> > Total 1 5786.731 1977.5 100.0
>
> So "Rest" goes up from 122 s to 409 s under REMD, even after factoring
> out the 59 s actually spent in REMD. With the highest replica:
>
> Computing: Nodes Number G-Cycles Seconds %
> 7394,7403c6910,6918
> < Vsite constr. 1 250001 40.261 13.8 0.7
> < Neighbor search 1 25016 434.878 148.6 7.1
> < Force 1 250001 3606.913 1232.6 59.0
> < PME mesh 1 250001 1264.716 432.2 20.7
> < Vsite spread 1 500002 41.268 14.1 0.7
> < Write traj. 1 3 7.113 2.4 0.1
> < Update 1 250001 82.491 28.2 1.4
> < Constraints 1 250001 153.207 52.4 2.5
> < REMD 1 100 60.272 20.6 1.0
> < Rest 1 417.399 142.6 6.8
> ---
> > Vsite constr. 1 250001 40.518 13.8 0.7
> > Neighbor search 1 25001 435.069 148.7 7.6
> > Force 1 250001 3609.196 1233.4 62.6
> > PME mesh 1 250001 1283.082 438.5 22.3
> > Vsite spread 1 500002 41.825 14.3 0.7
> > Write traj. 1 3 13.063 4.5 0.2
> > Update 1 250001 82.011 28.0 1.4
> > Constraints 1 250001 154.350 52.7 2.7
> > Rest 1 102.249 34.9 1.8
> 7405c6920
> < Total 1 6108.520 2087.5 100.0
> ---
> > Total 1 5761.363 1968.8 100.0
>
> Here "Rest" grows from 102 to 417 G-Cycles (35 s to 143 s) despite
> factoring out the 60 G-Cycles (21 s) for REMD. So the time spent doing
> the exchange itself is just noticeable, but quite a bit less than the
> observed increase in total time.
>
> For the lowest replica in parallel:
>
> 8481,8496c7971,7985
> < Domain decomp. 8 25010 152.338 52.1 1.8
> < DD comm. load 8 24226 1.085 0.4 0.0
> < DD comm. bounds 8 24219 4.167 1.4 0.0
> < Vsite constr. 8 250001 62.857 21.5 0.8
> < Comm. coord. 8 250001 132.068 45.1 1.6
> < Neighbor search 8 25010 367.001 125.4 4.4
> < Force 8 250001 3446.528 1177.8 41.2
> < Wait + Comm. F 8 250001 252.245 86.2 3.0
> < PME mesh 8 250001 2113.009 722.1 25.3
> < Vsite spread 8 500002 102.749 35.1 1.2
> < Write traj. 8 1 1.206 0.4 0.0
> < Update 8 250001 85.793 29.3 1.0
> < Constraints 8 250001 464.294 158.7 5.5
> < Comm. energies 8 250002 73.343 25.1 0.9
> < REMD 8 100 162.661 55.6 1.9
> < Rest 8 945.642 323.2 11.3
> ---
> > Domain decomp. 8 25001 146.561 50.1 2.0
> > DD comm. load 8 22943 0.989 0.3 0.0
> > DD comm. bounds 8 22901 3.768 1.3 0.1
> > Vsite constr. 8 250001 64.035 21.9 0.9
> > Comm. coord. 8 250001 124.487 42.5 1.7
> > Neighbor search 8 25001 367.342 125.5 5.0
> > Force 8 250001 3443.161 1176.7 46.9
> > Wait + Comm. F 8 250001 237.697 81.2 3.2
> > PME mesh 8 250001 2119.205 724.2 28.9
> > Vsite spread 8 500002 95.092 32.5 1.3
> > Write traj. 8 1 0.920 0.3 0.0
> > Update 8 250001 85.529 29.2 1.2
> > Constraints 8 250001 391.469 133.8 5.3
> > Comm. energies 8 250002 120.291 41.1 1.6
> > Rest 8 139.127 47.5 1.9
> 8498c7987
> < Total 8 8366.984 2859.3 100.0
> ---
> > Total 8 7339.674 2508.3 100.0
>
> Again the REMD exchanges are only a small fraction of the increase
> ("Rest" goes from 139 to 946 G-Cycles, i.e. 48 s to 323 s, despite the
> 163 G-Cycles, 56 s, accounted for under REMD).
>
> Does anyone have a theory on what could be causing this?
>
> Mark
>
No theory, but some more data.
I've been running REMD on a fairly large system, with 48 replicas
between 300 K and 400 K. I have runs using Gromacs 4.5.3 with 2, 4 or 16
processors per replica. As a general statement, it all seems to scale
fine, with no great delays from the RE.
However, I did do some quick timing checks for the 2-processors-per-replica
case. I simply hacked in a few timing statements, so nothing as polished
as your hack :)
An average MD step takes about 0.3 s. The time spent in the replica
exchange attempt (which I took to be the time in the call to
replica_exchange() from md.c) was around 0.003 s, i.e. about 1% of a
step on an RE cycle. Given that I only attempt an exchange every 1000
steps, I took this to be negligible.
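
In case it's useful, the sort of thing I mean is sketched below (an
illustration only, not the actual code I dropped into md.c): just a
wall-clock timer such as MPI_Wtime() around the call. try_exchange()
here is a made-up stand-in for the real replica_exchange() call, whose
argument list I've left out.

#include <mpi.h>
#include <stdio.h>

/* Stand-in for replica_exchange(); the real function called from md.c
 * takes many more arguments. */
static int try_exchange(int step)
{
    return (step % 1000) == 0;
}

int main(int argc, char *argv[])
{
    int rank, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 1; step <= 5000; step++)
    {
        double t0         = MPI_Wtime();        /* start of the hacked timer  */
        int    bExchanged = try_exchange(step); /* the exchange attempt       */
        double dt         = MPI_Wtime() - t0;   /* elapsed wall-clock seconds */

        if (rank == 0 && bExchanged)
        {
            printf("exchange attempt at step %d took %.6f s\n", step, dt);
        }
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with mpirun, each "exchange attempt" gets
its own wall-clock figure; that is all my numbers above are.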
The only odd thing I saw was that on an RE cycle it appears to spend
0.6 s in do_force(), which is twice the average MD step time. I didn't
print this out for non-RE cycles, so no sanity check, I'm afraid.
For time lost in REMD, I guess the issue is the point at which the
replicas get synchronised. There seems to be an MPI_Allreduce called as
part of get_replica_exchange() (when it collects the potential
energies), which is within my timings, but I am not sure if there is
anything else.
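
To make the synchronisation point concrete, here is a toy example of
the effect (illustrative only: it reduces a single double over
MPI_COMM_WORLD, whereas the real code gathers the energies of all the
replicas over GROMACS's own communicators). Ranks doing uneven amounts
of work all meet in an MPI_Allreduce, and the fast ranks simply sit in
the collective until the slowest one arrives, so that is where I would
expect the lost time to pile up.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int    rank, nranks;
    double epot, epot_sum, t0, t_wait;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Fake an uneven per-step cost: higher "replicas" take longer. */
    usleep(100000 * (unsigned)rank);

    /* This replica's potential energy (made-up number). */
    epot = -1000.0 - rank;

    /* Every rank must enter the collective before any rank can leave
     * it, so faster replicas wait here for the slowest one. */
    t0 = MPI_Wtime();
    MPI_Allreduce(&epot, &epot_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    t_wait = MPI_Wtime() - t0;

    printf("rank %d of %d waited %.3f s in MPI_Allreduce\n",
           rank, nranks, t_wait);

    MPI_Finalize();
    return 0;
}

Run over a few ranks, rank 0 reports roughly (nranks - 1) * 0.1 s of
waiting while the last rank reports next to nothing, even though the
reduction itself is trivial.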
Sorry that these figures are a bit rough and ready, but they do seem to
support your finding that the calls to REMD aren't to blame.
Cheers
Martyn
--
***********************************************************************
* *
* Dr. Martyn Winn *
* *
* STFC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, U.K. *
* Tel: +44 1925 603455 E-mail: martyn.winn at stfc.ac.uk *
* Fax: +44 1925 603634 Skype name: martyn.winn *
* URL: http://www.ccp4.ac.uk/martyn/ *
***********************************************************************