[gmx-users] Re: Question about scaling

Tue Nov 13 13:55:00 CET 2012

On 13.11.2012 06:16, gmx-users-request at gromacs.org wrote:
>> Dear all,
>> >I did some scaling tests on a cluster and I'm a bit puzzled by the results.
>> >So first the setup:
>> >
>> >Cluster:
>> >Saxonid 6100, Opteron 6272 16C 2.100GHz, Infiniband QDR
>> >GROMACS version: 4.0.7 and 4.5.5
>> >Compiler: 	GCC 4.7.0
>> >MPI: Intel MPI 4.0.3.008
>> >FFT-library: ACML 5.1.0 fma4
>> >
>> >System:
>> >895 SPC/E water molecules
> This is a rather small system, I would say.
>
>> >Simulation time: 750 ps (0.002 ps timestep)
>> >Cut-off: 1.0 nm
>> >with long-range corrections (DispCorr = EnerPres; PME with standard settings, but in no case an extra CPU dedicated solely to PME)
>> >V-rescale thermostat and Parrinello-Rahman barostat
>> >
>> >I get the following timings (in seconds), where each time is calculated as the time that would be needed on 1 CPU (so if a job on 2 CPUs took X s, the time is 2 * X s).
>> >These timings were taken from the *.log file, at the end of the
>> >'real cycle and time accounting' - section.
>> >
>> >Timings:
>> >gmx-version   1cpu   2cpu   4cpu
>> >4.0.7         4223   3384   3540
>> >4.5.5         3780   3255   2878
> Do you mean CPUs or CPU cores? Are you using the IB network or are you running single-node?

I meant the number of cores, and all cores are on the same node.

>
>> >
>> >I'm a little bit clueless about the results. I always thought that if I have a non-interacting system and double the number of CPUs, I
> You do use PME, which means a global interaction of all charges.
>
>> >would get a simulation which takes only half the time (so the times as defined above would be equal). If the system does have interactions, I would lose some performance due to communication. Due to node imbalance there could be a further loss of performance.
>> >
>> >Keeping this in mind, I can only explain the timings for version 4.0.7 going from 2cpu to 4cpu (2cpu is a little bit faster, since going to 4cpu leads to more communication and thus a loss of performance).
>> >
>> >All the other timings I do not understand, especially that 1cpu takes longer than the other cases in both versions.
>> >Probably the system is too small and/or the simulation time is too short for a scaling test. But I would assume that the time needed to set up the simulation is the same for all three cases of one GROMACS version.
>> >The only other explanation that comes to my mind is that something went wrong during the installation of the programs.
> You might want to take a closer look at the timings in the md.log output files, this will
> give you a clue where the bottleneck is, and also tell you about the communication-computation
> ratio.
>
> Best,
>    Carsten
>
>
>> >
>> >Please, can somebody enlighten me?
>> >

Here are the timings from the log file (these are the 4.5.5 runs; the totals match the 4.5.5 row above):

#cores:                  1       2       4       trend   (all cores are on the same node)
  Computing:
--------------------
  Domain decomp.                 41.7    47.8    up
  DD comm. load                  0.0     0.0     -
  Comm. coord.                   17.8    30.5    up
  Neighbor search        614.1   355.4   323.7   down
  Force                  2401.6  1968.7  1676.0  down
  Wait + Comm. F                 15.1    31.4    up
  PME mesh               596.3   710.4   639.1   -
  Write traj.            1.2     0.8     0.6     down
  Update                 49.7    44.0    37.6    down
  Constraints            79.3    70.4    60.0    down
  Comm. energies                 3.2     5.3     up
  Rest                   38.3    27.1    25.4    down
--------------------
  Total                  3780.5  3254.6  2877.5  down
--------------------
--------------------
  PME redist. X/F                133.0   120.5   down
  PME spread/gather      511.3   465.7   396.8   down
  PME 3D-FFT             59.4    88.9    102.2   up
  PME solve              25.2    22.2    18.9    down
--------------------
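
To put numbers on the scaling, here is a minimal sketch (plain Python, nothing from GROMACS itself) that converts the "Total" row into wall time, speedup and parallel efficiency. It assumes, as defined earlier in this thread, that each logged value is summed over all cores (wall time * number of cores); the function name `scaling` is just a label chosen for this sketch.

```python
# Totals from the "Total" row above, in seconds.
# Assumption: each value is summed over all cores (wall time * cores),
# as defined earlier in this thread.
core_seconds = {1: 3780.5, 2: 3254.6, 4: 2877.5}


def scaling(core_seconds):
    """Return {ncores: (wall_time, speedup, efficiency)} relative to 1 core."""
    t1 = core_seconds[1]
    result = {}
    for n, summed in sorted(core_seconds.items()):
        wall = summed / n          # wall-clock seconds of the run
        speedup = t1 / wall        # how many times faster than 1 core
        efficiency = speedup / n   # 1.0 means ideal linear scaling
        result[n] = (wall, speedup, efficiency)
    return result


for n, (wall, speedup, eff) in scaling(core_seconds).items():
    print(f"{n} cores: wall {wall:7.1f} s, speedup {speedup:4.2f}, efficiency {eff:4.2f}")
```

Under that assumption the efficiency comes out at about 1.16 on 2 cores and 1.31 on 4 cores, i.e. above 1.0, which is exactly the part I do not understand.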

The two parts of the calculation for which the most time is saved by going
parallel are:
1) Force
2) Neighbor search (OK, going from 2 cores to 4 cores does not make a big
difference, but going from 1 core to 2 or 4 saves a lot of time)
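
The ranking above can be checked with a small sketch like the following (plain Python, not a GROMACS tool), which compares the 1-core and 4-core columns row by row. One assumption is made loudly here: rows that have no 1-core entry in the table are taken not to occur in a serial run and are entered as 0.0.

```python
# Rows from the timing table: (1 core, 4 cores), in seconds.
# Assumption: rows with no 1-core entry do not occur in a serial run,
# so they are entered as 0.0 here.
rows = {
    "Domain decomp.":  (0.0, 47.8),
    "DD comm. load":   (0.0, 0.0),
    "Comm. coord.":    (0.0, 30.5),
    "Neighbor search": (614.1, 323.7),
    "Force":           (2401.6, 1676.0),
    "Wait + Comm. F":  (0.0, 31.4),
    "PME mesh":        (596.3, 639.1),
    "Write traj.":     (1.2, 0.6),
    "Update":          (49.7, 37.6),
    "Constraints":     (79.3, 60.0),
    "Comm. energies":  (0.0, 5.3),
    "Rest":            (38.3, 25.4),
}


def savings(rows):
    """Return (name, seconds saved from 1 to 4 cores), largest saving first."""
    saved = [(name, one - four) for name, (one, four) in rows.items()]
    return sorted(saved, key=lambda item: item[1], reverse=True)


ranked = savings(rows)
for name, delta in ranked:
    print(f"{name:16s} {delta:8.1f}")
print(f"{'Sum':16s} {sum(d for _, d in ranked):8.1f}")
```

Force and Neighbor search come out on top, and the per-row differences add up to roughly the 903 s difference between the 1-core and 4-core totals, so these two rows account for essentially all of the gain.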

Is there any good explanation for this time saving?
I would have thought that the system has a fixed number of interactions and
one has to calculate all of them. If I divide the set into 2 or
4 smaller sets, the number of interactions shouldn't change, and so the
calculation time shouldn't change either?

Or is there something fancy in the algorithm that reduces the time spent
accessing the arrays when the calculation covers a smaller set of
interactions?