[gmx-users] Re: Question about scaling
Thomas Schlesier
schlesi at uni-mainz.de
Tue Nov 13 14:22:41 CET 2012
Sorry for reposting, but I forgot one comment and have added it below:
On 13.11.2012 06:16, gmx-users-request at gromacs.org wrote:
>> Dear all,
>> >I did some scaling tests on a cluster and I'm a little bit clueless
>> >about the results.
>> >So first the setup:
>> >
>> >Cluster:
>> >Saxonid 6100, Opteron 6272 16C 2.100GHz, Infiniband QDR
>> >GROMACS version: 4.0.7 and 4.5.5
>> >Compiler: GCC 4.7.0
>> >MPI: Intel MPI 4.0.3.008
>> >FFT-library: ACML 5.1.0 fma4
>> >
>> >System:
>> >895 spce water molecules
> This is a somewhat small system, I would say.
>
>> >Simulation time: 750 ps (0.002 ps timestep)
>> >Cut-off: 1.0 nm
>> >but with long-range corrections (DispCorr = EnerPres; PME with standard
>> >settings - in each case no CPU dedicated solely to PME)
>> >V-rescale thermostat and Parrinello-Rahman barostat
>> >
>> >I get the following timings (in seconds), where each value is given as
>> >the equivalent single-CPU time (so if a job on 2 CPUs took X s, the
>> >listed time is 2 * X s).
>> >These timings were taken from the *.log file, at the end of the
>> >'Real cycle and time accounting' section.
>> >
>> >Timings:
>> >gmx-version   1 CPU   2 CPUs   4 CPUs
>> >4.0.7          4223     3384     3540
>> >4.5.5          3780     3255     2878
> Do you mean CPUs or CPU cores? Are you using the IB network or are you
> running single-node?
I meant the number of cores, and all cores are on the same node.
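Just to quantify what puzzles me, here is a small Python sketch (my own
illustration; the numbers are copied from the table above) that converts
the equivalent single-CPU times back to wall-clock times and computes
speedup and parallel efficiency:

# Sketch: convert the timings above (given as n_cores * wall time, in s)
# back to wall-clock times and compute speedup and parallel efficiency.
equiv_1cpu = {                   # {version: {n_cores: n_cores * wall_time}}
    "4.0.7": {1: 4223.0, 2: 3384.0, 4: 3540.0},
    "4.5.5": {1: 3780.0, 2: 3255.0, 4: 2878.0},
}
for version, times in equiv_1cpu.items():
    wall = {n: t / n for n, t in times.items()}    # actual wall-clock times
    for n in (2, 4):
        speedup = wall[1] / wall[n]
        efficiency = speedup / n                   # = times[1] / times[n]
        print("GMX %s, %d cores: speedup %.2f, efficiency %.2f"
              % (version, n, speedup, efficiency))

For 4.5.5 this gives an efficiency of about 1.16 on 2 cores and 1.31 on 4
cores, i.e. superlinear scaling, which is exactly what I cannot explain.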
>
>> >
>> >I'm a little bit clueless about the results. I always thought that if
>> >I have a non-interacting system and double the number of CPUs, I
> You do use PME, which means a global interaction of all charges.
>
>> >would get a simulation that takes only half the time (so the times as
>> >defined above would be equal). If the system does have interactions, I
>> >would lose some performance due to communication, and due to node
>> >imbalance there could be a further loss of performance.
>> >
>> >Keeping this in mind, I can only explain the 4.0.7 timings for the step
>> >from 2 to 4 CPUs (2 CPUs are a little faster, since going to 4 CPUs
>> >leads to more communication -> loss of performance).
>> >
>> >All the other timings, especially that the 1-CPU run takes longer in
>> >each case than the parallel runs, I do not understand.
>> >Probably the system is too small and/or the simulation time is too short
>> >for a scaling test. But I would assume that the time needed to set up
>> >the simulation would be equal for all three cases of one GROMACS version.
>> >The only other explanation that comes to my mind would be that something
>> >went wrong during the installation of the programs?
> You might want to take a closer look at the timings in the md.log output
> files; this will give you a clue where the bottleneck is, and also tell
> you about the communication-to-computation ratio.
>
> Best,
> Carsten
>
>
>> >
>> >Please, can somebody enlighten me?
>> >
Here are the timings from the log-file (for GMX 4.5.5):
#cores:                    1         2         4    (all cores on the same node)
(entries missing in the 1-core column are for work that only occurs in parallel runs)
Computing:                                           trend
----------------------------------------------------------
Domain decomp.                    41.7      47.8     up
DD comm. load                      0.0       0.0     -
Comm. coord.                      17.8      30.5     up
Neighbor search        614.1     355.4     323.7     down
Force                 2401.6    1968.7    1676.0     down
Wait + Comm. F                    15.1      31.4     up
PME mesh               596.3     710.4     639.1     -
Write traj.              1.2       0.8       0.6     down
Update                  49.7      44.0      37.6     down
Constraints             79.3      70.4      60.0     down
Comm. energies                     3.2       5.3     up
Rest                    38.3      27.1      25.4     down
----------------------------------------------------------
Total                 3780.5    3254.6    2877.5     down
----------------------------------------------------------
PME redist. X/F                  133.0     120.5     down
PME spread/gather      511.3     465.7     396.8     down
PME 3D-FFT              59.4      88.9     102.2     up
PME solve               25.2      22.2      18.9     down
----------------------------------------------------------
The two parts of the calculation for which the most time is saved when
going parallel are:
1) Force
2) Neighbor search (going from 2 cores to 4 cores does not make a big
difference, but going from 1 core to 2 or 4 saves a lot of time)
For GMX 4.0.7 it looks similar, although the difference between 2 and 4
cores is not as large as for GMX 4.5.5.
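To make this a bit more concrete, here is a small sketch (again my own;
the numbers are copied from the 4.5.5 table above) that lists how much
each category contributes to the overall saving between 1 and 4 cores;
the categories that only exist in parallel runs are left out:

# Sketch: per-category time saved going from 1 to 4 cores (GMX 4.5.5).
# Numbers copied from the log-file table above; categories that only
# exist for parallel runs (DD, communication) are omitted.
t_1core = {"Neighbor search": 614.1, "Force": 2401.6, "PME mesh": 596.3,
           "Write traj.": 1.2, "Update": 49.7, "Constraints": 79.3,
           "Rest": 38.3}
t_4core = {"Neighbor search": 323.7, "Force": 1676.0, "PME mesh": 639.1,
           "Write traj.": 0.6, "Update": 37.6, "Constraints": 60.0,
           "Rest": 25.4}
for cat in t_1core:
    saved = t_1core[cat] - t_4core[cat]        # positive = time saved
    print("%-16s saved %7.1f s" % (cat, saved))
print("%-16s saved %7.1f s" % ("Total", 3780.5 - 2877.5))

Force (725.6 s) and Neighbor search (290.4 s) together save more than the
total of 903.0 s; the parallel-only categories and the PME mesh take part
of that back.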
Is there a good explanation for this time saving?
I would have thought that the system has a fixed number of interactions
and that all of these interactions have to be calculated. If I divide the
set into 2 or 4 smaller sets, the number of interactions shouldn't change,
and so the total calculation time shouldn't change either (see the sketch
below)?
Or is there something fancy in the algorithm which reduces the time spent
accessing the arrays when the calculation runs over a smaller set of
interactions?
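To spell out the naive model behind my expectation above, here is a small
sketch (my own; it uses the measured 1-core total as the serial work, and
the per-core overhead values are made up):

# Sketch of my naive expectation: the total work is fixed by the number of
# interactions, each core gets an equal share, and running in parallel only
# adds some communication overhead (the overhead values below are made up).
def equiv_1cpu_time(serial_work, n_cores, overhead_per_core):
    """Equivalent single-CPU time (n_cores * wall time) in this model."""
    wall_time = serial_work / n_cores + overhead_per_core
    return n_cores * wall_time     # = serial_work + n_cores * overhead

for n_cores, overhead in ((1, 0.0), (2, 50.0), (4, 75.0)):
    print(n_cores, "cores:", equiv_1cpu_time(3780.0, n_cores, overhead))

Under this model the equivalent single-CPU time can only stay constant or
grow with the number of cores, which is why the measured numbers above
surprise me.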