[gmx-users] Re: Question about scaling
Thomas Schlesier
schlesi at uni-mainz.de
Tue Nov 13 14:22:41 CET 2012
Sorry for reposting, but I forgot one comment and have added it below:
On 13.11.2012 06:16, gmx-users-request at gromacs.org wrote:
>> Dear all,
>> >I did some scaling tests on a cluster and I'm a little bit clueless
>> >about the results.
>> >So first the setup:
>> >
>> >Cluster:
>> >Saxonid 6100, Opteron 6272 16C 2.100GHz, Infiniband QDR
>> >GROMACS version: 4.0.7 and 4.5.5
>> >Compiler: GCC 4.7.0
>> >MPI: Intel MPI 4.0.3.008
>> >FFT-library: ACML 5.1.0 fma4
>> >
>> >System:
>> >895 spce water molecules
> This is a somewhat small system, I would say.
>
>> >Simulation time: 750 ps (0.002 ps timestep)
>> >Cut-off: 1.0 nm
>> >but with long-range corrections (DispCorr = EnerPres; PME with standard
>> >settings - in each case no CPU dedicated solely to PME)
>> >V-rescale thermostat and Parrinello-Rahman barostat
>> >
>> >I get the following timings (in seconds), where each value is given as
>> >the equivalent single-CPU time (so if a job on 2 CPUs took X s, the
>> >listed time is 2 * X s).
>> >These timings were taken from the *.log file, at the end of the
>> >'Real cycle and time accounting' section.
>> >
>> >Timings:
>> >gmx-version   1 CPU   2 CPUs   4 CPUs
>> >4.0.7          4223     3384     3540
>> >4.5.5          3780     3255     2878
> Do you mean CPUs or CPU cores? Are you using the IB network or are you
> running single-node?
I meant the number of cores, and all cores are on the same node.
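Just to quantify what puzzles me, here is a small Python sketch (my own
illustration; the numbers are copied from the table above) that converts
the equivalent single-CPU times back to wall-clock times and computes
speedup and parallel efficiency:

# Sketch: convert the timings above (given as n_cores * wall time, in s)
# back to wall-clock times and compute speedup and parallel efficiency.
equiv_1cpu = {                   # {version: {n_cores: n_cores * wall_time}}
    "4.0.7": {1: 4223.0, 2: 3384.0, 4: 3540.0},
    "4.5.5": {1: 3780.0, 2: 3255.0, 4: 2878.0},
}
for version, times in equiv_1cpu.items():
    wall = {n: t / n for n, t in times.items()}    # actual wall-clock times
    for n in (2, 4):
        speedup = wall[1] / wall[n]
        efficiency = speedup / n                   # = times[1] / times[n]
        print("GMX %s, %d cores: speedup %.2f, efficiency %.2f"
              % (version, n, speedup, efficiency))

For 4.5.5 this gives an efficiency of about 1.16 on 2 cores and 1.31 on 4
cores, i.e. superlinear scaling, which is exactly what I cannot explain.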
>
>> >
>> >I'm a little bit clueless about the results. I always thought that if
>> >I have a non-interacting system and double the number of CPUs, I
> You do use PME, which means a global interaction of all charges.
>
>> >would get a simulation that takes only half the time (so the times as
>> >defined above would be equal). If the system does have interactions, I
>> >would lose some performance due to communication, and due to node
>> >imbalance there could be a further loss of performance.
>> >
>> >Keeping this in mind, I can only explain the 4.0.7 timings for the step
>> >from 2 to 4 CPUs (2 CPUs are a little faster, since going to 4 CPUs
>> >leads to more communication -> loss of performance).
>> >
>> >All the other timings, especially that the 1-CPU run takes longer in
>> >each case than the parallel runs, I do not understand.
>> >Probably the system is too small and/or the simulation time is too short
>> >for a scaling test. But I would assume that the time needed to set up
>> >the simulation would be equal for all three cases of one GROMACS version.
>> >The only other explanation that comes to my mind would be that something
>> >went wrong during the installation of the programs?
> You might want to take a closer look at the timings in the md.log output
> files; this will give you a clue where the bottleneck is, and also tell
> you about the communication-to-computation ratio.
>
> Best,
> Carsten
>
>
>> >
>> >Please, can somebody enlighten me?
>> >
Here are the timings from the log-file (for GMX 4.5.5):
#cores:                    1         2         4    (all cores on the same node)
(entries missing in the 1-core column are for work that only occurs in parallel runs)
Computing:                                           trend
----------------------------------------------------------
Domain decomp.                    41.7      47.8     up
DD comm. load                      0.0       0.0     -
Comm. coord.                      17.8      30.5     up
Neighbor search        614.1     355.4     323.7     down
Force                 2401.6    1968.7    1676.0     down
Wait + Comm. F                    15.1      31.4     up
PME mesh               596.3     710.4     639.1     -
Write traj.              1.2       0.8       0.6     down
Update                  49.7      44.0      37.6     down
Constraints             79.3      70.4      60.0     down
Comm. energies                     3.2       5.3     up
Rest                    38.3      27.1      25.4     down
----------------------------------------------------------
Total                 3780.5    3254.6    2877.5     down
----------------------------------------------------------
PME redist. X/F                  133.0     120.5     down
PME spread/gather      511.3     465.7     396.8     down
PME 3D-FFT              59.4      88.9     102.2     up
PME solve               25.2      22.2      18.9     down
----------------------------------------------------------
The two parts of the calculation for which the most time is saved when
going parallel are:
1) Force
2) Neighbor search (going from 2 cores to 4 cores does not make a big
difference, but going from 1 core to 2 or 4 saves a lot of time)
For GMX 4.0.7 it looks similar, although the difference between 2 and 4
cores is not as large as for GMX 4.5.5.
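To make this a bit more concrete, here is a small sketch (again my own;
the numbers are copied from the 4.5.5 table above) that lists how much
each category contributes to the overall saving between 1 and 4 cores;
the categories that only exist in parallel runs are left out:

# Sketch: per-category time saved going from 1 to 4 cores (GMX 4.5.5).
# Numbers copied from the log-file table above; categories that only
# exist for parallel runs (DD, communication) are omitted.
t_1core = {"Neighbor search": 614.1, "Force": 2401.6, "PME mesh": 596.3,
           "Write traj.": 1.2, "Update": 49.7, "Constraints": 79.3,
           "Rest": 38.3}
t_4core = {"Neighbor search": 323.7, "Force": 1676.0, "PME mesh": 639.1,
           "Write traj.": 0.6, "Update": 37.6, "Constraints": 60.0,
           "Rest": 25.4}
for cat in t_1core:
    saved = t_1core[cat] - t_4core[cat]        # positive = time saved
    print("%-16s saved %7.1f s" % (cat, saved))
print("%-16s saved %7.1f s" % ("Total", 3780.5 - 2877.5))

Force (725.6 s) and Neighbor search (290.4 s) together save more than the
total of 903.0 s; the parallel-only categories and the PME mesh take part
of that back.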
Is there a good explanation for this time saving?
I would have thought that the system has a fixed number of interactions
and that all of these interactions have to be calculated. If I divide the
set into 2 or 4 smaller sets, the number of interactions shouldn't change,
and so the total calculation time shouldn't change either (see the sketch
below)?
Or is there something fancy in the algorithm which reduces the time spent
accessing the arrays when the calculation runs over a smaller set of
interactions?
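To spell out the naive model behind my expectation above, here is a small
sketch (my own; it uses the measured 1-core total as the serial work, and
the per-core overhead values are made up):

# Sketch of my naive expectation: the total work is fixed by the number of
# interactions, each core gets an equal share, and running in parallel only
# adds some communication overhead (the overhead values below are made up).
def equiv_1cpu_time(serial_work, n_cores, overhead_per_core):
    """Equivalent single-CPU time (n_cores * wall time) in this model."""
    wall_time = serial_work / n_cores + overhead_per_core
    return n_cores * wall_time     # = serial_work + n_cores * overhead

for n_cores, overhead in ((1, 0.0), (2, 50.0), (4, 75.0)):
    print(n_cores, "cores:", equiv_1cpu_time(3780.0, n_cores, overhead))

Under this model the equivalent single-CPU time can only stay constant or
grow with the number of cores, which is why the measured numbers above
surprise me.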