[gmx-users] Re: Question about scaling
ckutzne at gwdg.de
Tue Nov 13 14:46:41 CET 2012
On Nov 13, 2012, at 2:22 PM, Thomas Schlesier <schlesi at uni-mainz.de> wrote:
> Sorry for reposting, but forgot one comment and added it now below:
> On 13.11.2012 06:16, gmx-users-request at gromacs.org wrote:
> >> Dear all,
> >> >i did some scaling tests for a cluster and i'm a little bit clueless about the results.
> >> >So first the setup:
> >> >
> >> >Cluster:
> >> >Saxonid 6100, Opteron 6272 16C 2.100GHz, Infiniband QDR
> >> >GROMACS version: 4.0.7 and 4.5.5
> >> >Compiler: GCC 4.7.0
> >> >MPI: Intel MPI 4.0.3.008
> >> >FFT-library: ACML 5.1.0 fma4
> >> >
> >> >System:
> >> >895 spce water molecules
> > this is a somewhat small system I would say.
> >> >Simulation time: 750 ps (0.002 ps timestep)
> >> >Cut-off: 1.0 nm
> >> >but with long-range correction ( DispCorr = EnerPres ; PME (standard settings) - but in each case no extra CPU solely for PME)
> >> >V-rescale thermostat and Parrinello-Rahman barostat
> >> >
> >> >I get the following timings (seconds), where each value is calculated as the time which would be needed on 1 CPU (so if a job on 2 CPUs took X s, the reported time is 2 * X s).
> >> >These timings were taken from the *.log file, at the end of the
> >> >'real cycle and time accounting' - section.
> >> >
> >> >Timings:
> >> >gmx-version    1cpu    2cpu    4cpu
> >> >4.0.7          4223    3384    3540
> >> >4.5.5          3780    3255    2878
> > Do you mean CPUs or CPU cores? Are you using the IB network or are you running single-node?
> Meant number of cores and all cores are on the same node.
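The timings above can be converted back into wall-clock time, speedup and parallel efficiency. The following is only a small sketch, assuming the convention stated in the post (reported time = number of cores * wall-clock time); the function and variable names are made up for illustration.

```python
# Core-seconds as reported in the post (wall-clock time multiplied by core count).
core_seconds = {
    "4.0.7": {1: 4223, 2: 3384, 4: 3540},
    "4.5.5": {1: 3780, 2: 3255, 4: 2878},
}

def scaling(times):
    """Return {cores: (wall_seconds, speedup, efficiency)} for one GROMACS version."""
    t1 = times[1]
    result = {}
    for cores, total in sorted(times.items()):
        wall = total / cores          # undo the "cores * X" convention
        speedup = t1 / wall           # relative to the single-core run
        efficiency = speedup / cores  # identical to t1 / total
        result[cores] = (wall, speedup, efficiency)
    return result

for version, times in core_seconds.items():
    for cores, (wall, speedup, eff) in scaling(times).items():
        print(f"{version}  {cores} cores  wall={wall:7.1f} s  "
              f"speedup={speedup:4.2f}  efficiency={eff:4.2f}")
```

Under this convention an efficiency above 1 means super-linear scaling, which is the case for every parallel run in the table and is exactly the part of the results the question is about.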
> >> >
> >> >I'm a little bit clueless about the results. I always thought, that if i have a non-interacting system and double the amount of CPUs, i
> > You do use PME, which means a global interaction of all charges.
> >> >would get a simulation which takes only half the time (so the times as defined above would be equal). If the system does have interactions, i would lose some performance due to communication. Due to node imbalance there could be a further loss of performance.
> >> >
> >> >Keeping this in mind, i can only explain the timings for version 4.0.7 2cpu -> 4cpu (2cpu a little bit faster, since going to 4cpu leads to more communication -> loss of performance).
> >> >
> >> >All the other timings, especially that 1cpu takes in each case longer than the other cases, i do not understand.
> >> >Probably the system is too small and / or the simulation time is too short for a scaling test. But i would assume that the amount of time to set up the simulation would be equal for all three cases of one GROMACS-version.
For somewhat cleaner benchmark numbers excluding any setup and load balancing equilibration time,
you can set the "-resethway" switch to mdrun. This way, it will only report timings for the last
half of the time steps.
> >> >The only other explanation that comes to my mind would be that something went wrong during the installation of the programs?
I think it is the small size of your system. Try a benchmark with e.g. 10k particles; only if that
looks just as bad would I assume something is wrong with the installation.
> > You might want to take a closer look at the timings in the md.log output files, this will
> > give you a clue where the bottleneck is, and also tell you about the communication-computation
> > ratio.
> > Best,
> > Carsten
> >> >
> >> >Please, can somebody enlighten me?
> >> >
> Here are the timings from the log-file (for GMX 4.5.5):
> #cores:                   1        2        4   (all cores are on the same node)
> Domain decomp.                  41.7     47.8   up
> DD comm. load                    0.0      0.0   -
> Comm. coord.                    17.8     30.5   up
> Neighbor search       614.1    355.4    323.7   down
> Force                2401.6   1968.7   1676.0   down
> Wait + Comm. F                  15.1     31.4   up
> PME mesh              596.3    710.4    639.1   -
> Write traj.             1.2      0.8      0.6   down
> Update                 49.7     44.0     37.6   down
> Constraints            79.3     70.4     60.0   down
> Comm. energies                   3.2      5.3   up
> Rest                   38.3     27.1     25.4   down
> Total                3780.5   3254.6   2877.5   down
> PME redist. X/F                133.0    120.5   down
> PME spread/gather     511.3    465.7    396.8   down
> PME 3D-FFT             59.4     88.9    102.2   up
> PME solve              25.2     22.2     18.9   down
> The two calculations-parts for which the most time is saved for going
> parallel are:
> 1) Forces
> 2) Neighbor search (ok, going from 2cores to 4cores does not make a big
> difference, but from 1core to 2 or 4 saves much time)
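The ranking of these two parts can be checked directly from the GMX 4.5.5 log timings quoted above. The sketch below is an illustration only: the numbers are copied from the table, rows missing from the single-core log are stored as None and counted as 0 s, the PME sub-rows listed below "Total" are left out to avoid double counting, and the helper name is made up.

```python
# Per-row timings in seconds for GMX 4.5.5 on (1, 2, 4) cores, copied from the log table.
# None marks rows that do not appear in the single-core log.
rows = {
    "Domain decomp.":  (None, 41.7, 47.8),
    "DD comm. load":   (None, 0.0, 0.0),
    "Comm. coord.":    (None, 17.8, 30.5),
    "Neighbor search": (614.1, 355.4, 323.7),
    "Force":           (2401.6, 1968.7, 1676.0),
    "Wait + Comm. F":  (None, 15.1, 31.4),
    "PME mesh":        (596.3, 710.4, 639.1),
    "Write traj.":     (1.2, 0.8, 0.6),
    "Update":          (49.7, 44.0, 37.6),
    "Constraints":     (79.3, 70.4, 60.0),
    "Comm. energies":  (None, 3.2, 5.3),
    "Rest":            (38.3, 27.1, 25.4),
}

def savings(table, a=0, b=2):
    """Seconds saved per row going from column a to column b (negative = extra cost)."""
    return {name: (v[a] or 0.0) - (v[b] or 0.0) for name, v in table.items()}

# Largest saving first; rows that only exist in parallel runs end up negative.
ranked = sorted(savings(rows).items(), key=lambda kv: kv[1], reverse=True)
for name, saved in ranked:
    print(f"{name:16s} {saved:8.1f}")

# Sanity check: the rows should add up to the "Total" line of each column.
for col, cores in enumerate((1, 2, 4)):
    print(cores, "cores, sum of rows:", round(sum(v[col] or 0.0 for v in rows.values()), 1))
```

Going from 1 to 4 cores, the force and neighbor search rows account for almost all of the saved time, while the communication rows and the PME mesh add a comparatively small cost.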
> For GMX 4.0.7 it looks similar, although the difference between 2 and 4 cores is not as large as for GMX 4.5.5
> Is there any good explanation for this time saving?
> I would have thought that the system has a set number of interaction and
> one has to calculate all these interactions. If i divide the set in 2 or
> 4 smaller sets, the number of interactions shouldn't change and so the
> calculation time shouldn't change?
> Or is there something fancy in the algorithm, which reduces the time spent
> for calling up the arrays if the calculation is for a smaller set of
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302