[gmx-users] Re: Question about scaling

Tue Nov 13 14:46:41 CET 2012

Hi,

On Nov 13, 2012, at 2:22 PM, Thomas Schlesier <schlesi at uni-mainz.de> wrote:

> Sorry for reposting, but forgot one comment and added it now below:
> 
> Am 13.11.2012 06:16, schrieb gmx-users-request at gromacs.org:
> >> Dear all,
> >> >I ran some scaling tests on a cluster and I'm puzzled by the results.
> >> >So first the setup:
> >> >
> >> >Cluster:
> >> >Saxonid 6100, Opteron 6272 16C 2.100GHz, Infiniband QDR
> >> >GROMACS version: 4.0.7 and 4.5.5
> >> >Compiler: 	GCC 4.7.0
> >> >MPI: Intel MPI 4.0.3.008
> >> >FFT-library: ACML 5.1.0 fma4
> >> >
> >> >System:
> >> >895 spce water molecules
> > this is a somewhat small system I would say.
> >
> >> >Simulation time: 750 ps (0.002 ps timestep)
> >> >Cut-off: 1.0 nm
> >> >with long-range corrections (DispCorr = EnerPres; PME with standard settings, in each case without extra CPUs dedicated solely to PME)
> >> >V-rescale thermostat and Parrinello-Rahman barostat
> >> >
> >> >I get the following timings (seconds), where each value is calculated as the time that would be needed on 1 CPU (so if a job on 2 CPUs took X s, the listed time is 2 * X s).
> >> >These timings were taken from the *.log file, at the end of the
> >> >'real cycle and time accounting' - section.
> >> >
> >> >Timings:
> >> >gmx-version	1cpu	2cpu	4cpu
> >> >4.0.7		4223	3384	3540
> >> >4.5.5		3780	3255	2878
> > Do you mean CPUs or CPU cores? Are you using the IB network or are you running single-node?
> 
> Meant number of cores and all cores are on the same node.
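
The summed timings in the table above can be converted to wall-clock time, speedup and parallel efficiency. The following is only a sketch, assuming (as described in the post) that each value is the wall-clock time multiplied by the number of cores; the names `summed` and `scaling` are made up for this illustration.

```python
# Summed timings (seconds x cores) as listed in the post: version -> {cores: seconds}
summed = {
    "4.0.7": {1: 4223, 2: 3384, 4: 3540},
    "4.5.5": {1: 3780, 2: 3255, 4: 2878},
}


def scaling(timings):
    """Return {cores: (wall_seconds, speedup, efficiency)} from summed timings."""
    t1 = timings[1]  # the 1-core run is the reference
    result = {}
    for n, total in sorted(timings.items()):
        wall = total / n          # assumed: summed time = wall time * cores
        speedup = t1 / wall       # how much faster than the 1-core run
        efficiency = speedup / n  # 1.0 would be ideal linear scaling
        result[n] = (wall, speedup, efficiency)
    return result


for version, timings in summed.items():
    for n, (wall, speedup, eff) in scaling(timings).items():
        print(f"{version}  {n} cores: wall {wall:7.1f} s, "
              f"speedup {speedup:4.2f}, efficiency {eff:4.2f}")
```

With this definition, an efficiency above 1 means the summed time went down when cores were added, which is exactly the behaviour being asked about (about 1.31 for 4.5.5 on 4 cores).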
> 
> >
> >> >
> >> >I'm puzzled by the results. I always thought that if I have a non-interacting system and double the number of CPUs, I
> > You do use PME, which means a global interaction of all charges.
> >
> >> >would get a simulation that takes only half the time (so the times as defined above would be equal). If the system does have interactions, I would lose some performance to communication. Load imbalance between nodes could cause a further loss of performance.
> >> >
> >> >Keeping this in mind, I can only explain the timings for version 4.0.7 going from 2 to 4 CPUs (2 CPUs is a little faster, since going to 4 CPUs leads to more communication and hence a loss of performance).
> >> >
> >> >All the other timings I do not understand, especially that 1 CPU takes longer in each case than the others.
> >> >Probably the system is too small and/or the simulation time is too short for a scaling test. But I would assume that the time needed to set up the simulation is the same for all three cases of one GROMACS version.
For somewhat cleaner benchmark numbers that exclude setup and load-balancing equilibration time,
you can pass the "-resethway" switch to mdrun. It will then report timings only for the last
half of the time steps.

> >> >The only other explanation that comes to my mind is that something went wrong during the installation of the programs?
I think it is the small size of your system. Try a benchmark with, e.g., 10k particles; only if that
looks equally bad would I assume something is wrong with the installation.

Carsten

> > You might want to take a closer look at the timings in the md.log output files, this will
> > give you a clue where the bottleneck is, and also tell you about the communication-computation
> > ratio.
> >
> > Best,
> >    Carsten
> >
> >
> >> >
> >> >Please, can somebody enlighten me?
> >> >
> 
> Here are the timings from the log-file (for GMX 4.5.5):
> 
> #cores:                     1        2        4   trend   (all cores are on the same node)
>  Computing:
> ---------------------------------------------------------
>  Domain decomp.           n/a     41.7     47.8   up
>  DD comm. load            n/a      0.0      0.0   -
>  Comm. coord.             n/a     17.8     30.5   up
>  Neighbor search        614.1    355.4    323.7   down
>  Force                 2401.6   1968.7   1676.0   down
>  Wait + Comm. F           n/a     15.1     31.4   up
>  PME mesh               596.3    710.4    639.1   -
>  Write traj.              1.2      0.8      0.6   down
>  Update                  49.7     44.0     37.6   down
>  Constraints             79.3     70.4     60.0   down
>  Comm. energies           n/a      3.2      5.3   up
>  Rest                    38.3     27.1     25.4   down
> ---------------------------------------------------------
>  Total                 3780.5   3254.6   2877.5   down
> ---------------------------------------------------------
> ---------------------------------------------------------
>  PME redist. X/F          n/a    133.0    120.5   down
>  PME spread/gather      511.3    465.7    396.8   down
>  PME 3D-FFT              59.4     88.9    102.2   up
>  PME solve               25.2     22.2     18.9   down
> --------------------
> 
> The two parts of the calculation for which going parallel saves the most
> time are:
> 1) Forces
> 2) Neighbor search (OK, going from 2 cores to 4 cores does not make a big
> difference, but going from 1 core to 2 or 4 saves a lot of time)
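
The ranking stated above can be checked directly from the 4.5.5 log timings. This is a rough sketch, not part of GROMACS: the dictionary `timings` and the helper `savings` are made up for this illustration, and entries missing from the 1-core column are assumed to cost 0 s there.

```python
# Summed seconds per component for GROMACS 4.5.5 on (1, 2, 4) cores,
# copied from the log table; None marks entries absent from the 1-core log.
timings = {
    "Domain decomp.":  (None,   41.7,   47.8),
    "DD comm. load":   (None,   0.0,    0.0),
    "Comm. coord.":    (None,   17.8,   30.5),
    "Neighbor search": (614.1,  355.4,  323.7),
    "Force":           (2401.6, 1968.7, 1676.0),
    "Wait + Comm. F":  (None,   15.1,   31.4),
    "PME mesh":        (596.3,  710.4,  639.1),
    "Write traj.":     (1.2,    0.8,    0.6),
    "Update":          (49.7,   44.0,   37.6),
    "Constraints":     (79.3,   70.4,   60.0),
    "Comm. energies":  (None,   3.2,    5.3),
    "Rest":            (38.3,   27.1,   25.4),
}


def savings(table, col):
    """Seconds saved in column `col` (1 = 2 cores, 2 = 4 cores) vs. the 1-core run."""
    saved = {}
    for name, row in table.items():
        base = row[0] if row[0] is not None else 0.0  # assumption: absent means 0 s
        saved[name] = base - row[col]
    return saved


saved4 = savings(timings, 2)
ranked = sorted(saved4.items(), key=lambda item: item[1], reverse=True)
for name, seconds in ranked:
    print(f"{name:16s} {seconds:8.1f} s")
print(f"{'Sum':16s} {sum(saved4.values()):8.1f} s")
```

Negative values are components that became more expensive in parallel (the communication and domain decomposition rows, and PME mesh). The script only reproduces the bookkeeping; it does not explain why the Force and Neighbor search rows shrink.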
> 
> For GMX 4.0.7 it looks similar, although the difference between 2 and 4 cores is not as large as for GMX 4.5.5.
> 
> Is there a good explanation for this time saving?
> I would have thought that the system has a fixed number of interactions and
> that all of them have to be calculated. If I divide the set into 2 or
> 4 smaller sets, the number of interactions shouldn't change, and so the
> calculation time shouldn't change either?
> 
> Or is there something fancy in the algorithm that reduces the time spent
> on accessing the arrays when the calculation is for a smaller set of
> interactions?

--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/grubmueller/kutzner