[gmx-users] Some Scaling of 5.0 Results
Szilárd Páll
pall.szilard at gmail.com
Mon Sep 22 21:10:17 CEST 2014
In addition to Mark's comments, let me ask a few questions and add a couple of points.
What was your benchmarking procedure for core counts that represent
less than a full socket?
Besides the thread affinity issue Mark mentioned, clock frequency
scaling (turbo boost) can also distort performance plots: unless it is
explicitly turned off in the BIOS/firmware, you will observe
artificially high performance at low core counts, which makes the
scaling look inherently worse. The effect is amplified by the lower
cache traffic when a multicore CPU is only partially used. Both are
artificial effects that you won't see in real-world runs - unless you
deliberately leave a number of cores idle.
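For what it's worth, on most Linux machines you can at least check from
user space whether boost is active, e.g. with the cpupower utility if
it happens to be installed:

    cpupower frequency-info    # look for the "boost state support" section

Whether you can actually disable it without BIOS/firmware access
depends on the machine, so treat this only as a rough check.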
There is no single "right way" to avoid these issues, but there
certainly are ways to present the data in a less than useful manner -
especially when it comes to scaling plots. A simple way to avoid such
pitfalls, and to eliminate the potential for misleading strong-scaling
plots, is to start from at least a full socket (or node). Otherwise,
IMO the <8-thread data points on your plot only make sense if you show
strong scaling to multiple sockets/nodes using the same number of
threads per socket as you started with, leaving the rest of the cores
idle.
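For example, a pinned quarter-socket run could look something like this
(adjust to your hardware; I'm assuming two 8-core sockets per node with
hyper-threading disabled, and "bench" is just a placeholder file name):

    # 4 OpenMP threads pinned to the first 4 cores of socket 0
    gmx mdrun -deffnm bench -ntmpi 1 -ntomp 4 -pin on -pinoffset 0 -pinstride 1

    # the same on socket 1, if you run one such job per socket
    gmx mdrun -deffnm bench -ntmpi 1 -ntomp 4 -pin on -pinoffset 8 -pinstride 1

With hyper-threading enabled the stride needs adjusting (or leave the
stride choice to mdrun) so that only one thread lands on each physical
core.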
What run configuration did you use for Verlet on a single node? With
the Verlet scheme, runs without domain decomposition, i.e.
multithreading-only (OpenMP) runs, are typically more efficient than
runs with domain decomposition. This typically holds up to a full
socket and quite often even across two Intel sockets.
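Concretely, on one 16-core node the two setups would look roughly like
this (the file name is a placeholder):

    # no domain decomposition: 1 thread-MPI rank, 16 OpenMP threads
    gmx mdrun -deffnm octanol -ntmpi 1 -ntomp 16 -pin on

    # domain decomposition: 16 thread-MPI ranks, 1 OpenMP thread each
    gmx mdrun -deffnm octanol -ntmpi 16 -ntomp 1 -pin on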
Did you tune the PME performance, i.e. the number of separate PME ranks?
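You can either set -npme by hand for a few candidate values, or let
tune_pme scan them for you; e.g. on four of your nodes something like
this (file names are again placeholders, and "gmx_mpi" is just the
usual naming convention for the MPI build - yours may differ):

    # fix the number of separate PME ranks explicitly
    mpirun -np 64 gmx_mpi mdrun -deffnm octanol -npme 16

    # or let tune_pme try a range of PME rank counts
    gmx tune_pme -np 64 -s octanol.tpr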
Did you use nstlist=40 for all Verlet data points? That may not be
optimal across all node counts, especially on fewer than two nodes,
but of course that's hard to tell without trying!
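With the Verlet scheme the pair-list buffer is set from
verlet-buffer-tolerance, so as far as accuracy goes you are free to
experiment with nstlist; roughly:

    ; in the mdp file
    cutoff-scheme           = Verlet
    nstlist                 = 20    ; 20-40 are typical values to try
    verlet-buffer-tolerance = 0.005

    # or override it at run time without regenerating the tpr
    gmx mdrun -deffnm octanol -nstlist 20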
Finally, looking at the octanol Verlet plot, especially in comparison
with the water plot, what is strange is that the scaling efficiency is
much worse than with water and varies considerably between neighboring
data points. This indicates that something was not entirely right with
those runs.
Cheers,
--
Szilárd
On Fri, Sep 19, 2014 at 1:35 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> On Fri, Sep 19, 2014 at 2:50 AM, Dallas Warren <Dallas.Warren at monash.edu>
> wrote:
>
>> Some scaling results that might be of interest to some people.
>>
>> Machine = Barcoo @ VLSCI
>> 2.7GHz Intel Sandybridge cores
>> 256GB RAM
>> 16 cores per node
>> Mellanox FDR14 InfiniBand switch
>>
>> Systems = water and octanol only with GROMOS53a6
>>
>> # Atoms = 10,000 to 1,000,000
>>
>> Comparison Group versus Verlet neighbour searching
>>
>> Image/graphs see https://twitter.com/dr_dbw/status/512763354566254592
>>
>> Basically, group neighbour searching is faster and scales better than
>> Verlet for this setup. I was expecting that to be the case with water,
>> since it is mentioned somewhere that it is. However, for the pure octanol
>> system I was expecting it to be the other way around?
>>
>
> Thanks for sharing. Since the best way to write code that scales well is to
> write code that runs slowly, we generally prefer to look at raw ns/day.
> Choosing between perfect scaling of implementation A at 10 ns/day and
> imperfect scaling of implementation B starting at 50 ns/day is a
> no-brainer, but only if you know the throughput.
>
> I'd also be very suspicious of your single-core result, given your
> super-linear scaling. When using fewer cores than a full node, you need to
> take care to pin the threads (mdrun -pin on) and to make sure no other
> processes are running on that core/node. If that result is noisy because it
> competed with other jobs over time, then every "scaling" data point is
> affected.
>
> Also, to observe the scaling benefits of the Verlet scheme, you have to get
> involved with using OpenMP as the core count gets higher: the whole point is
> that it permits more than one core to share the work of a domain, and the
> (short-ranged part of the) group scheme hasn't been implemented to do that.
> Since you don't mention OpenMP, you're probably not using it ;-)
> Similarly, the group scheme is unbuffered by default, so it's an
> apples-and-oranges comparison unless you state what buffer you used there.
>
> Cheers,
>
> Mark
>
> Catch ya,
>>
>> Dr. Dallas Warren
>> Drug Delivery, Disposition and Dynamics
>> Monash Institute of Pharmaceutical Sciences, Monash University
>> 381 Royal Parade, Parkville VIC 3052
>> dallas.warren at monash.edu
>> +61 3 9903 9304
>> ---------------------------------
>> When the only tool you own is a hammer, every problem begins to resemble a
>> nail.
>>
>>