[gmx-users] No performance increase with single vs multiple nodes
Mark Abraham
mark.j.abraham at gmail.com
Wed Oct 25 10:21:47 CEST 2017
Hi,
On Wed, Oct 25, 2017 at 4:24 AM Matthew W Hanley <mwhanley at syr.edu> wrote:
> > There's several dozen lines of performance analysis at the end of the log
> > file, which you need to inspect and compare if you want to start to
> > understand what is going on :-)
>
> Thank you for the feedback. Fair warning, I'm more of a system
> administrator than a regular gromacs user. What is it that I should be
> focused on, and more importantly how do I find the bottleneck? Gromacs
> does recommend using AVX2_256, but I was unable to get Gromacs to build
> using that.
That's your first thing to do, then :-) Presumably you need an updated
toolchain (e.g. devtoolset), because CentOS's stability focus makes it a poor
fit for HPC: "stable" basically means old, with weak support for newer
hardware and compilers.
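As a rough sketch of what that looks like on CentOS (the devtoolset version, source directory, and paths below are assumptions, so adjust for your site), you would enable a newer GCC via Software Collections and then ask cmake for the AVX2 kernels explicitly:

```shell
# Sketch only: enable a newer gcc from Software Collections, then build
# GROMACS with the AVX2_256 SIMD kernels. Version numbers are assumptions.
sudo yum install -y centos-release-scl devtoolset-7
scl enable devtoolset-7 bash          # shell with the newer gcc on PATH

cd gromacs-2016.4 && mkdir -p build && cd build
cmake .. -DGMX_SIMD=AVX2_256 \
         -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
make -j"$(nproc)" && make check
```

If cmake still rejects AVX2_256, the log it prints will name the compiler it found, which tells you whether the devtoolset environment was actually picked up.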
However, that's not going to help your issue with scaling across nodes. Given
that you're not using PME, and assuming that your system is large enough
(e.g. at least a few tens of thousands of particles), the most likely issue
is network latency. GROMACS can run over gigabit ethernet, but its latency is
generally too high for efficient scaling across nodes.
> Here is more of the log file:
>
> On 32 MPI ranks
>
> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                     Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
> Domain decomp.        32    1       1666      18.920       1509.802   3.8
> DD comm. load         32    1       1666       0.017          1.394   0.0
> DD comm. bounds       32    1       1666       0.206         16.406   0.0
> Vsite constr.         32    1      50001       4.624        369.013   0.9
> Neighbor search       32    1       1667      19.646       1567.793   4.0
> Comm. coord.          32    1      48334       8.291        661.640   1.7
> Force                 32    1      50001     339.477      27090.350  68.6
> Wait + Comm. F        32    1      50001      12.691       1012.783   2.6
> NB X/F buffer ops.    32    1     146669      13.563       1082.352   2.7
> Vsite spread          32    1      50001       8.716        695.518   1.8
> Write traj.           32    1          2       0.080          6.366   0.0
> Update                32    1      50001      37.268       2973.983   7.5
> Constraints           32    1      50001      25.674       2048.789   5.2
> Comm. energies        32    1       5001       0.965         77.013   0.2
> Rest                                            4.385        349.931   0.9
> -----------------------------------------------------------------------------
> Total                                         494.524      39463.132 100.0
> -----------------------------------------------------------------------------
>
> If that's not helpful, I would need more specifics on what part of the log
> file would be.
That all looks very normal. Seeing where the bottleneck emerges, however,
requires comparing multiple log files. If the network is the problem, then
most fields other than Force will increase their % share of the run time. You
could upload some log files to a file-sharing service and share links if you
want some feedback.
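To sketch what that comparison looks like (the file names and the second timing row below are made-up stand-ins, not data from this thread), the Force %-share can be pulled out of each log's accounting table with awk:

```shell
#!/bin/sh
# Sketch: extract the Force row's %-of-runtime column (the last field)
# from an mdrun log's "Computing:" accounting table.
force_share() {
    awk '/^ Force /{print $NF}' "$1"
}

# Stand-in data mimicking the table in this thread; the 2-node row is
# hypothetical, for illustration only.
cat > node1.log <<'EOF'
 Force                 32    1      50001     339.477      27090.350  68.6
EOF
cat > node2.log <<'EOF'
 Force                 64    1      50001     180.123      14400.000  41.2
EOF

echo "1-node Force share: $(force_share node1.log)%"
echo "2-node Force share: $(force_share node2.log)%"
# If Force's share drops sharply while the Comm./Wait rows grow, the
# network is the likely bottleneck.
```

The same one-liner works for any other row (e.g. `Wait \+ Comm. F`) if you adjust the pattern.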
> Failing that, if anyone could recommend some good documentation for
> optimizing performance I would greatly appreciate it, thank you!
>
http://manual.gromacs.org/documentation/2016.4/user-guide/mdrun-performance.html
but many points won't apply because you're not using PME, so you're in the
easy case. You should be able to scale down to fewer than about 500 particles
per core, but the actual target varies heavily with the hardware, the use of
vsites, and the network performance.
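To put that rule of thumb into numbers (the system size below is purely hypothetical, not from this thread):

```shell
# Back-of-the-envelope core-count ceiling from the ~500 particles/core
# rule of thumb. The particle count is a hypothetical example.
particles=384000
min_per_core=500
max_cores=$((particles / min_per_core))
echo "Rough upper bound: $max_cores cores"
```

Past that point, adding ranks mostly adds communication overhead rather than speed.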
Mark