[gmx-users] No performance increase with single vs multiple nodes
Mark Abraham
mark.j.abraham at gmail.com
Wed Oct 25 10:21:47 CEST 2017
Hi,
On Wed, Oct 25, 2017 at 4:24 AM Matthew W Hanley <mwhanley at syr.edu> wrote:
> > There's several dozen lines of performance analysis at the end of the log
> > file, which you need to inspect and compare if you want to start to
> > understand what is going on :-)
>
> Thank you for the feedback. Fair warning, I'm more of a system
> administrator than a regular gromacs user. What is it that I should be
> focused on, and more importantly how do I find the bottleneck? Gromacs
> does recommend using AVX2_256, but I was unable to get Gromacs to build
> using that.
That's your first thing to do, then :-) Presumably you need an updated
toolchain (e.g. devtoolset), because CentOS's stability focus makes it a poor
fit for HPC: "stable" basically means old, with weak support for newer
hardware and compilers.
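As a rough sketch of what that looks like on CentOS (the devtoolset version, source directory, and paths below are assumptions, so adjust for your site), you would enable a newer GCC via Software Collections and then ask cmake for the AVX2 kernels explicitly:

```shell
# Sketch only: enable a newer gcc from Software Collections, then build
# GROMACS with the AVX2_256 SIMD kernels. Version numbers are assumptions.
sudo yum install -y centos-release-scl devtoolset-7
scl enable devtoolset-7 bash          # shell with the newer gcc on PATH

cd gromacs-2016.4 && mkdir -p build && cd build
cmake .. -DGMX_SIMD=AVX2_256 \
         -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
make -j"$(nproc)" && make check
```

If cmake still rejects AVX2_256, the log it prints will name the compiler it found, which tells you whether the devtoolset environment was actually picked up.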
However, that's not going to help your issue with scaling across nodes. Given
that you're not using PME, and assuming that your system is large enough
(e.g. at least a few tens of thousands of particles), the most likely issue
is network latency. GROMACS can run over gigabit ethernet, but its latency is
generally too high for efficient scaling across nodes.
> Here is more of the log file:
>
> On 32 MPI ranks
>
> Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                     Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
> Domain decomp.        32    1       1666      18.920       1509.802   3.8
> DD comm. load         32    1       1666       0.017          1.394   0.0
> DD comm. bounds       32    1       1666       0.206         16.406   0.0
> Vsite constr.         32    1      50001       4.624        369.013   0.9
> Neighbor search       32    1       1667      19.646       1567.793   4.0
> Comm. coord.          32    1      48334       8.291        661.640   1.7
> Force                 32    1      50001     339.477      27090.350  68.6
> Wait + Comm. F        32    1      50001      12.691       1012.783   2.6
> NB X/F buffer ops.    32    1     146669      13.563       1082.352   2.7
> Vsite spread          32    1      50001       8.716        695.518   1.8
> Write traj.           32    1          2       0.080          6.366   0.0
> Update                32    1      50001      37.268       2973.983   7.5
> Constraints           32    1      50001      25.674       2048.789   5.2
> Comm. energies        32    1       5001       0.965         77.013   0.2
> Rest                                            4.385        349.931   0.9
> -----------------------------------------------------------------------------
> Total                                         494.524      39463.132 100.0
> -----------------------------------------------------------------------------
>
> If that's not helpful, I would need more specifics on what part of the log
> file would be.
That all looks very normal. Seeing where the bottleneck emerges, however,
requires comparing multiple log files. If the network is the problem, then
most fields other than Force will increase their % share of the run time. You
could upload some log files to a file-sharing service and share links if you
want some feedback.
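To sketch what that comparison looks like (the file names and the second timing row below are made-up stand-ins, not data from this thread), the Force %-share can be pulled out of each log's accounting table with awk:

```shell
#!/bin/sh
# Sketch: extract the Force row's %-of-runtime column (the last field)
# from an mdrun log's "Computing:" accounting table.
force_share() {
    awk '/^ Force /{print $NF}' "$1"
}

# Stand-in data mimicking the table in this thread; the 2-node row is
# hypothetical, for illustration only.
cat > node1.log <<'EOF'
 Force                 32    1      50001     339.477      27090.350  68.6
EOF
cat > node2.log <<'EOF'
 Force                 64    1      50001     180.123      14400.000  41.2
EOF

echo "1-node Force share: $(force_share node1.log)%"
echo "2-node Force share: $(force_share node2.log)%"
# If Force's share drops sharply while the Comm./Wait rows grow, the
# network is the likely bottleneck.
```

The same one-liner works for any other row (e.g. `Wait \+ Comm. F`) if you adjust the pattern.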
> Failing that, if anyone could recommend some good documentation for
> optimizing performance I would greatly appreciate it, thank you!
>
http://manual.gromacs.org/documentation/2016.4/user-guide/mdrun-performance.html
but many points won't apply because you're not using PME, so you're in the
easy case. You should be able to scale down to fewer than about 500 particles
per core, but the actual target varies heavily with the hardware, the use of
vsites, and the network performance.
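To put that rule of thumb into numbers (the system size below is purely hypothetical, not from this thread):

```shell
# Back-of-the-envelope core-count ceiling from the ~500 particles/core
# rule of thumb. The particle count is a hypothetical example.
particles=384000
min_per_core=500
max_cores=$((particles / min_per_core))
echo "Rough upper bound: $max_cores cores"
```

Past that point, adding ranks mostly adds communication overhead rather than speed.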
Mark