[gmx-users] No performance increase with single vs multiple nodes
Matthew W Hanley
mwhanley at syr.edu
Sun Oct 8 02:39:41 CEST 2017
I am running GROMACS 2016.3 on CentOS 7.3, submitted through a PBS scheduler with the following job script:
#PBS -N TEST
#PBS -l nodes=1:ppn=32
export OMP_NUM_THREADS=1
mpirun -N 32 mdrun_mpi -deffnm TEST -dlb yes -pin on -nsteps 50000 -cpi TEST
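For completeness, the larger jobs use the same script with only the node request scaled up (if this mpirun is Open MPI's, -N 32 means 32 ranks per node, so the total rank count follows the node count). E.g. the 128-rank run:

#PBS -N TEST
#PBS -l nodes=4:ppn=32
# assumed layout: 4 nodes x 32 ranks/node = 128 MPI ranks
export OMP_NUM_THREADS=1
mpirun -N 32 mdrun_mpi -deffnm TEST -dlb yes -pin on -nsteps 50000 -cpi TEST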
However, I am seeing essentially no performance increase when scaling to more nodes:
On 32 MPI ranks

               Core t (s)   Wall t (s)        (%)
       Time:    28307.873      884.621     3200.0
                 (ns/day)    (hour/ns)
Performance:      195.340        0.123

On 64 MPI ranks

               Core t (s)   Wall t (s)        (%)
       Time:    25502.709      398.480     6400.0
                 (ns/day)    (hour/ns)
Performance:      216.828        0.111

On 96 MPI ranks

               Core t (s)   Wall t (s)        (%)
       Time:    51977.705      541.434     9600.0
                 (ns/day)    (hour/ns)
Performance:      159.579        0.150

On 128 MPI ranks

               Core t (s)   Wall t (s)        (%)
       Time:   111576.333      871.690    12800.0
                 (ns/day)    (hour/ns)
Performance:      198.238        0.121
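In other words, relative to the 32-rank baseline the scaling efficiency works out to roughly (216.828 / 195.340) / 2 ≈ 55% at 64 ranks, (159.579 / 195.340) / 3 ≈ 27% at 96, and (198.238 / 195.340) / 4 ≈ 25% at 128; per-core throughput drops off sharply past a single node.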
Running strace on the mdrun process shows mostly this:
gettimeofday({1502811207, 567216}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=17, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}], 5, 0) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {988952, 423108300}) = 0
Process 4818 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.62    0.032132           0   1243696           clock_gettime
 20.41    0.012954           0    399833           futex
 12.59    0.007991         363        22           close
  4.71    0.002989         498         6         2 fsync
  4.06    0.002579          18       141           write
  2.39    0.001515           0     46632           gettimeofday
  2.35    0.001494           0     23316           poll
  1.55    0.000981         981         1           rename
  1.16    0.000734         147         5         3 open
  0.15    0.000093           1        70           munmap
  0.03    0.000018           9         2         1 epoll_ctl
  0.00    0.000002           1         4           nanosleep
  0.00    0.000001           0         9           lseek
  0.00    0.000000           0         7           read
  0.00    0.000000           0         4         1 stat
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         2           mmap
  0.00    0.000000           0        10           mprotect
  0.00    0.000000           0         8           brk
  0.00    0.000000           0         2           shutdown
  0.00    0.000000           0         1           uname
  0.00    0.000000           0        10           getdents
  0.00    0.000000           0         1           rmdir
  0.00    0.000000           0        17        16 unlink
  0.00    0.000000           0         7         1 openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.063483               1713811        24 total
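For reference, both traces were taken by attaching to one of the mdrun_mpi ranks for a few seconds, along the lines of:

strace -p 4818       # live trace; mostly the gettimeofday/poll loop shown above
strace -c -p 4818    # Ctrl-C after a few seconds prints the syscall summary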
And here is the compilation information:
GROMACS version: 2016.3
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support: disabled
SIMD instructions: AVX2_256
FFT library: fftw-3.3.5
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: hwloc-1.11.0
Tracing support: disabled
Built on: Fri Aug 11 16:23:00 EDT 2017
Built by: citadmin at CRUSH-LCS-10-51-51-163 [CMAKE]
Build OS/arch: Linux 3.10.0-514.21.2.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz
Build CPU family: 6 Model: 63 Stepping: 4
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle lahf mmx msr pclmuldq popcnt pse rdrnd rtm sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/local/bin/mpicc GNU 6.2.1
C compiler flags: -march=core-avx2 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/local/bin/mpicxx GNU 6.2.1
C++ compiler flags: -march=core-avx2 -std=c++0x -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
Any help would be appreciated, thank you!
-Matt
Matthew Hanley
IT Analyst
College of Engineering and Computer Science
Syracuse University
mwhanley at syr.edu