[gmx-developers] Trying to understand performance issue - due to all threads pinning to first core?

Shirts, Michael (mrs5pt) mrs5pt at eservices.virginia.edu
Wed Dec 19 19:23:20 CET 2012

Hi, all-

I'm trying to figure out the reason for a performance hit when running on a
single core with the new code, which is specifically reflected in a core
time that is significantly less than the wall time (about 1/3).  Apologies
if this has already been discussed and I missed it!  I have some theories,
but need some more help figuring this out.

Executive summary -- is there something that causes all jobs sent to the
same node to be pinned to the first core, so that if there are 8 jobs
requesting 1 thread each on an 8-core node, they just steal cycles from each
other on the first core rather than running on different cores?  If so,
how can this be avoided?  Looking at the online docs, it seems the -pinoffset
option might help, but there is no way to tell beforehand where the jobs
will be sent, or what other users will be doing with THEIR programs.  Is
there a way to make this simpler?  To 'just work' and use the available
cores like it did in 4.5.5?
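For concreteness, the kind of workaround the docs seem to point at would be
a job script along these lines -- a sketch only: the -pin/-pinoffset option
names are taken from the 4.6 online docs (not verified against this beta),
and the hard-coded offsets assume one script owns the whole node, which is
exactly the assumption that fails on a shared cluster:

```shell
# Sketch: eight single-thread jobs on one 8-core node, each given a
# distinct core offset so they do not all land on core 0.
# (-pin/-pinoffset option names per the 4.6 docs; unverified here.)
for i in 0 1 2 3 4 5 6 7; do
    mdrun_d -ntomp 1 -ntmpi 1 -pin on -pinoffset $i &
done
wait
```

But this only helps if one job (or script) controls the whole node; when the
queue packs unrelated users' jobs onto the same node, no single job knows
which offsets are free.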


In everything below, I am using group cutoffs and thread_mpi only (though
with only one thread, so thread_mpi shouldn't matter).

When I run with a single core using a PBS script (including only the cpu
selection line in the PBS script), for example:

#PBS -l select=1:mpiprocs=1:ncpus=1

Running command:

mdrun_d -ntomp 1 -ntmpi 1

I find that with the 4.6 beta I get:

>                Core t (s)   Wall t (s)        (%)
>        Time:      526.200     1586.919       33.2
>                  (ns/day)    (hour/ns)
> Performance:        1.089       22.038

Note that the core time is only about 1/3 of the wall time.

This also occurs when running simply with:

#PBS -l nodes=1:ppn=1
mdrun_d -nt 1

However, other runs with identical call parameters got up to 96%
utilization.  Logging directly onto the compute nodes and running 'top', I
found that the CPU use percentage was somewhere between 10 and 40% for the 8
jobs running (all of which used 1 thread).  It should have been 100% for
each, as far as I can tell.  When I was able to isolate a run that was
going faster, I logged into its compute node and found that it was indeed
running alone, with a CPU utilization near 100% according to 'top'.

So, is there something pinning single-thread jobs to the first core?
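One way to check this directly on a compute node, independent of 'top' (a
Linux-only sketch reading the kernel's affinity mask out of /proc):

```shell
# The kernel reports each process's CPU affinity mask in /proc/<pid>/status.
# For a running mdrun process, substitute its pid for "self".
grep Cpus_allowed_list /proc/self/status
# A job pinned to the first core would report:
#   Cpus_allowed_list:   0
# whereas an unpinned job on an 8-core node would report 0-7.
```

If all 8 mdrun processes on a node report '0', that would confirm the
pinning theory.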

When running with all 8 cores (a different chemical system that is
inherently faster; same 4.6 code):

#PBS -l select=1:mpiprocs=8:ncpus=8
mdrun_d -ntmpi 8

>               Core t (s)   Wall t (s)        (%)
>       Time:   324684.500    40826.497      795.3
>                 (ns/day)    (hour/ns)
>Performance:      321.181        0.075

Here, we get nearly full use of the resources: utilization is 795.3% / 8 =
99.4% per core.
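(As a sanity check on the arithmetic: the percentage mdrun reports is just
core time over wall time, and dividing by the thread count gives per-core
utilization.  For the two 4.6 runs above:)

```shell
# Utilization = 100 * core time / wall time; divide by the thread count
# to get per-core utilization.  Numbers from the two 4.6 log excerpts.
awk 'BEGIN { printf "1-thread run: %.1f%%\n", 100 * 526.200 / 1586.919 }'
awk 'BEGIN { printf "8-thread run: %.1f%% per core\n", 795.3 / 8 }'
```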

With older code (modifications of 4.5.5), and the same system as the first
example, running: 

#PBS -l nodes=1:ppn=1
mdrun_d -nt 1

then even though the new code's core time is 10-15% lower (yay speed
increases in 4.6!), the old code's wall time is much closer to 100% of its
core time, so its single-process throughput is much better than the new
code's.  These results are very consistent; they don't depend on what else
is being run on the node.

Old code:

>                NODE (s)   Real (s)      (%)
>        Time:    611.460    623.325     98.1
>                        10:11
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:     36.451      1.823      2.826      8.492

Some information from the new code setup (from the log)

Host: lc5-compute-1-2.local  pid: 32227  nodeid: 0  nnodes:  1
Gromacs version:    VERSION 4.6-beta2-dev-20121217-e233b32
GIT SHA1 hash:      e233b3231ae94805ae489840133ffcc225263d3a
Branched from:      c5706f32cc2363c50b61ec0a207bf93dc20220a1 (4 newer local
Precision:          double
MPI library:        thread_mpi
OpenMP support:     enabled
GPU support:        disabled
invsqrt routine:    gmx_software_invsqrt(x)
CPU acceleration:   SSE2
FFT library:        fftw-3.2.2
Large file support: enabled
RDTSCP usage:       disabled
Built on:           Mon Dec  3 10:14:02 EST 2012
Built by:           mrs5pt at fir-s.itc.virginia.edu [CMAKE]
Build OS/arch:      Linux 2.6.18-308.11.1.el5 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
Build CPU family:   6   Model: 26   Stepping: 5
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler:         /usr/bin/gcc GNU gcc (GCC) 4.1.2 20080704 (Red Hat
C compiler flags:   -msse2  -Wextra -Wno-missing-field-initializers
-Wno-sign-compare -Wall -Wno-unused -Wunused-value   -fomit-frame-pointer
-funroll-all-loops  -O3 -DNDEBUG

. . . 

Using 1 MPI thread

Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand:  Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
Family:  6  Model: 23  Stepping: 10
Features: apic clfsh cmov cx8 cx16 lahf_lm mmx msr pdcm pse sse2 sse3 sse4.1
Acceleration most likely to fit this hardware: SSE4.1
Acceleration selected at GROMACS compile time: SSE2

Binary not matching hardware - you might be losing performance.
Acceleration most likely to fit this hardware: SSE4.1
Acceleration selected at GROMACS compile time: SSE2

Michael Shirts
Assistant Professor
Department of Chemical Engineering
University of Virginia
michael.shirts at virginia.edu
