[gmx-developers] Trying to understand performance issue - due to all threads pinning to first core?

Roland Schulz roland at utk.edu
Wed Dec 19 19:51:05 CET 2012


Hi,

you can use -nopin, but you will get slightly lower performance. You can
also use -pinoffset to number the different Gromacs instances you are
running any way you wish. Most programs don't pin their threads, so pinning
is unlikely to conflict with other programs if you are running Gromacs
alongside them.
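
For example, a minimal sketch of what that numbering could look like for two
single-threaded instances sharing one node (the offset values and the
run0/run1 file prefixes are just placeholders; check mdrun -h for the exact
option spelling in your beta):

mdrun_d -ntomp 1 -ntmpi 1 -pinoffset 0 -deffnm run0 &
mdrun_d -ntomp 1 -ntmpi 1 -pinoffset 1 -deffnm run1 &
wait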

Roland


On Wed, Dec 19, 2012 at 1:23 PM, Shirts, Michael (mrs5pt) <
mrs5pt at eservices.virginia.edu> wrote:

> Hi, all-
>
> I'm trying to figure out the reason for a performance hit when running on a
> single core with the new code, which is specifically reflected in a core
> time that is significantly less than the wall time (about 1/3).  Apologies
> if this has already been discussed and I missed it!  I have some theories,
> but need some more help figuring this out.
>
> Executive summary -- is there something which causes all jobs sent to the
> same node to be pinned to the first core, so that if there are 8 jobs
> requesting 1 thread each on an 8-core node, they will just steal cycles
> from each other on the first core rather than running on different cores?
> If so, how can this be avoided?  Looking at the online docs, it seems that
> the -pinoffset option might help, but there is no way to tell beforehand
> where the jobs will be sent, or what other users will be doing with THEIR
> programs.  Is there a way to make this simpler?  To 'just work' and use the
> available cores like it did in 4.5.5?
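>
> (A minimal sketch of one possible workaround, assuming the -nopin flag
> discussed elsewhere in this thread: disable pinning and let the OS
> scheduler place the single-core jobs, at the cost of slightly lower
> per-job performance.)
>
> #PBS -l nodes=1:ppn=1
> mdrun_d -nt 1 -nopin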
>
> Details:
>
> In everything, I am using only the group cutoff scheme and only thread_mpi
> (though with only one thread, so thread_mpi shouldn't matter).
>
> When I run on a single core using a PBS script (showing only the CPU
> selection line of the script), for example:
>
> #PBS -l select=1:mpiprocs=1:ncpus=1
>
> Running command:
>
> mdrun_d -ntomp 1 -ntmpi 1
>
> I find that with the 4.6 beta I get:
>
> >                Core t (s)   Wall t (s)        (%)
> >        Time:      526.200     1586.919       33.2
> >                  (ns/day)    (hour/ns)
> > Performance:        1.089       22.038
>
> Note that the core time is only about 1/3 of the wall time.
>
> This also occurs when running with simply:
>
> #PBS -l nodes=1:ppn=1
> mdrun_d -nt 1
>
> However, other runs with identical call parameters got up to 96%
> utilization.  Logging directly onto the compute nodes and running 'top', I
> found that the CPU use percent was somewhere between 10 and 40% for the 8
> jobs running (all of which used 1 thread). It should have been 100% for
> each, as far as I can tell.   When I was able to isolate a run that was
> going faster, I logged into its compute node and found that it was indeed
> running alone, with a CPU utilization determined by 'top' of near 100%.
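>
> (One way to confirm where each process actually lands, a sketch using
> standard Linux tools rather than anything GROMACS-specific; <pid> is a
> placeholder for the process ID reported by ps:
>
> ps -o pid,psr,pcpu,comm -C mdrun_d    # psr = core the process last ran on
> taskset -cp <pid>                     # cores the process is allowed to use
>
> If every mdrun_d shows psr 0 and an allowed-core list of just "0", they are
> all pinned to the first core.)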
>
> So, is there something pinning single-core jobs to the first core?
>
> When running a different chemical system (which is inherently faster; same
> 4.6 code) with all 8 processors:
>
> #PBS -l select=1:mpiprocs=8:ncpus=8
> mdrun_d -ntmpi 8
>
> >               Core t (s)   Wall t (s)        (%)
> >       Time:   324684.500    40826.497      795.3
> >                 (ns/day)    (hour/ns)
> >Performance:      321.181        0.075
>
> Here, we get near full resources: utilization is 795.3/8 = 99.4%
>
> With older code (modifications of 4.5.5), and the same system as the first
> example, running:
>
> #PBS -l nodes=1:ppn=1
> mdrun_d -nt 1
>
> then even though the core/node time drops by 10-15% with 4.6 (yay speed
> increases!), the old code's utilization is much closer to 100%, so its
> single-process throughput is much better.  These results are very
> consistent; they don't depend on what else is being run on the node.
>
> Old code:
>
> >                NODE (s)   Real (s)      (%)
> >        Time:    611.460    623.325     98.1
> >                        10:11
> >                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> > Performance:     36.451      1.823      2.826      8.492
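>
> (Putting the two single-core runs side by side: the old build needs 611 s
> of core time but only 623 s of wall time, while the new build needs just
> 526 s of core time yet 1587 s of wall time, so wall-clock throughput is
> roughly 2.5x worse with the new build despite its faster kernels.)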
>
>
> Some information from the new code setup (from the log)
>
> Host: lc5-compute-1-2.local  pid: 32227  nodeid: 0  nnodes:  1
> Gromacs version:    VERSION 4.6-beta2-dev-20121217-e233b32
> GIT SHA1 hash:      e233b3231ae94805ae489840133ffcc225263d3a
> Branched from:      c5706f32cc2363c50b61ec0a207bf93dc20220a1 (4 newer local
> commits)
> Precision:          double
> MPI library:        thread_mpi
> OpenMP support:     enabled
> GPU support:        disabled
> invsqrt routine:    gmx_software_invsqrt(x)
> CPU acceleration:   SSE2
> FFT library:        fftw-3.2.2
> Large file support: enabled
> RDTSCP usage:       disabled
> Built on:           Mon Dec  3 10:14:02 EST 2012
> Built by:           mrs5pt at fir-s.itc.virginia.edu [CMAKE]
> Build OS/arch:      Linux 2.6.18-308.11.1.el5 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
> Build CPU family:   6   Model: 26   Stepping: 5
> Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr
> nonstop_tsc
> pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> C compiler:         /usr/bin/gcc GNU gcc (GCC) 4.1.2 20080704 (Red Hat
> 4.1.2-50)
> C compiler flags:   -msse2  -Wextra -Wno-missing-field-initializers
> -Wno-sign-compare -Wall -Wno-unused -Wunused-value   -fomit-frame-pointer
> -funroll-all-loops  -O3 -DNDEBUG
>
> . . .
>
> Using 1 MPI thread
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
> Family:  6  Model: 23  Stepping: 10
> Features: apic clfsh cmov cx8 cx16 lahf_lm mmx msr pdcm pse sse2 sse3
> sse4.1
> ssse3
> Acceleration most likely to fit this hardware: SSE4.1
> Acceleration selected at GROMACS compile time: SSE2
>
>
> Binary not matching hardware - you might be losing performance.
> Acceleration most likely to fit this hardware: SSE4.1
> Acceleration selected at GROMACS compile time: SSE2
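>
> (Side note: the warning above can be addressed by rebuilding for the
> compute nodes' CPUs; a minimal sketch, assuming the 4.6-era CMake variable
> names GMX_CPU_ACCELERATION and GMX_DOUBLE:
>
> cmake .. -DGMX_CPU_ACCELERATION=SSE4.1 -DGMX_DOUBLE=ON
> make && make install
>
> This should not affect the core-time/wall-time ratio, only raw speed.)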
>
>
> Best,
> ~~~~~~~~~~~~
> Michael Shirts
> Assistant Professor
> Department of Chemical Engineering
> University of Virginia
> michael.shirts at virginia.edu
> (434)-243-1821
>


-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309