[gmx-users] CPU running doesn't match command line
Szilárd Páll
pall.szilard at gmail.com
Wed Aug 24 01:04:54 CEST 2016
On Wed, Aug 24, 2016 at 1:03 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> On Mon, Aug 22, 2016 at 5:36 PM, Albert <mailmd2011 at gmail.com> wrote:
>> Hello Mark:
>>
>> I've recompiled Gromacs without MPI. I submitted the jobs with the command
>> lines you suggested.
>>
>> gmx mdrun -ntomp 10 -v -g test.log -pin on -pinoffset 0 -gpu_id 0 -s
>> test.tpr >& test.info
>> gmx mdrun -ntomp 10 -v -g test.log -pin on -pinoffset 10 -gpu_id 1 -s
>> test.tpr >& test.info
>
> You need to pass a "-pinstride 1" there -- at least to the first
> launch, but best if you do it for both (full explanation below).
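>
> Concretely, with your file names, that would just be your two command
> lines with the stride added (a sketch, not tested on your node):
>
> gmx mdrun -ntomp 10 -v -g test.log -pin on -pinoffset 0 -pinstride 1 -gpu_id 0 -s test.tpr
> gmx mdrun -ntomp 10 -v -g test.log -pin on -pinoffset 10 -pinstride 1 -gpu_id 1 -s test.tpr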
>
> You may want to consider using mdrun -multi, which will automatically
> set up the correct pinning for all its sub-simulations (it can even be
> used across multiple nodes with any network to run a set of
> simulations); note that with 5.x I recommend it only if the two
> runs are expected to run at the same speed (as they are synced up
> even if they don't communicate), but this should be greatly improved
> in the '16 release.
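>
> A rough sketch of such a launch (assuming an MPI-enabled build,
> gmx_mpi, and that -multi will look for inputs named test0.tpr and
> test1.tpr, since it appends the simulation index to the -s name):
>
> mpirun -np 2 gmx_mpi mdrun -multi 2 -ntomp 10 -pin on -gpu_id 01 -s test.tpr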
>
>
> The reason is that the first mdrun assumes it runs alone on the node,
> so it does not make use of Hyper-Threading but rather uses all physical
> cores first. Hence, when started with only 10 threads on a 10-core /
> 20-thread CPU, it binds its threads to 10 different cores rather than
> to the 10 hardware threads of the first 5 cores. The second run does
> not make this assumption anymore (a non-zero offset is correctly
> interpreted as a hint that multiple jobs are run side-by-side); hence,
> this run uses the last 5 cores with 2 threads each. Consequently, in
> total you'll have 20 threads running, 5 of them pinned to hardware
> threads already used by the first run => 15 hardware threads used =>
> top shows 7.5 "CPUs" used.
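> (To spell the numbers out: each run then has 5 threads with a hardware
> thread to themselves and 5 threads sharing one with the other run, so
> per process top shows roughly 5*100% + 5*50% = 750%, i.e. ~7.5 "CPUs",
> and only 15 distinct hardware threads are busy in total, which is the
> 15 "cores" you observed.)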
>
> Cheers
> --
> Szilárd
>
> PS: In this discussion thread alone I've asked for *complete* log
> files twice. I'd really appreciate if such requests were taken
> seriously.
PPS:
To understand the stride issue and the thread layout, which admittedly
can be confusing, you can inspect the lstopo output (assuming you have
it installed) or run the latest release which, if compiled with hwloc
support, will print something like this (for a 6-core processor with
HT):
Sockets, cores, and logical processors:
  Socket  0: [   0   6] [   1   7] [   2   8] [   3   9] [   4  10] [   5  11]
Consequently, a stride of 1 will put threads on hardware threads
("logical processors") 0,6,1,7,..., while a stride of 2 will not use the
second hardware thread of each core and will hence pin starting with
0,1,2,3,...
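
For your 10-core E5-2690 v2 the same listing should look roughly like
this (assuming the usual Linux enumeration, where logical processors
0-9 are the first hardware threads of the ten cores and 10-19 their
Hyper-Threading siblings):

  Socket  0: [   0  10] [   1  11] [   2  12] [   3  13] [   4  14] [   5  15] [   6  16] [   7  17] [   8  18] [   9  19]

With -pinstride 1, -pinoffset 0 then covers 0,10,1,11,...,4,14 (both
hardware threads of the first 5 cores) and -pinoffset 10 covers
5,15,...,9,19, so the two runs do not overlap and all 20 hardware
threads get used.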
--
Szilárd
>>
>> I specified 20 CPU cores in all, but I noticed that only 15 cores were
>> actually being used. I am pretty confused by that.
>>
>>
>> Here is my log file:
>>
>>
>> GROMACS: gmx mdrun, VERSION 5.1.3
>> Executable: /soft/gromacs/5.1.3_intel-thread/bin/gmx
>> Data prefix: /soft/gromacs/5.1.3_intel-thread
>> Command line:
>> gmx mdrun -ntomp 10 -v -g test.log -pin on -pinoffset 0 -gpu_id 0 -s
>> test.tpr
>> Hardware detected:
>> CPU info:
>> Vendor: GenuineIntel
>> Brand: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
>> SIMD instructions most likely to fit this hardware: AVX_256
>> SIMD instructions selected at GROMACS compile time: AVX_256
>> GPU info:
>> Number of GPUs detected: 2
>> #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
>> #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
>>
>> Reading file test.tpr, VERSION 5.1.3 (single precision)
>> Using 1 MPI thread
>> Using 10 OpenMP threads
>>
>> 1 GPU user-selected for this run.
>> Mapping of GPU ID to the 1 PP rank in this node: 0
>>
>> starting mdrun 'Title'
>> 5000000 steps, 10000.0 ps.
>> step 80: timed with pme grid 60 60 96, coulomb cutoff 1.000: 1634.1 M-cycles
>> step 160: timed with pme grid 56 56 84, coulomb cutoff 1.047: 1175.4 M-cycles
>>
>>
>>
>>
>> GROMACS: gmx mdrun, VERSION 5.1.3
>> Executable: /soft/gromacs/5.1.3_intel-thread/bin/gmx
>> Data prefix: /soft/gromacs/5.1.3_intel-thread
>> Command line:
>> gmx mdrun -ntomp 10 -v -g test.log -pin on -pinoffset 10 -gpu_id 1 -s
>> test.tpr
>>
>> Running on 1 node with total 10 cores, 20 logical cores, 2 compatible GPUs
>> Hardware detected:
>> CPU info:
>> Vendor: GenuineIntel
>> Brand: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
>> SIMD instructions most likely to fit this hardware: AVX_256
>> SIMD instructions selected at GROMACS compile time: AVX_256
>> GPU info:
>> Number of GPUs detected: 2
>> #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
>> #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
>>
>> Reading file test.tpr, VERSION 5.1.3 (single precision)
>> Using 1 MPI thread
>> Using 10 OpenMP threads
>>
>> 1 GPU user-selected for this run.
>> Mapping of GPU ID to the 1 PP rank in this node: 1
>>
>> Applying core pinning offset 10
>> starting mdrun 'Title'
>> 5000000 steps, 10000.0 ps.
>> step 80: timed with pme grid 60 60 84, coulomb cutoff 1.000: 657.2 M-cycles
>> step 160: timed with pme grid 52 52 80, coulomb cutoff 1.096: 622.8 M-cycles
>> step 240: timed with pme grid 48 48 72, coulomb cutoff 1.187: 593.9 M-cycles
>>
>>
>>
>>
>>
>> On 08/18/2016 02:13 PM, Mark Abraham wrote:
>>>
>>> Hi,
>>>
>>> It's a bit curious to want to run two 8-thread jobs on a machine with 10
>>> physical cores, because you'll get lots of performance imbalance when
>>> some threads must share the same physical core, but I guess it's a free
>>> world. As I suggested the other day,
>>>
>>> http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html#examples-for-mdrun-on-one-node
>>> has some examples. The fact that you've compiled and linked with an MPI
>>> library means it may be involving itself in the thread-affinity
>>> management, but whether it is doing that is something between you, it,
>>> the docs and the cluster admins. If you just want to run on a single
>>> node, do yourself a favour and build the thread-MPI flavour.
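>>>
>>> For reference, a configure line for such a thread-MPI build might look
>>> roughly like this (just a sketch; thread-MPI is the default when MPI is
>>> off, and the install prefix is the one from your log):
>>>
>>> cmake .. -DGMX_MPI=OFF -DGMX_GPU=ON -DCMAKE_INSTALL_PREFIX=/soft/gromacs/5.1.3_intel-thread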
>>>
>>> If so, you probably want something more like
>>> gmx mdrun -ntomp 10 -pin on -pinoffset 0 -gpu_id 0 -s run1
>>> gmx mdrun -ntomp 10 -pin on -pinoffset 10 -gpu_id 1 -s run2
>>>
>>> If you want to use the MPI build, then I suggest you read up on how its
>>> mpirun will let you manage keeping the threads of processes where you want
>>> them (ie apart).
>>>
>>> Mark
>>
>>