[gmx-users] CPU running doesn't match command line

Mark Abraham mark.j.abraham at gmail.com
Thu Aug 18 14:13:54 CEST 2016


Hi,

It's a bit curious to want to run two 8-thread jobs on a machine with 10
physical cores, because some threads must then share a physical core and
you'll get a lot of performance imbalance, but I guess it's a free world. As
I suggested the other day,
http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html#examples-for-mdrun-on-one-node
has some examples. Because you've compiled and linked against an MPI library,
that library may be involving itself in the thread-affinity management;
whether it actually does so is something between you, it, the docs and the
cluster admins. If you just want to run on a single node, do yourself
a favour and build the thread-MPI flavour.
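
For example, here is a rough sketch of a corresponding thread-MPI configure
line, reusing the paths from the build settings you quote further down (the
install prefix here is just an illustrative name, and GMX_MPI is simply left
out so the default thread-MPI build is produced):

  CMAKE_PREFIX_PATH=/soft/gromacs/fftw-3.3.4 cmake .. \
      -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
      -DCMAKE_INSTALL_PREFIX=/soft/gromacs/5.1.3_tmpi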

If so, you probably want more like
gmx mdrun -ntomp 10 -pin on -pinoffset 0 -gpu_id 0 -s run1
gmx mdrun -ntomp 10 -pin on -pinoffset 10 -gpu_id 1 -s run2
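
To check whether the pinning actually took effect, one quick way (just a
sketch, assuming Linux and that "pgrep -f mdrun" matches only your simulation
processes) is to list the CPU affinity of every mdrun thread:

  for pid in $(pgrep -f mdrun); do
      for tid in /proc/$pid/task/*; do
          taskset -cp "${tid##*/}"   # prints the allowed CPU list for each thread
      done
  done

If the lists for the two jobs overlap, they are fighting over the same cores,
which would explain the kind of slowdown you describe.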

If you want to use the MPI build, then I suggest you read up on how its
mpirun lets you keep the threads of the two processes where you want
them (i.e. apart).
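
With Intel MPI (which your build uses), one option is to keep the MPI library
out of the affinity business entirely and let mdrun's own pinning take over.
This is only a sketch, so check the Intel MPI documentation for the variables
your version actually supports:

  I_MPI_PIN=off mpirun -np 1 gmx_mpi mdrun -ntomp 10 -pin on -pinoffset 0  -gpu_id 0 -s run1 &
  I_MPI_PIN=off mpirun -np 1 gmx_mpi mdrun -ntomp 10 -pin on -pinoffset 10 -gpu_id 1 -s run2 &

Alternatively, Intel MPI's own pinning controls (e.g. I_MPI_PIN_DOMAIN) can
confine each job to its own set of cores, in which case you would run mdrun
with -pin off.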

Mark

On Thu, Aug 18, 2016 at 7:57 AM Albert <mailmd2011 at gmail.com> wrote:

> Does anybody have more suggestions?
>
> thx a lot
>
>
> On 08/17/2016 09:07 AM, Albert wrote:
> > Hello:
> >
> > Here is the information that you asked for.
> >
> >   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8
> > -gpu_id 0 -pin on
> >
> ------------------------------------------------------------------------
> >
> > GROMACS:      gmx mdrun, VERSION 5.1.3
> > Executable:   /soft/gromacs/5.1.3_intel/bin/gmx_mpi
> > Data prefix:  /soft/gromacs/5.1.3_intel
> > Command line:
> >   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8
> > -gpu_id 0 -pin on
> >
> > GROMACS version:    VERSION 5.1.3
> > Precision:          single
> > Memory model:       64 bit
> > MPI library:        MPI
> > OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> > GPU support:        enabled
> > OpenCL support:     disabled
> > invsqrt routine:    gmx_software_invsqrt(x)
> > SIMD instructions:  AVX_256
> > FFT library:        fftw-3.3.4-sse2
> > RDTSCP usage:       enabled
> > C++11 compilation:  disabled
> > TNG support:        enabled
> > Tracing support:    disabled
> > Built on:           Thu Aug 11 16:15:26 CEST 2016
> > Built by:           albert at cudaB [CMAKE]
> > Build OS/arch:      Linux 3.16.7-35-desktop x86_64
> > Build CPU vendor:   GenuineIntel
> > Build CPU brand:    Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
> > Build CPU family:   6   Model: 62   Stepping: 4
> > Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
> > mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
> > sse2 sse3 sse4.1
> > sse4.2 ssse3 tdt x2apic
> > C compiler:         /soft/intel/impi/5.1.3.223/bin64/mpicc GNU 4.8.3
> > C compiler flags:   -mavx -Wextra -Wno-missing-field-initializers
> > -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value
> > -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops
> > -fexcess-precision=fast -Wno-array-bounds
> > C++ compiler:       /soft/intel/impi/5.1.3.223/bin64/mpicxx GNU 4.8.3
> > C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers
> > -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG
> > -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
> > Boost version:      1.54.0 (external)
> > CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
> > compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> > Wed_May__4_21:01:56_CDT_2016;Cuda compilation tools, release 8.0, V8.0.26
> > CUDA compiler flags: -gencode;arch=compute_20,code=sm_20;
> > -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
> > -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
> > -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
> > -gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;
> > -gencode;arch=compute_61,code=compute_61;-use_fast_math;
> > -mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;
> > -Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;
> > -fexcess-precision=fast;-Wno-array-bounds
> >
> > CUDA driver:        8.0
> > CUDA runtime:       8.0
> >
> > Running on 1 node with total 10 cores, 20 logical cores, 2 compatible
> > GPUs
> > Hardware detected on host cudaB (the node of MPI rank 0):
> >   CPU info:
> >     Vendor: GenuineIntel
> >     Brand:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
> >     Family:  6  model: 62  stepping:  4
> >     CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
> > mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp
> > sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> >     SIMD instructions most likely to fit this hardware: AVX_256
> >     SIMD instructions selected at GROMACS compile time: AVX_256
> >   GPU info:
> >     Number of GPUs detected: 2
> >     #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat:
> > compatible
> >     #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat:
> > compatible
> >
> ------------------------------------------------------------------------
> >
> >
> >
> >
> >
> >
> >
> >
> >   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8
> > -gpu_id 1 -pin on -cpi -append -pinoffset 8
> >
> ------------------------------------------------------------------------
> >
> > GROMACS:      gmx mdrun, VERSION 5.1.3
> > Executable:   /soft/gromacs/5.1.3_intel/bin/gmx_mpi
> > Data prefix:  /soft/gromacs/5.1.3_intel
> > Command line:
> >   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8
> > -gpu_id 1 -pin on -cpi -append -pinoffset 8
> >
> >
> > Running on 1 node with total 10 cores, 20 logical cores, 2 compatible
> > GPUs
> > Hardware detected on host cudaB (the node of MPI rank 0):
> >   CPU info:
> >     Vendor: GenuineIntel
> >     Brand:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
> >     SIMD instructions most likely to fit this hardware: AVX_256
> >     SIMD instructions selected at GROMACS compile time: AVX_256
> >   GPU info:
> >     Number of GPUs detected: 2
> >     #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat:
> > compatible
> >     #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat:
> > compatible
> >
> > Reading file 7.tpr, VERSION 5.1.3 (single precision)
> >
> > Reading checkpoint file state.cpt generated: Wed Aug 17 09:01:46 2016
> >
> >
> > Using 1 MPI process
> > Using 8 OpenMP threads
> >
> > 1 GPU user-selected for this run.
> > Mapping of GPU ID to the 1 PP rank in this node: 1
> >
> > Applying core pinning offset 8
> > starting mdrun 'Title'
> > 50000000 steps, 100000.0 ps (continuing from step 5746000, 11492.0 ps).
> > step 5746080: timed with pme grid 60 60 84, coulomb cutoff 1.000:
> > 2451.9 M-cycles
> >
> >
> >
> >
> >
> >
> >
> > On 08/16/2016 05:27 PM, Szilárd Páll wrote:
> >> Most of that copy-pasted info is not what I asked for and overall not
> >> very useful. You have still not shown any log files (or details on the
> >> hardware). Share the *relevant* stuff, please!
> >> --
> >> Szilárd
> >>
> >>
> >> On Tue, Aug 16, 2016 at 5:07 PM, Albert <mailmd2011 at gmail.com> wrote:
> >>> Hello:
> >>>
> >>> Here is my MDP file:
> >>>
> >>> define                  = -DREST_ON -DSTEP6_4
> >>> integrator              = md
> >>> dt                      = 0.002
> >>> nsteps                  = 1000000
> >>> nstlog                  = 1000
> >>> nstxout                 = 0
> >>> nstvout                 = 0
> >>> nstfout                 = 0
> >>> nstcalcenergy           = 100
> >>> nstenergy               = 1000
> >>> nstxout-compressed      = 10000
> >>> ;
> >>> cutoff-scheme           = Verlet
> >>> nstlist                 = 20
> >>> rlist                   = 1.0
> >>> coulombtype             = pme
> >>> rcoulomb                = 1.0
> >>> vdwtype                 = Cut-off
> >>> vdw-modifier            = Force-switch
> >>> rvdw_switch             = 0.9
> >>> rvdw                    = 1.0
> >>> ;
> >>> tcoupl                  = berendsen
> >>> tc_grps                 = PROT   MEMB   SOL_ION
> >>> tau_t                   = 1.0    1.0    1.0
> >>> ref_t                   = 310   310   310
> >>> ;
> >>> pcoupl                  = berendsen
> >>> pcoupltype              = semiisotropic
> >>> tau_p                   = 5.0
> >>> compressibility         = 4.5e-5  4.5e-5
> >>> ref_p                   = 1.0     1.0
> >>> ;
> >>> constraints             = h-bonds
> >>> constraint_algorithm    = LINCS
> >>> continuation            = yes
> >>> ;
> >>> nstcomm                 = 100
> >>> comm_mode               = linear
> >>> comm_grps               = PROT   MEMB   SOL_ION
> >>> ;
> >>> refcoord_scaling        = com
> >>>
> >>>
> >>> I compiled Gromacs with the following settings, using Intel MPI:
> >>>
> >>> env CC=mpicc CXX=mpicxx F77=mpif90 FC=mpif90 LDF90=mpif90 \
> >>>     CMAKE_PREFIX_PATH=/soft/gromacs/fftw-3.3.4:/soft/intel/impi/5.1.3.223 \
> >>>     cmake .. -DBUILD_SHARED_LIB=OFF -DBUILD_TESTING=OFF \
> >>>     -DCMAKE_INSTALL_PREFIX=/soft/gromacs/5.1.3_intel -DGMX_MPI=ON -DGMX_GPU=ON \
> >>>     -DGMX_PREFER_STATIC_LIBS=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
> >>>
> >>>
> >>> I tried it again, running one of the jobs with the options:
> >>>
> >>> -ntomp 8 -pin on -pinoffset 8
> >>>
> >>> The two submitted jobs still only use 8 CPUs and the speed is extremely
> >>> slow (10 ns/day). When I remove the option "-pin on" from one of the
> >>> jobs, it speeds up a lot (32 ns/day) and 16 CPUs are used. If I submit
> >>> only one job with the option "-pin on", I get 52 ns/day.
> >>>
> >>>
> >>> thx a lot
> >>>
> >>>
> >>> On 08/16/2016 04:59 PM, Szilárd Páll wrote:
> >>>> Hi,
> >>>>
> >>>> Without logs and hardware configs, it's hard to tell what's happening.
> >>>>
> >>>> By turning pinning off, the OS is free to move threads around and it
> >>>> will try to ensure the cores are utilized. However, by leaving threads
> >>>> un-pinned you risk taking a significant performance hit, so I'd
> >>>> recommend that you run with correct settings.
> >>>>
> >>>> If you start with "-ntomp 8 -pin on -pinoffset 8" (and you indeed have
> >>>> 16 cores, no HT), you should be able to see in htop the first eight
> >>>> cores empty and the last eight occupied.
> >>>>
> >>>> Cheers,
> >>>> --
> >>>> Szilárd
> >>>
> >
>

