[gmx-users] CPU running doesn't match command line

Thu Aug 18 07:56:40 CEST 2016

Does anybody have more suggestions?

Thanks a lot.

On 08/17/2016 09:07 AM, Albert wrote:
> Hello:
>
> Here is the information that you asked for.
>
>   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 
> -gpu_id 0 -pin on
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
>
> GROMACS:      gmx mdrun, VERSION 5.1.3
> Executable:   /soft/gromacs/5.1.3_intel/bin/gmx_mpi
> Data prefix:  /soft/gromacs/5.1.3_intel
> Command line:
>   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 
> -gpu_id 0 -pin on
>
> GROMACS version:    VERSION 5.1.3
> Precision:          single
> Memory model:       64 bit
> MPI library:        MPI
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:        enabled
> OpenCL support:     disabled
> invsqrt routine:    gmx_software_invsqrt(x)
> SIMD instructions:  AVX_256
> FFT library:        fftw-3.3.4-sse2
> RDTSCP usage:       enabled
> C++11 compilation:  disabled
> TNG support:        enabled
> Tracing support:    disabled
> Built on:           Thu Aug 11 16:15:26 CEST 2016
> Built by:           albert at cudaB [CMAKE]
> Build OS/arch:      Linux 3.16.7-35-desktop x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
> Build CPU family:   6   Model: 62   Stepping: 4
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm 
> mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp 
> sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /soft/intel/impi/5.1.3.223/bin64/mpicc GNU 4.8.3
> C compiler flags:    -mavx    -Wextra -Wno-missing-field-initializers 
> -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value 
> -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops 
> -fexcess-precision=fast -Wno-array-bounds
> C++ compiler:       /soft/intel/impi/5.1.3.223/bin64/mpicxx GNU 4.8.3
> C++ compiler flags:  -mavx    -Wextra -Wno-missing-field-initializers 
> -Wpointer-arith -Wall -Wno-unused-function  -O3 -DNDEBUG 
> -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
> Boost version:      1.54.0 (external)
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda 
> compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on 
> Wed_May__4_21:01:56_CDT_2016;Cuda compilation tools, release 8.0, V8.0.26
> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;
> -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
> -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
> -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
> -gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;
> -gencode;arch=compute_61,code=compute_61;-use_fast_math;; ;-mavx;-Wextra;
> -Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;
> -O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
>
> CUDA driver:        8.0
> CUDA runtime:       8.0
>
> Running on 1 node with total 10 cores, 20 logical cores, 2 compatible 
> GPUs
> Hardware detected on host cudaB (the node of MPI rank 0):
>   CPU info:
>     Vendor: GenuineIntel
>     Brand:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
>     Family:  6  model: 62  stepping:  4
>     CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm 
> mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp 
> sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>     SIMD instructions most likely to fit this hardware: AVX_256
>     SIMD instructions selected at GROMACS compile time: AVX_256
>   GPU info:
>     Number of GPUs detected: 2
>     #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat: 
> compatible
>     #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat: 
> compatible
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
>
>   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 
> -gpu_id 1 -pin on -cpi -append -pinoffset 8
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
>
> GROMACS:      gmx mdrun, VERSION 5.1.3
> Executable:   /soft/gromacs/5.1.3_intel/bin/gmx_mpi
> Data prefix:  /soft/gromacs/5.1.3_intel
> Command line:
>   gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 
> -gpu_id 1 -pin on -cpi -append -pinoffset 8
>
>
> Running on 1 node with total 10 cores, 20 logical cores, 2 compatible 
> GPUs
> Hardware detected on host cudaB (the node of MPI rank 0):
>   CPU info:
>     Vendor: GenuineIntel
>     Brand:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
>     SIMD instructions most likely to fit this hardware: AVX_256
>     SIMD instructions selected at GROMACS compile time: AVX_256
>   GPU info:
>     Number of GPUs detected: 2
>     #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat: 
> compatible
>     #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC:  no, stat: 
> compatible
>
> Reading file 7.tpr, VERSION 5.1.3 (single precision)
>
> Reading checkpoint file state.cpt generated: Wed Aug 17 09:01:46 2016
>
>
> Using 1 MPI process
> Using 8 OpenMP threads
>
> 1 GPU user-selected for this run.
> Mapping of GPU ID to the 1 PP rank in this node: 1
>
> Applying core pinning offset 8
> starting mdrun 'Title'
> 50000000 steps, 100000.0 ps (continuing from step 5746000, 11492.0 ps).
> step 5746080: timed with pme grid 60 60 84, coulomb cutoff 1.000: 
> 2451.9 M-cycles
>
> On 08/16/2016 05:27 PM, Szilárd Páll wrote:
>> Most of that copy-pasted info is not what I asked for and overall not
>> very useful. You have still not shown any log files (or details on the
>> hardware). Share the *relevant* stuff, please!
>> -- 
>> Szilárd
>>
>>
>> On Tue, Aug 16, 2016 at 5:07 PM, Albert <mailmd2011 at gmail.com> wrote:
>>> Hello:
>>>
>>> Here is my MDP file:
>>>
>>> define                  = -DREST_ON -DSTEP6_4
>>> integrator              = md
>>> dt                      = 0.002
>>> nsteps                  = 1000000
>>> nstlog                  = 1000
>>> nstxout                 = 0
>>> nstvout                 = 0
>>> nstfout                 = 0
>>> nstcalcenergy           = 100
>>> nstenergy               = 1000
>>> nstxout-compressed      = 10000
>>> ;
>>> cutoff-scheme           = Verlet
>>> nstlist                 = 20
>>> rlist                   = 1.0
>>> coulombtype             = pme
>>> rcoulomb                = 1.0
>>> vdwtype                 = Cut-off
>>> vdw-modifier            = Force-switch
>>> rvdw_switch             = 0.9
>>> rvdw                    = 1.0
>>> ;
>>> tcoupl                  = berendsen
>>> tc_grps                 = PROT   MEMB   SOL_ION
>>> tau_t                   = 1.0    1.0    1.0
>>> ref_t                   = 310   310   310
>>> ;
>>> pcoupl                  = berendsen
>>> pcoupltype              = semiisotropic
>>> tau_p                   = 5.0
>>> compressibility         = 4.5e-5  4.5e-5
>>> ref_p                   = 1.0     1.0
>>> ;
>>> constraints             = h-bonds
>>> constraint_algorithm    = LINCS
>>> continuation            = yes
>>> ;
>>> nstcomm                 = 100
>>> comm_mode               = linear
>>> comm_grps               = PROT   MEMB   SOL_ION
>>> ;
>>> refcoord_scaling        = com
>>>
>>>
>>> I compiled Gromacs with the following settings, using Intel MPI:
>>>
>>> env CC=mpicc CXX=mpicxx F77=mpif90 FC=mpif90 LDF90=mpif90 \
>>> CMAKE_PREFIX_PATH=/soft/gromacs/fftw-3.3.4:/soft/intel/impi/5.1.3.223 \
>>> cmake .. -DBUILD_SHARED_LIB=OFF -DBUILD_TESTING=OFF \
>>> -DCMAKE_INSTALL_PREFIX=/soft/gromacs/5.1.3_intel -DGMX_MPI=ON \
>>> -DGMX_GPU=ON -DGMX_PREFER_STATIC_LIBS=ON \
>>> -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
>>>
>>>
>>> I tried again, running one of the jobs with these options:
>>>
>>> -ntomp 8 -pin on -pinoffset 8
>>>
>>>
>>> The two submitted jobs can still only use 8 CPUs, and the speed is
>>> extremely slow (10 ns/day). When I remove the option "-pin on" from one
>>> of the jobs, it speeds up a lot (32 ns/day) and 16 CPUs are used. If I
>>> submit only one job with the option "-pin on", I obtain 52 ns/day.
>>>
>>>
>>> Thanks a lot.
>>>
>>>
>>> On 08/16/2016 04:59 PM, Szilárd Páll wrote:
>>>> Hi,
>>>>
>>>> Without logs and hardware configs, it's hard to tell what's happening.
>>>>
>>>> By turning off pinning the OS is free to move threads around and it
>>>> will try to ensure cores are utilized. However, by leaving threads
>>>> unpinned you risk taking a significant performance hit. So I'd
>>>> recommend that you run with correct settings.
>>>>
>>>> If you start with "-ntomp 8 -pin on -pinoffset 8" (and you indeed have
>>>> 16 cores, no HT), you should be able to see in htop the first eight
>>>> cores empty while the last eight are occupied.
>>>>
>>>> Cheers,
>>>> -- 
>>>> Szilárd
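
The pinning advice above can be checked on paper before launching anything.
The sketch below is a deliberately simplified model, not GROMACS's actual
pinning code: it assumes thread i of a job lands on logical core
pinoffset + i * pinstride. The function names are hypothetical, and the
stride values are assumptions for illustration (the log in this thread does
not show which stride mdrun chose). Only the figure of 20 logical cores is
taken from the log.

```python
def pinned_cores(ntomp, pinoffset, pinstride):
    """Logical cores used by one job under the simplified model:
    thread i is pinned to pinoffset + i * pinstride."""
    return [pinoffset + i * pinstride for i in range(ntomp)]


def check_jobs(jobs, n_logical):
    """Given jobs as (ntomp, pinoffset, pinstride) tuples, return
    (cores claimed by more than one thread, cores beyond the machine)."""
    seen = set()
    overlap = set()
    out_of_range = set()
    for ntomp, offset, stride in jobs:
        for core in pinned_cores(ntomp, offset, stride):
            if core >= n_logical:
                out_of_range.add(core)
            if core in seen:
                overlap.add(core)
            seen.add(core)
    return sorted(overlap), sorted(out_of_range)


N_LOGICAL = 20  # from the log: 10 cores, 20 logical cores

# Assumed stride 1: the two 8-thread jobs occupy 0-7 and 8-15.
print(check_jobs([(8, 0, 1), (8, 8, 1)], N_LOGICAL))

# Assumed stride 2: the second job collides with the first and
# also runs past the last logical core.
print(check_jobs([(8, 0, 2), (8, 8, 2)], N_LOGICAL))
```

Under this model, an offset of 8 only separates two 8-thread jobs cleanly
when the stride is 1. If the stride were 2, the same offset would make the
jobs share logical cores, which is one possible (unconfirmed) reading of the
slowdown reported in this thread.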