[gmx-users] how to increase GMX_OPENMP_MAX_THREADS

Wed Feb 27 14:32:14 CET 2019

The Quadro K2200 is a low-end GPU that is several generations old, and I
strongly doubt you will see any benefit from using it.

I suggest you try running
mdrun -nb gpu -ntmpi 1 -ntomp 36 -pin on
which will most likely give you the best performance you can get when
using both the high-end Intel CPUs and the GPU.
Compare that to:
mdrun -nb cpu -ntmpi 1 -ntomp 36 -pin on
mdrun -nb cpu -ntmpi 36 -ntomp 2 -pin on [with or without -npme 0]
mdrun -nb cpu -ntmpi 18 -ntomp 4 -pin on [with or without -npme 0]
I suspect one of the latter two will be fastest.
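
A minimal sketch of how the comparison above could be scripted, in case it
helps. The file name md_0_30.tpr is taken from your command; the 10000-step
benchmark length, the bench_* output names and the helper functions are my
own assumptions for illustration. By default it only prints the commands;
set RUN=1 to actually launch them.

```shell
#!/bin/sh
# Dry-run benchmark helper: prints one short mdrun command per launch layout.
# Assumptions (not from the thread): TPR name, step count, bench_* file names.
TPR=${TPR:-md_0_30.tpr}
NSTEPS=${NSTEPS:-10000}

# Layouts as "nb:ntmpi:ntomp", copied from the four commands suggested above.
LAYOUTS="gpu:1:36 cpu:1:36 cpu:36:2 cpu:18:4"

# Total number of threads started: ranks times OpenMP threads per rank.
total_threads() {
    echo $(($1 * $2))
}

# $1 = nonbonded target, $2 = thread-MPI ranks, $3 = OpenMP threads per rank
bench_cmd() {
    echo "gmx mdrun -s $TPR -deffnm bench_${1}_${2}x${3} -nb $1 -ntmpi $2 -ntomp $3 -pin on -nsteps $NSTEPS -resethway -noconfout"
}

for layout in $LAYOUTS; do
    nb=${layout%%:*}
    rest=${layout#*:}
    ntmpi=${rest%%:*}
    ntomp=${rest#*:}
    echo "# -nb $nb: $ntmpi rank(s) x $ntomp threads = $(total_threads "$ntmpi" "$ntomp") threads"
    cmd=$(bench_cmd "$nb" "$ntmpi" "$ntomp")
    echo "$cmd"
    # Only launch when explicitly asked to; the default is a dry run.
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

Each finished run writes a "Performance:" line (ns/day) near the end of its
log, so something like grep "Performance:" bench_*.log should be enough to
compare the layouts. The -npme 0 variants can be tested by appending that
flag to the two multi-rank commands.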

You can also try using no domain decomposition and 72 threads per rank.
Your build currently limits a rank to 64 OpenMP threads (see
GMX_OPENMP_MAX_THREADS in your log header), so this requires recompiling
GROMACS with the
cmake . -DGMX_OPENMP_MAX_THREADS=128
option set. It could end up being faster than the first suggested CPU run,
but it is not likely to be faster (by a relevant amount, if at all) than
the latter two.
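
As a rough sketch of that check, assuming the log header line has the form
shown in your log: the helper names below are hypothetical, and I am assuming
the limit is inclusive (exactly 64 threads would still be accepted). The
-nb cpu in the final command is my reading of "no domain decomposition and
72 threads per rank" as a CPU-only run.

```shell
#!/bin/sh
# Extract the compiled-in OpenMP thread limit from a GROMACS log header line
# such as: "OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)".
max_omp_threads() {
    sed -n 's/.*GMX_OPENMP_MAX_THREADS = \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds when the requested threads per rank ($1) exceed the limit ($2).
# Assumption: the limit is inclusive, so exactly $2 threads is still allowed.
needs_rebuild() {
    [ "$1" -gt "$2" ]
}

# Header line copied from the quoted log; in practice, feed the log file instead.
limit=$(echo "OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)" | max_omp_threads)
want=72

if needs_rebuild "$want" "$limit"; then
    echo "$want threads per rank exceeds the limit of $limit; reconfigure with:"
    echo "  cmake . -DGMX_OPENMP_MAX_THREADS=128"
    echo "then rebuild and reinstall, and run:"
    echo "  gmx mdrun -nb cpu -ntmpi 1 -ntomp $want -pin on"
else
    echo "$want threads per rank fits within the limit of $limit"
fi
```

To check a real log, the same function can read it directly, e.g.
max_omp_threads < md_0_30.log.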

If the above technicalities are not clear and you would like to understand
them better, I recommend reading (again?) the relevant parts of the user
guide.

Cheers,
--
Szilárd

On Wed, Feb 27, 2019 at 12:27 PM Lalehan Ozalp <lalehan.ozalp at gmail.com>
wrote:

> Dear Szilárd,
> There is indeed one GPU. And please keep in mind I used to exploit the -nt
> 72 option BEFORE the 2019-dev version. It looks like it employs GPU by
> default and I don't know how to efficiently use it, apparently.  Here is
> the info you asked for:
> System size: 130655 atoms
>
> .mdp file:
> ; Run parameters
> integrator              = md        ; leap-frog integrator
> nsteps                  = 15000000   ; 0.002 * 15000000 = 30000 ps (30 ns)
> dt                      = 0.002     ; 2 fs
> ; Output control
> nstenergy               = 5000      ; save energies every 10.0 ps
> nstlog                  = 5000      ; update log file every 10.0 ps
> nstxout-compressed      = 5000      ; save coordinates every 10.0 ps
> ; Bond parameters
> continuation            = yes       ; continuing from NPT
> constraint_algorithm    = lincs     ; holonomic constraints
> constraints             = h-bonds   ; bonds to H are constrained
> lincs_iter              = 1         ; accuracy of LINCS
> lincs_order             = 4         ; also related to accuracy
> ; Neighbor searching and vdW
> cutoff-scheme           = Verlet
> ns_type                 = grid      ; search neighboring grid cells
> nstlist                 = 20        ; largely irrelevant with Verlet
> rlist                   = 1.2
> vdwtype                 = cutoff
> vdw-modifier            = force-switch
> rvdw-switch             = 1.0
> rvdw                    = 1.2       ; short-range van der Waals cutoff (in nm)
> ; Electrostatics
> coulombtype             = PME       ; Particle Mesh Ewald for long-range electrostatics
> rcoulomb                = 1.2
> pme_order               = 4         ; cubic interpolation
> fourierspacing          = 0.16      ; grid spacing for FFT
> ; Temperature coupling
> tcoupl                  = V-rescale                     ; modified Berendsen thermostat
> tc-grps                 = Protein_nap_16 Water_and_ions    ; two coupling groups - more accurate
> tau_t                   = 0.1   0.1                     ; time constant, in ps
> ref_t                   = 300   300                     ; reference temperature, one for each group, in K
> ; Pressure coupling
> pcoupl                  = Parrinello-Rahman             ; pressure coupling is on for NPT
> pcoupltype              = isotropic                     ; uniform scaling of box vectors
>
>
>
> my command:
> gmx mdrun -deffnm md_0_30 -ntmpi 4 -ntomp 18 -npme 1 -pme gpu -nb gpu
>
>
>
> and what the program prints in the log file once I run it:
>
> GROMACS version:    2019-dev
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        CUDA
> SIMD instructions:  NONE
> FFT library:        fftw-3.3.8
> RDTSCP usage:       disabled
> TNG support:        enabled
> Hwloc support:      disabled
> Tracing support:    disabled
> Built on:           2019-01-22 13:53:24
> Build CPU vendor:   Unknown
> Build CPU brand:    Unknown
> Build CPU family:   0   Model: 0   Stepping: 0
> Build CPU features: Unknown
> C compiler:         /usr/local/bin/gcc GNU 5.3.0
> C++ compiler flags:     -std=c++11  -Wundef -Wextra
> -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations
> -Wall  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
> -Wno-array-bounds
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
>
> Running on 1 node with total 36 cores, 72 logical cores, 1 compatible GPU
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>     Family: 6   Model: 63   Stepping: 2
>
>  GPU info:
>     Number of GPUs detected: 1
>     #0: NVIDIA Quadro K2200, compute cap.: 5.0, ECC:  no, stat: compatible
>
> Highest SIMD level requested by all nodes in run: AVX2_256
> SIMD instructions selected at compile time:       None
> This program was compiled for different hardware than you are running on,
> which could influence performance.
> The current CPU can measure timings more accurately than the code in
> gmx mdrun was configured to use. This might affect your simulation
> speed as accurate timings are needed for load-balancing.
>
>
>
> Hardware:
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                72
> On-line CPU(s) list:   0-71
> Thread(s) per core:    2
> Core(s) per socket:    18
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 63
> Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> Stepping:              2
> CPU MHz:               1200.000
> BogoMIPS:              4589.66
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              46080K
> NUMA node0 CPU(s):     0-17,36-53
> NUMA node1 CPU(s):     18-35,54-71
>
>
>
> GPU:
>
> 03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL
> [Quadro K2200] [10de:13ba] (rev a2) (prog-if 00 [VGA controller])
>         Subsystem: NVIDIA Corporation Device [10de:1097]
>         Physical Slot: 2
>         Flags: bus master, fast devsel, latency 0, IRQ 232
>         Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at c0000000 (64-bit, prefetchable) [size=256M]
>         Memory at d0000000 (64-bit, prefetchable) [size=32M]
>         I/O ports at 4000 [size=128]
>         [virtual] Expansion ROM at d3000000 [disabled] [size=512K]
>         Capabilities: <access denied>
>         Kernel driver in use: nvidia
>         Kernel modules: nvidia-drm, nvidia, nouveau, nvidiafb
>
>
> I hope I didn't flood you with too much information.
> Thank you very much for your interest.
> Best,
>
> Lalehan