[gmx-users] how to increase GMX_OPENMP_MAX_THREADS

Wed Feb 27 12:26:39 CET 2019

Dear Szilárd,
There is indeed one GPU. Please keep in mind that I used to run with the
-nt 72 option BEFORE the 2019-dev version. The new version seems to use the
GPU by default, and apparently I don't know how to use it efficiently. Here
is the info you asked for:
System size: 130655 atoms

.mdp file:
; Run parameters
integrator              = md        ; leap-frog integrator
nsteps                  = 15000000   ; 0.002 * 15000000 = 30000 ps (30 ns)
dt                      = 0.002     ; 2 fs
; Output control
nstenergy               = 5000      ; save energies every 10.0 ps
nstlog                  = 5000      ; update log file every 10.0 ps
nstxout-compressed      = 5000      ; save coordinates every 10.0 ps
; Bond parameters
continuation            = yes       ; continuing from NPT
constraint_algorithm    = lincs     ; holonomic constraints
constraints             = h-bonds   ; bonds to H are constrained
lincs_iter              = 1         ; accuracy of LINCS
lincs_order             = 4         ; also related to accuracy
; Neighbor searching and vdW
cutoff-scheme           = Verlet
ns_type                 = grid      ; search neighboring grid cells
nstlist                 = 20        ; largely irrelevant with Verlet
rlist                   = 1.2
vdwtype                 = cutoff
vdw-modifier            = force-switch
rvdw-switch             = 1.0
rvdw                    = 1.2       ; short-range van der Waals cutoff (in nm)
; Electrostatics
coulombtype             = PME       ; Particle Mesh Ewald for long-range electrostatics
rcoulomb                = 1.2
pme_order               = 4         ; cubic interpolation
fourierspacing          = 0.16      ; grid spacing for FFT
; Temperature coupling
tcoupl                  = V-rescale                     ; modified Berendsen thermostat
tc-grps                 = Protein_nap_16 Water_and_ions    ; two coupling groups - more accurate
tau_t                   = 0.1   0.1                     ; time constant, in ps
ref_t                   = 300   300                     ; reference temperature, one for each group, in K
; Pressure coupling
pcoupl                  = Parrinello-Rahman             ; pressure coupling is on for NPT
pcoupltype              = isotropic                     ; uniform scaling of box vectors
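
As a quick sanity check on the timing comments in the .mdp file above, the following sketch recomputes the total simulated time and the output interval from dt, nsteps and the nst* values. The helper names are made up for illustration and are not part of GROMACS.

```python
# Values copied from the .mdp file above; helper names are hypothetical.
DT_PS = 0.002          # dt, in ps (2 fs)
NSTEPS = 15_000_000    # nsteps
NSTOUT = 5000          # nstenergy / nstlog / nstxout-compressed


def total_time_ns(dt_ps: float, nsteps: int) -> float:
    """Total simulated time in ns: dt * nsteps, converted from ps."""
    return dt_ps * nsteps / 1000.0


def output_interval_ps(dt_ps: float, nstout: int) -> float:
    """Time between two consecutive outputs, in ps."""
    return dt_ps * nstout


if __name__ == "__main__":
    print(f"total time:      {total_time_ns(DT_PS, NSTEPS):.1f} ns")
    print(f"output interval: {output_interval_ps(DT_PS, NSTOUT):.1f} ps")
```

With these inputs the run covers 30 ns and writes output every 10 ps, which matches the comments in the file.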

my command:
gmx mdrun -deffnm md_0_30 -ntmpi 4 -ntomp 18 -npme 1 -pme gpu -nb gpu

and what the program prints in the log file once I run it:

GROMACS version:    2019-dev
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  NONE
FFT library:        fftw-3.3.8
RDTSCP usage:       disabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
Built on:           2019-01-22 13:53:24
Build CPU vendor:   Unknown
Build CPU brand:    Unknown
Build CPU family:   0   Model: 0   Stepping: 0
Build CPU features: Unknown
C compiler:         /usr/local/bin/gcc GNU 5.3.0
C++ compiler flags:     -std=c++11  -Wundef -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61

Running on 1 node with total 36 cores, 72 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
    Family: 6   Model: 63   Stepping: 2

 GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Quadro K2200, compute cap.: 5.0, ECC:  no, stat: compatible

Highest SIMD level requested by all nodes in run: AVX2_256
SIMD instructions selected at compile time:       None
This program was compiled for different hardware than you are running on,
which could influence performance.
The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
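
To make the thread layout concrete, here is a small sketch that compares the -ntmpi/-ntomp values from my command with the numbers reported in the log. It assumes that GMX_OPENMP_MAX_THREADS limits the number of OpenMP threads within a single rank; the helper name check_layout is made up for illustration and is not a GROMACS function.

```python
LOGICAL_CORES = 72         # "72 logical cores" from the log
OPENMP_MAX_THREADS = 64    # GMX_OPENMP_MAX_THREADS from the log


def check_layout(ntmpi: int, ntomp: int,
                 logical_cores: int = LOGICAL_CORES,
                 omp_max: int = OPENMP_MAX_THREADS) -> dict:
    """Summarize a rank/thread layout (hypothetical helper).

    Assumption: the OpenMP limit applies to the threads of one rank.
    """
    total = ntmpi * ntomp
    return {
        "total_threads": total,
        "fits_hardware": total <= logical_cores,
        "within_omp_limit": ntomp <= omp_max,
    }


if __name__ == "__main__":
    print(check_layout(4, 18))   # the layout from the command above
    print(check_layout(1, 72))   # one rank with 72 OpenMP threads
```

Under that assumption, 4 ranks with 18 threads each use all 72 logical cores and stay below the limit of 64 threads per rank, while a single rank with 72 OpenMP threads would exceed it.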

Hardware:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               1200.000
BogoMIPS:              4589.66
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71

GPU:

03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Quadro K2200] [10de:13ba] (rev a2) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device [10de:1097]
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, IRQ 232
        Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 4000 [size=128]
        [virtual] Expansion ROM at d3000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidia-drm, nvidia, nouveau, nvidiafb

I hope I didn't flood you with too much information.
Thank you very much for your interest.
Best,

Lalehan