[gmx-users] Performance, gpu

Fri Aug 23 20:59:48 CEST 2019

Dear GROMACS users,
Using the machine configuration and command shown below, I am simulating a
system of 479K atoms (mainly water) on CPU and GPU. The performance is
around 1 ns per hour.
Based on this information and the log file linked below, I would
appreciate any comments on how to change the submission command to improve
performance by making better use of the GPUs and CPUs.
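
For comparison with the ns/day figure that mdrun prints at the end of its
log, the rate quoted above can be converted with the small sketch below. The
variable names are illustrative, and the input is only the approximate
1 ns/hour stated above, not a value read from the log.

```shell
#!/bin/sh
# Approximate throughput quoted in this message (assumption: about 1 ns/hour).
ns_per_hour=1

# mdrun reports performance in ns/day, so scale by 24 hours.
ns_per_day=$(awk -v r="$ns_per_hour" 'BEGIN { printf "%.1f", r * 24 }')

echo "approx. performance: $ns_per_day ns/day"
```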

%------------------------------------------------
#PBS -l select=4:ncpus=22:mpiprocs=22:ngpus=1
export OMP_NUM_THREADS=4

aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes \
    -gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes

Running on 4 nodes with total 88 cores, 176 logical cores, 4 compatible GPUs
  Cores per node:           22
  Logical cores per node:   44
  Compatible GPUs per node:  1
  All nodes have identical type(s) of GPUs
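
As a sanity check, the thread counts implied by the command line can be
tallied against the hardware reported above. This is only a sketch: it
assumes every rank starts exactly the number of OpenMP threads given by
-ntomp or -ntomp_pme, and that PP ranks are spread evenly over the GPUs. The
variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the aprun line and the hardware report above.
nodes=4
logical_cores_per_node=44
gpus_per_node=1
ranks=88        # aprun -n 88
pme_ranks=44    # -npme 44
ntomp=4         # -ntomp 4     (OpenMP threads per PP rank)
ntomp_pme=6     # -ntomp_pme 6 (OpenMP threads per PME rank)

# Ranks not dedicated to PME do the particle-particle (PP) work.
pp_ranks=$((ranks - pme_ranks))

# Total OpenMP threads requested versus hardware threads available.
threads_requested=$((pp_ranks * ntomp + pme_ranks * ntomp_pme))
threads_available=$((nodes * logical_cores_per_node))

# Assumption: PP ranks are distributed evenly over all GPUs.
pp_ranks_per_gpu=$((pp_ranks / (nodes * gpus_per_node)))

echo "PP ranks:           $pp_ranks"
echo "threads requested:  $threads_requested"
echo "threads available:  $threads_available"
echo "PP ranks per GPU:   $pp_ranks_per_gpu"
```

Under these assumptions the command asks for more threads than the 176
logical cores provide, and several PP ranks share each GPU.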

%------------------------------------------------
GROMACS version:    2018.1
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        commercial-fftw-3.3.6-pl1-fma-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.0
Tracing support:    disabled
Built on:           2018-09-12 20:34:33
Built by:           xxxx
Build OS/arch:      Linux 3.12.61-52.111-default x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt
intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /opt/cray/pe/craype/2.5.13/bin/cc GNU 5.3.0
C compiler flags:    -march=core-avx2     -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler:       /opt/cray/pe/craype/2.5.13/bin/CC GNU 5.3.0
C++ compiler flags:  -march=core-avx2    -std=c++11   -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /opt/nvidia/cudatoolkit8.0/8.0.61_2.3.13_g32c34f9-2.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
CUDA compiler flags:-gencode;arch=compute_60,code=sm_60;-use_fast_math;-Wno-deprecated-gpu-targets;;; ;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        9.20
CUDA runtime:       8.0
%-------------------------------------------------
Log file:
https://drive.google.com/open?id=1-myQ5rP85UWKb1262QDPa6kYhuzHPzLu

Thank you,
Alex