[gmx-users] Excessive and gradually increasing memory usage with OpenCL

Albert Mao albert.mao at gmail.com
Tue Mar 27 16:09:58 CEST 2018


Hello!

I'm trying to run molecular dynamics on a fairly large system
containing approximately 250000 atoms. The simulation runs well for
about 100000 steps and then gets killed by the queueing engine due to
exceeding the swap space usage limit. The compute node I'm using has
12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
GROMACS 2018 and allowing mdrun to delegate the workload
automatically, resulting in three thread-MPI ranks each with one GPU
and four OpenMP threads. The queueing engine reports the following
usage:

TERM_SWAP: job killed after reaching LSF swap usage limit.
Exited with exit code 131.
Resource usage summary:
    CPU time   :  50123.00 sec.
    Max Memory :      4671 MB
    Max Swap   :     30020 MB
    Max Processes  :         5
    Max Threads    :        35
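
To see how the usage grows over the course of the run, rather than just the peak numbers above, something like the following loop could log the mdrun process's memory once a minute (a rough sketch only; matching the process by the executable name "gmx" is an assumption, though it matches my command line below):

  # Rough sketch: record the resident set and virtual size of the newest
  # process whose name matches "gmx", once per minute
  while sleep 60; do
      echo "$(date +%s) $(ps -o rss=,vsz= -p "$(pgrep -n gmx)")" >> mem_usage.log
  done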

Even though this is a large system, by my rough estimate the simulation
should not need much more than 0.5 GB of memory; 4.6 GB seems like too
much, and 30 GB is completely ridiculous.
Indeed, running the same system on a similar node without GPUs works
well (though slowly), consuming about 0.65 GB of memory and 2 GB of swap.
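
For reference, my rough estimate goes like this, where the ~2 kB of per-atom state is only a guess meant to cover coordinates, velocities, forces, neighbour lists, and so on:

  # ~250000 atoms at roughly 2 kB of state per atom comes to about 0.5 GB
  echo $((250000 * 2000)) bytes
  # prints: 500000000 bytes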

I also don't understand why 35 threads got created.

Could there be a memory leak somewhere in the OpenCL code path? Any
suggestions for preventing this gradual memory growth would be greatly
appreciated.
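
For what it's worth, I could also try narrowing this down on the GPU node itself, either by keeping the nonbonded work on the CPU or by spelling out the rank/thread layout that mdrun chose automatically; a sketch, using the same file names as the command line quoted below:

  # Keep the GPUs idle so only the CPU code path runs:
  gmx mdrun -v -nb cpu -s blah.tpr -deffnm blah -cpi blah.cpt

  # Or request the auto-selected layout explicitly:
  gmx mdrun -v -ntmpi 3 -ntomp 4 -gpu_id 012 -s blah.tpr -deffnm blah -cpi blah.cpt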

I've included the relevant mdrun output, with system and configuration
information, at the end of this message. I'm using OpenCL despite
having NVIDIA GPUs because of a separate, unfortunate problem: building
with CUDA support fails because the CUDA toolkit considers the C
compiler "too new".
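
My understanding is that the CUDA build can be pointed at an older host compiler at configure time, along the lines of the sketch below, where the path to the older gcc is purely hypothetical:

  # The CUDA_HOST_COMPILER path is hypothetical; it would have to point
  # at a gcc release old enough for the CUDA 8.0 toolkit
  cmake .. -DGMX_GPU=on \
           -DCUDA_TOOLKIT_ROOT_DIR=/apps/lib-osver/cuda/8.0.61 \
           -DCUDA_HOST_COMPILER=/path/to/an/older/gcc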

Thanks!
-Albert Mao

GROMACS:      gmx mdrun, version 2018
Executable:   /data/albertmaolab/software/gromacs/bin/gmx
Data prefix:  /data/albertmaolab/software/gromacs
Command line:

  gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt

GROMACS version:    2018
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        OpenCL
SIMD instructions:  SSE4.1
FFT library:        fftw-3.2.1
RDTSCP usage:       disabled
TNG support:        enabled
Hwloc support:      hwloc-1.5.0
Tracing support:    disabled
Built on:           2018-02-22 07:25:43
Built by:           ahm17 at eris1pm01.research.partners.org [CMAKE]
Build OS/arch:      Linux 2.6.32-431.29.2.el6.x86_64 x86_64
Build CPU vendor:   Intel
Build CPU brand:    Common KVM processor
Build CPU family:   15   Model: 6   Stepping: 1
Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2 ssse3
C compiler:         /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
C compiler flags:    -msse4.1     -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
C++ compiler flags:  -msse4.1    -std=c++11   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
OpenCL library:     /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
OpenCL version:     1.2

Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
    Family: 6   Model: 44   Stepping: 2
    Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
  Hardware topology: Full, with devices
    Sockets, cores, and logical processors:
      Socket  0: [   0] [   2] [   4] [   6] [   8] [  10]
      Socket  1: [   1] [   3] [   5] [   7] [   9] [  11]
    Numa nodes:
      Node  0 (25759080448 bytes mem):   0   2   4   6   8  10
      Node  1 (25769799680 bytes mem):   1   3   5   7   9  11
      Latency:
               0     1
         0  1.00  2.00
         1  2.00  1.00
    Caches:
      L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
    PCI devices:
      0000:04:00.0  Id: 8086:10c9  Class: 0x0200  Numa: -1
      0000:04:00.1  Id: 8086:10c9  Class: 0x0200  Numa: -1
      0000:05:00.0  Id: 15b3:6746  Class: 0x0280  Numa: -1
      0000:06:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
      0000:01:03.0  Id: 1002:515e  Class: 0x0300  Numa: -1
      0000:00:1f.2  Id: 8086:3a20  Class: 0x0101  Numa: -1
      0000:00:1f.5  Id: 8086:3a26  Class: 0x0101  Numa: -1
      0000:14:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
      0000:11:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
  GPU info:
    Number of GPUs detected: 3
    #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version: OpenCL 1.1 CUDA, stat: compatible
    #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version: OpenCL 1.1 CUDA, stat: compatible
    #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version: OpenCL 1.1 CUDA, stat: compatible

(later)

Using 3 MPI threads
Using 4 OpenMP threads per tMPI thread
On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
  PP:0,PP:1,PP:2
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

