[gmx-users] Excessive and gradually increasing memory usage with OpenCL
Albert Mao
albert.mao at gmail.com
Tue Mar 27 16:09:58 CEST 2018
Hello!
I'm trying to run molecular dynamics on a fairly large system
containing approximately 250000 atoms. The simulation runs well for
about 100000 steps and then gets killed by the queueing engine due to
exceeding the swap space usage limit. The compute node I'm using has
12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
GROMACS 2018 and letting mdrun partition the workload
automatically, which results in three thread-MPI ranks, each with one
GPU and four OpenMP threads (an explicit command line reproducing this
layout is sketched just after the usage report below). The queueing
engine reports the following usage:
TERM_SWAP: job killed after reaching LSF swap usage limit.
Exited with exit code 131.
Resource usage summary:
CPU time : 50123.00 sec.
Max Memory : 4671 MB
Max Swap : 30020 MB
Max Processes : 5
Max Threads : 35
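For reference, I believe the layout mdrun auto-selected corresponds
roughly to the following explicit launch (-ntmpi, -ntomp, and -gpu_id
are the standard mdrun options; the file names are just mine), which is
how I plan to pin the configuration down while experimenting:

  gmx mdrun -v -s blah.tpr -deffnm blah -cpi blah.cpt \
            -ntmpi 3 -ntomp 4 -gpu_id 012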
Even though it's a large system, my rough estimate (sketched below) is
that the simulation should not need much more than 0.5 GB of memory;
4.6 GB seems like too much, and 30 GB of swap is completely ridiculous.
Indeed, the same system runs fine (though slowly) on a similar node
without GPUs, consuming about 0.65 GB of memory and about 2 GB of swap.
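For what it's worth, my back-of-the-envelope estimate went roughly like
this (the per-atom figure is only my guess at the combined cost of
coordinates, velocities, forces, and neighbor-search data in single
precision):

  250000 atoms * ~1 kB/atom              =  ~250 MB
  + PME grids, DD buffers, and topology  => well under 0.5 GB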
I also don't understand why 35 threads were created, when three
thread-MPI ranks with four OpenMP threads each should only account for 12.
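In case it matters, I count threads with the usual Linux tools, one
line per thread of any process whose command line matches mdrun:

  ps -eLf | grep -c '[g]mx mdrun'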
Could there be a memory leak somewhere in the OpenCL code path? Any
suggestions on preventing this gradual memory growth would be greatly
appreciated.
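In the meantime, the test I plan to run is a short segment with the
nonbonded work forced back onto the CPU, to see whether the growth goes
away when the OpenCL path is unused (-nb and -nsteps are standard mdrun
options; 5000 steps is just an arbitrary short length):

  gmx mdrun -v -s blah.tpr -deffnm blah -cpi blah.cpt -nb cpu -nsteps 5000

If memory stays flat there but climbs again with -nb gpu, that would
point at the OpenCL nonbonded kernels or their host-side buffers.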
I've included the relevant mdrun output, with system and configuration
information, at the end of this message. I'm using OpenCL despite
having NVIDIA GPUs because of a sad problem where building with CUDA
support fails: the CUDA toolkit rejects my C compiler (GCC 7.3 against
CUDA 8.0) as "too new". The workaround I mean to try is sketched below.
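If anyone knows whether pointing the CUDA build at an older host
compiler works cleanly with GROMACS 2018, I'd happily switch back to
CUDA. The configuration I was going to try looks roughly like this,
where /path/to/gcc-5 stands for a hypothetical older GCC I would still
have to install:

  cmake .. -DGMX_GPU=on -DGMX_USE_OPENCL=off \
           -DCUDA_TOOLKIT_ROOT_DIR=/apps/lib-osver/cuda/8.0.61 \
           -DCUDA_HOST_COMPILER=/path/to/gcc-5/bin/gcc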
Thanks!
-Albert Mao
GROMACS: gmx mdrun, version 2018
Executable: /data/albertmaolab/software/gromacs/bin/gmx
Data prefix: /data/albertmaolab/software/gromacs
Command line:
gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
GROMACS version: 2018
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: OpenCL
SIMD instructions: SSE4.1
FFT library: fftw-3.2.1
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: hwloc-1.5.0
Tracing support: disabled
Built on: 2018-02-22 07:25:43
Built by: ahm17 at eris1pm01.research.partners.org [CMAKE]
Build OS/arch: Linux 2.6.32-431.29.2.el6.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Common KVM processor
Build CPU family: 15 Model: 6 Stepping: 1
Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
ssse3
C compiler: /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
OpenCL library: /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
OpenCL version: 1.2
Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
Family: 6 Model: 44 Stepping: 2
Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
sse4.1 sse4.2 ssse3
Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 2] [ 4] [ 6] [ 8] [ 10]
Socket 1: [ 1] [ 3] [ 5] [ 7] [ 9] [ 11]
Numa nodes:
Node 0 (25759080448 bytes mem): 0 2 4 6 8 10
Node 1 (25769799680 bytes mem): 1 3 5 7 9 11
Latency:
0 1
0 1.00 2.00
1 2.00 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
PCI devices:
0000:04:00.0 Id: 8086:10c9 Class: 0x0200 Numa: -1
0000:04:00.1 Id: 8086:10c9 Class: 0x0200 Numa: -1
0000:05:00.0 Id: 15b3:6746 Class: 0x0280 Numa: -1
0000:06:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
0000:01:03.0 Id: 1002:515e Class: 0x0300 Numa: -1
0000:00:1f.2 Id: 8086:3a20 Class: 0x0101 Numa: -1
0000:00:1f.5 Id: 8086:3a26 Class: 0x0101 Numa: -1
0000:14:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
0000:11:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
GPU info:
Number of GPUs detected: 3
#0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
OpenCL 1.1 CUDA, stat: compatible
#1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
OpenCL 1.1 CUDA, stat: compatible
#2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
OpenCL 1.1 CUDA, stat: compatible
(later in the same log:)
Using 3 MPI threads
Using 4 OpenMP threads per tMPI thread
On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
PP:0,PP:1,PP:2
Pinning threads with an auto-selected logical core stride of 1
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.