[gmx-users] Excessive and gradually increasing memory usage with OpenCL
Szilárd Páll
pall.szilard at gmail.com
Tue Mar 27 22:43:28 CEST 2018
Hi,
This is an issue I noticed recently, but I thought it was only
affecting some use-cases (or some runtimes). However, it seems it's a
broader problem. It is under investigation, but for now it seems that
you can eliminate it (or at least strongly diminish its effects) by
turning off GPU-side task timing. You can do that by setting the
GMX_DISABLE_GPU_TIMING environment variable.
Note that this is a workaround that may turn out not to be a complete
solution, so please report back if you've done longer runs.
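For example, in a bash job script you could export the variable just
before launching mdrun; this sketch simply reuses your existing
command line:

  # work around the issue by disabling GPU-side task timing
  export GMX_DISABLE_GPU_TIMING=1
  gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt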
Regarding the thread count: the MPI and CUDA runtimes can spawn
additional threads of their own; GROMACS itself certainly used
3 x 4 = 12 threads in your case. Note that you will likely get better
performance by using 6 ranks x 2 threads each (both because this
avoids ranks spanning sockets and because it allows GPU task/transfer
overlap).
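Something along these lines (not tested on your machine) should give
you that layout; -ntmpi and -ntomp are the standard mdrun options for
the thread-MPI rank and OpenMP thread counts:

  # 6 thread-MPI ranks with 2 OpenMP threads each; mdrun will then
  # share the 3 GPUs among the PP ranks (two ranks per GPU)
  gmx mdrun -ntmpi 6 -ntomp 2 -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt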
Cheers,
--
Szilárd
On Tue, Mar 27, 2018 at 4:09 PM, Albert Mao <albert.mao at gmail.com> wrote:
> Hello!
>
> I'm trying to run molecular dynamics on a fairly large system
> containing approximately 250000 atoms. The simulation runs well for
> about 100000 steps and then gets killed by the queueing engine due to
> exceeding the swap space usage limit. The compute node I'm using has
> 12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
> GROMACS 2018 and allowing mdrun to delegate the workload
> automatically, resulting in three thread-MPI ranks each with one GPU
> and four OpenMP threads. The queueing engine reports the following
> usage:
>
> TERM_SWAP: job killed after reaching LSF swap usage limit.
> Exited with exit code 131.
> Resource usage summary:
> CPU time : 50123.00 sec.
> Max Memory : 4671 MB
> Max Swap : 30020 MB
> Max Processes : 5
> Max Threads : 35
>
> Even though it's a large system, by my rough estimate, the simulation
> should not need much more than 0.5 gigabytes of memory; 4.6 GB seems
> like too much and 30 GB is completely ridiculous.
> Indeed, running the system on a similar node without GPUs works well
> (but slowly), consuming about 0.65 GB of memory and 2 GB of swap.
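> (Back of the envelope: 250000 atoms x 3 single-precision vectors each
> for positions, velocities, and forces is 250000 x 3 x 12 bytes, about
> 9 MB; even allowing generous room for pair lists, PME grids, and
> decomposition buffers, I'd expect a few hundred MB at most.)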
>
> I also don't understand why 35 threads got created.
>
> Could there be a memory leak somewhere in the OpenCL code? Any
> suggestions on preventing this memory usage expansion would be greatly
> appreciated.
>
> I've included relevant output from mdrun with system and configuration
> information at the end of this message. I'm using OpenCL despite
> having Nvidia GPUs because of a sad problem where building with CUDA
> support fails due to the C compiler being "too new".
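> (I suspect the usual workaround is to point nvcc at an older host
> compiler at configure time, something along the lines below; the GCC
> path is hypothetical and I haven't tried it yet.)
>
> # re-run cmake pointing nvcc at a host compiler old enough for CUDA 8.0 (GCC 5.x, I believe)
> cmake .. -DGMX_GPU=ON -DCUDA_HOST_COMPILER=/path/to/gcc-5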
>
> Thanks!
> -Albert Mao
>
> GROMACS: gmx mdrun, version 2018
> Executable: /data/albertmaolab/software/gromacs/bin/gmx
> Data prefix: /data/albertmaolab/software/gromacs
> Command line:
>
> gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
>
> GROMACS version: 2018
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support: OpenCL
> SIMD instructions: SSE4.1
> FFT library: fftw-3.2.1
> RDTSCP usage: disabled
> TNG support: enabled
> Hwloc support: hwloc-1.5.0
> Tracing support: disabled
> Built on: 2018-02-22 07:25:43
> Built by: ahm17 at eris1pm01.research.partners.org [CMAKE]
> Build OS/arch: Linux 2.6.32-431.29.2.el6.x86_64 x86_64
> Build CPU vendor: Intel
> Build CPU brand: Common KVM processor
> Build CPU family: 15 Model: 6 Stepping: 1
> Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
> nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
> ssse3
> C compiler: /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
> C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler: /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
> C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
> OpenCL library: /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
> OpenCL version: 1.2
>
> Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
> Hardware detected:
> CPU info:
> Vendor: Intel
> Brand: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
> Family: 6 Model: 44 Stepping: 2
> Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> sse4.1 sse4.2 ssse3
> Hardware topology: Full, with devices
> Sockets, cores, and logical processors:
> Socket 0: [ 0] [ 2] [ 4] [ 6] [ 8] [ 10]
> Socket 1: [ 1] [ 3] [ 5] [ 7] [ 9] [ 11]
> Numa nodes:
> Node 0 (25759080448 bytes mem): 0 2 4 6 8 10
> Node 1 (25769799680 bytes mem): 1 3 5 7 9 11
> Latency:
> 0 1
> 0 1.00 2.00
> 1 2.00 1.00
> Caches:
> L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
> L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
> L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
> PCI devices:
> 0000:04:00.0 Id: 8086:10c9 Class: 0x0200 Numa: -1
> 0000:04:00.1 Id: 8086:10c9 Class: 0x0200 Numa: -1
> 0000:05:00.0 Id: 15b3:6746 Class: 0x0280 Numa: -1
> 0000:06:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
> 0000:01:03.0 Id: 1002:515e Class: 0x0300 Numa: -1
> 0000:00:1f.2 Id: 8086:3a20 Class: 0x0101 Numa: -1
> 0000:00:1f.5 Id: 8086:3a26 Class: 0x0101 Numa: -1
> 0000:14:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
> 0000:11:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
> GPU info:
> Number of GPUs detected: 3
> #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> OpenCL 1.1 CUDA, stat: compatible
> #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> OpenCL 1.1 CUDA, stat: compatible
> #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> OpenCL 1.1 CUDA, stat: compatible
>
> (later)
>
> Using 3 MPI threads
> Using 4 OpenMP threads per tMPI thread
> On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
> Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
> PP:0,PP:1,PP:2
> Pinning threads with an auto-selected logical core stride of 1
> System total charge: 0.000
> Will do PME sum in reciprocal space for electrostatic interactions.