[gmx-users] Excessive and gradually increasing memory usage with OpenCL
albert.mao at gmail.com
Thu Mar 29 00:17:04 CEST 2018
Thank you for this workaround!
Just setting the GMX_DISABLE_GPU_TIMING environment variable has
allowed mdrun to progress for several million steps. The memory usage
is still high at about 1 GB of memory and 26 GB of swap, but it does not
appear to increase as the simulation progresses.
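
For anyone who finds this thread later, here is a minimal sketch of how the workaround can be applied, assuming a bash shell. The value "1" is my assumption; the advice in this thread only says the variable has to be set. The file names are the ones from my original command line.

```shell
#!/usr/bin/env bash
# Workaround from this thread: turn off GPU-side task timing by setting
# GMX_DISABLE_GPU_TIMING. The value "1" is an assumption; the thread only
# says that the variable has to be set.
export GMX_DISABLE_GPU_TIMING=1

# mdrun command line as given in the original report.
MDRUN_CMD=(gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt)

# Print the command so the sketch runs even without GROMACS installed.
echo "GMX_DISABLE_GPU_TIMING=${GMX_DISABLE_GPU_TIMING}"
echo "Command: ${MDRUN_CMD[*]}"

# Uncomment to actually launch the run:
# "${MDRUN_CMD[@]}"
```

Exporting the variable in the job script (rather than in an interactive shell) makes sure it is also set on the compute node where the queueing system starts mdrun.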
I tried 6 ranks x 2 threads as well, but performance was unchanged. I
think it's because the CPUs are spending time waiting for the GPUs;
Mark's suggestion to switch to native CUDA would probably make a
significant difference here. If this is an important recommendation,
the Gromacs installation guide should probably link to
which clarifies that even the latest release of CUDA does not come
close to being compatible with the latest version of GCC.
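
For reference, a rough sketch of how the 6 ranks x 2 threads layout mentioned above can be requested, assuming a bash shell and a thread-MPI build like mine. The `-ntmpi` and `-ntomp` options set the number of thread-MPI ranks and OpenMP threads per rank; the variable names are only illustrative, and I am assuming mdrun shares the three GPUs among the six ranks on its own.

```shell
#!/usr/bin/env bash
# Layout discussed in this thread: 6 thread-MPI ranks x 2 OpenMP threads,
# which fills the 12 cores of the node.
NTMPI=6
NTOMP=2
TOTAL_THREADS=$((NTMPI * NTOMP))

# Same inputs as the original command line, plus the explicit layout.
# GPU-to-rank assignment is left to mdrun here (an assumption).
MDRUN_CMD=(gmx mdrun -v -pforce 10000 -ntmpi "$NTMPI" -ntomp "$NTOMP"
           -s blah.tpr -deffnm blah -cpi blah.cpt)

# Print the layout and command so the sketch runs without GROMACS installed.
echo "Total threads: ${TOTAL_THREADS}"
echo "Command: ${MDRUN_CMD[*]}"

# Uncomment to actually launch the run:
# "${MDRUN_CMD[@]}"
```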
On Tue, Mar 27, 2018 at 4:43 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> This is an issue I noticed recently, but I thought it was only
> affecting some use-cases (or some runtimes). However, it seems it's a
> broader problem. It is under investigation, but for now it seems that
> you can eliminate it (or strongly diminish its effects) by turning off
> GPU-side task timing. You can do that by setting the
> GMX_DISABLE_GPU_TIMING environment variable.
> Note that this is a workaround that may turn out not to be a complete
> solution; please report back if you've done longer runs.
> Regarding the thread count, the MPI and CUDA runtimes can spawn
> threads; GROMACS certainly used 3 x 4 threads in your case. Note that
> you will likely get better performance by using 6 ranks x 2 threads
> (both because this avoids ranks spanning across sockets and it allows
> GPU task/transfer overlap).
> On Tue, Mar 27, 2018 at 4:09 PM, Albert Mao <albert.mao at gmail.com> wrote:
>> I'm trying to run molecular dynamics on a fairly large system
>> containing approximately 250000 atoms. The simulation runs well for
>> about 100000 steps and then gets killed by the queueing engine due to
>> exceeding the swap space usage limit. The compute node I'm using has
>> 12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
>> GROMACS 2018 and allowing mdrun to delegate the workload
>> automatically, resulting in three thread-MPI ranks each with one GPU
>> and four OpenMP threads. The queueing engine reports the following:
>> TERM_SWAP: job killed after reaching LSF swap usage limit.
>> Exited with exit code 131.
>> Resource usage summary:
>> CPU time : 50123.00 sec.
>> Max Memory : 4671 MB
>> Max Swap : 30020 MB
>> Max Processes : 5
>> Max Threads : 35
>> Even though it's a large system, by my rough estimate, the simulation
>> should not need much more than 0.5 gigabytes of memory; 4.6 GB seems
>> like too much and 30 GB is completely ridiculous.
>> Indeed, running the system on a similar node without GPUs is working
>> well (but slowly), consuming about 0.65 GB of memory and 2 GB of swap.
>> I also don't understand why 35 threads got created.
>> Could there be a memory leak somewhere in the OpenCL code? Any
>> suggestions on preventing this memory usage expansion would be greatly
>> appreciated.
>> I've included relevant output from mdrun with system and configuration
>> information at the end of this message. I'm using OpenCL despite
>> having Nvidia GPUs because of a sad problem where building with CUDA
>> support fails due to the C compiler being "too new".
>> -Albert Mao
>> GROMACS: gmx mdrun, version 2018
>> Executable: /data/albertmaolab/software/gromacs/bin/gmx
>> Data prefix: /data/albertmaolab/software/gromacs
>> Command line:
>> gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
>> GROMACS version: 2018
>> Precision: single
>> Memory model: 64 bit
>> MPI library: thread_mpi
>> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
>> GPU support: OpenCL
>> SIMD instructions: SSE4.1
>> FFT library: fftw-3.2.1
>> RDTSCP usage: disabled
>> TNG support: enabled
>> Hwloc support: hwloc-1.5.0
>> Tracing support: disabled
>> Built on: 2018-02-22 07:25:43
>> Built by: ahm17 at eris1pm01.research.partners.org [CMAKE]
>> Build OS/arch: Linux 2.6.32-431.29.2.el6.x86_64 x86_64
>> Build CPU vendor: Intel
>> Build CPU brand: Common KVM processor
>> Build CPU family: 15 Model: 6 Stepping: 1
>> Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
>> nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
>> C compiler: /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
>> C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
>> C++ compiler: /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
>> C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
>> -funroll-all-loops -fexcess-precision=fast
>> OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
>> OpenCL library: /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
>> OpenCL version: 1.2
>> Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
>> Hardware detected:
>> CPU info:
>> Vendor: Intel
>> Brand: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
>> Family: 6 Model: 44 Stepping: 2
>> Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
>> sse4.1 sse4.2 ssse3
>> Hardware topology: Full, with devices
>> Sockets, cores, and logical processors:
>> Socket 0: [ 0] [ 2] [ 4] [ 6] [ 8] [ 10]
>> Socket 1: [ 1] [ 3] [ 5] [ 7] [ 9] [ 11]
>> Numa nodes:
>> Node 0 (25759080448 bytes mem): 0 2 4 6 8 10
>> Node 1 (25769799680 bytes mem): 1 3 5 7 9 11
>> Latency:
>>          0     1
>>    0  1.00  2.00
>>    1  2.00  1.00
>> L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>> L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>> L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
>> PCI devices:
>> 0000:04:00.0 Id: 8086:10c9 Class: 0x0200 Numa: -1
>> 0000:04:00.1 Id: 8086:10c9 Class: 0x0200 Numa: -1
>> 0000:05:00.0 Id: 15b3:6746 Class: 0x0280 Numa: -1
>> 0000:06:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> 0000:01:03.0 Id: 1002:515e Class: 0x0300 Numa: -1
>> 0000:00:1f.2 Id: 8086:3a20 Class: 0x0101 Numa: -1
>> 0000:00:1f.5 Id: 8086:3a26 Class: 0x0101 Numa: -1
>> 0000:14:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> 0000:11:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> GPU info:
>> Number of GPUs detected: 3
>> #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> Using 3 MPI threads
>> Using 4 OpenMP threads per tMPI thread
>> On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
>> Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
>> Pinning threads with an auto-selected logical core stride of 1
>> System total charge: 0.000
>> Will do PME sum in reciprocal space for electrostatic interactions.
>> Gromacs Users mailing list
>> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-request at gromacs.org.