[gmx-users] Excessive and gradually increasing memory usage with OpenCL
Albert Mao
albert.mao at gmail.com
Thu Mar 29 00:17:04 CEST 2018
Thank you for this workaround!
Just setting the GMX_DISABLE_GPU_TIMING environment variable has
allowed mdrun to progress for several million steps. The memory usage
is still high, at about 1 GB of memory and 26 GB of swap, but it does not
appear to increase as the simulation progresses.
I tried 6 ranks x 2 threads as well, but performance was unchanged. I
think it's because the CPUs are spending time waiting for the GPUs;
Mark's suggestion to switch to native CUDA would probably make a
significant difference here. If this is an important recommendation,
the Gromacs installation guide should probably link to
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html,
which clarifies that even the latest release of CUDA does not come
close to being compatible with the latest version of GCC.
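For anyone stuck on the same build failure, the workaround I plan to
try is pointing the CUDA build at an older host compiler (an untested
sketch; /path/to/older/gcc is a placeholder for whatever older GCC your
system provides, since CUDA 8 rejects GCC 7):

    cmake .. -DGMX_GPU=on \
             -DCUDA_TOOLKIT_ROOT_DIR=/apps/lib-osver/cuda/8.0.61 \
             -DCUDA_HOST_COMPILER=/path/to/older/gcc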
-Albert Mao
On Tue, Mar 27, 2018 at 4:43 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> Hi,
>
> This is an issue I noticed recently, but I thought it was only
> affecting some use cases (or some runtimes). However, it seems to be
> a broader problem. It is under investigation, but for now you can
> eliminate it (or strongly diminish its effects) by turning off
> GPU-side task timing. You can do that by setting the
> GMX_DISABLE_GPU_TIMING environment variable.
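> For example, with a bash shell (just a sketch; adapt it to your LSF
> job script):
>
>     export GMX_DISABLE_GPU_TIMING=1
>     gmx mdrun -v -deffnm blah
>
> The variable only needs to be visible in the environment mdrun runs
> in, so exporting it from the submission script should be enough.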
>
> Note that this is a workaround that may turn out not to be a complete
> solution, so please report back if you've done longer runs.
>
> Regarding the thread count: the MPI and CUDA runtimes can spawn
> threads of their own; GROMACS itself certainly used 3 x 4 threads in
> your case. Note that you will likely get better performance with
> 6 ranks x 2 threads (both because that avoids ranks spanning across
> sockets and because it allows GPU task/transfer overlap).
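> As an untested sketch (check gmx mdrun -h for the exact option names
> in your version), that would be something like:
>
>     gmx mdrun -v -deffnm blah -ntmpi 6 -ntomp 2 -gputasks 001122
>
> where the -gputasks string maps the six PP ranks onto your three
> GPUs, two ranks per GPU.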
>
> Cheers,
> --
> Szilárd
>
>
> On Tue, Mar 27, 2018 at 4:09 PM, Albert Mao <albert.mao at gmail.com> wrote:
>> Hello!
>>
>> I'm trying to run molecular dynamics on a fairly large system
>> containing approximately 250000 atoms. The simulation runs well for
>> about 100000 steps and then gets killed by the queueing engine due to
>> exceeding the swap space usage limit. The compute node I'm using has
>> 12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
>> GROMACS 2018 and allowing mdrun to delegate the workload
>> automatically, resulting in three thread-MPI ranks each with one GPU
>> and four OpenMP threads. The queueing engine reports the following
>> usage:
>>
>> TERM_SWAP: job killed after reaching LSF swap usage limit.
>> Exited with exit code 131.
>> Resource usage summary:
>> CPU time : 50123.00 sec.
>> Max Memory : 4671 MB
>> Max Swap : 30020 MB
>> Max Processes : 5
>> Max Threads : 35
>>
>> Even though it's a large system, by my rough estimate, the simulation
>> should not need much more than 0.5 gigabytes of memory; 4.6 GB seems
>> like too much and 30 GB is completely ridiculous.
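>> (Roughly: positions, velocities, and forces for 250000 atoms in
>> single precision come to 3 arrays x 250000 atoms x 3 components x 4
>> bytes, about 9 MB; even with pair lists, PME grids, and per-rank
>> buffers on top of that, the total should stay in the low hundreds of
>> megabytes.)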
>> Indeed, running the same system on a similar node without GPUs works
>> well (though slowly), consuming about 0.65 GB of memory and 2 GB of swap.
>>
>> I also don't understand why 35 threads got created.
>>
>> Could there be a memory leak somewhere in the OpenCL code? Any
>> suggestions on preventing this memory usage expansion would be greatly
>> appreciated.
>>
>> I've included relevant output from mdrun with system and configuration
>> information at the end of this message. I'm using OpenCL despite
>> having Nvidia GPUs because of a sad problem where building with CUDA
>> support fails due to the C compiler being "too new".
>>
>> Thanks!
>> -Albert Mao
>>
>> GROMACS: gmx mdrun, version 2018
>> Executable: /data/albertmaolab/software/gromacs/bin/gmx
>> Data prefix: /data/albertmaolab/software/gromacs
>> Command line:
>>
>> gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
>>
>> GROMACS version: 2018
>> Precision: single
>> Memory model: 64 bit
>> MPI library: thread_mpi
>> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
>> GPU support: OpenCL
>> SIMD instructions: SSE4.1
>> FFT library: fftw-3.2.1
>> RDTSCP usage: disabled
>> TNG support: enabled
>> Hwloc support: hwloc-1.5.0
>> Tracing support: disabled
>> Built on: 2018-02-22 07:25:43
>> Built by: ahm17 at eris1pm01.research.partners.org [CMAKE]
>> Build OS/arch: Linux 2.6.32-431.29.2.el6.x86_64 x86_64
>> Build CPU vendor: Intel
>> Build CPU brand: Common KVM processor
>> Build CPU family: 15 Model: 6 Stepping: 1
>> Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
>> nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
>> ssse3
>> C compiler: /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
>> C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
>> -fexcess-precision=fast
>> C++ compiler: /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
>> C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
>> -funroll-all-loops -fexcess-precision=fast
>> OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
>> OpenCL library: /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
>> OpenCL version: 1.2
>>
>> Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
>> Hardware detected:
>> CPU info:
>> Vendor: Intel
>> Brand: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
>> Family: 6 Model: 44 Stepping: 2
>> Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
>> sse4.1 sse4.2 ssse3
>> Hardware topology: Full, with devices
>> Sockets, cores, and logical processors:
>> Socket 0: [ 0] [ 2] [ 4] [ 6] [ 8] [ 10]
>> Socket 1: [ 1] [ 3] [ 5] [ 7] [ 9] [ 11]
>> Numa nodes:
>> Node 0 (25759080448 bytes mem): 0 2 4 6 8 10
>> Node 1 (25769799680 bytes mem): 1 3 5 7 9 11
>> Latency:
>> 0 1
>> 0 1.00 2.00
>> 1 2.00 1.00
>> Caches:
>> L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>> L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>> L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
>> PCI devices:
>> 0000:04:00.0 Id: 8086:10c9 Class: 0x0200 Numa: -1
>> 0000:04:00.1 Id: 8086:10c9 Class: 0x0200 Numa: -1
>> 0000:05:00.0 Id: 15b3:6746 Class: 0x0280 Numa: -1
>> 0000:06:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> 0000:01:03.0 Id: 1002:515e Class: 0x0300 Numa: -1
>> 0000:00:1f.2 Id: 8086:3a20 Class: 0x0101 Numa: -1
>> 0000:00:1f.5 Id: 8086:3a26 Class: 0x0101 Numa: -1
>> 0000:14:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> 0000:11:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> GPU info:
>> Number of GPUs detected: 3
>> #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>>
>> (later)
>>
>> Using 3 MPI threads
>> Using 4 OpenMP threads per tMPI thread
>> On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
>> Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
>> PP:0,PP:1,PP:2
>> Pinning threads with an auto-selected logical core stride of 1
>> System total charge: 0.000
>> Will do PME sum in reciprocal space for electrostatic interactions.