[gmx-users] Excessive and gradually increasing memory usage with OpenCL
Albert Mao
albert.mao at gmail.com
Thu Mar 29 00:17:04 CEST 2018
Thank you for this workaround!
Just setting the GMX_DISABLE_GPU_TIMING environment variable has
allowed mdrun to progress for several million steps. The memory usage
is still high, at about 1 GB of memory and 26 GB of swap, but it does not
appear to increase as the simulation progresses.
I tried 6 ranks x 2 threads as well, but performance was unchanged. I
think it's because the CPUs are spending time waiting for the GPUs;
Mark's suggestion to switch to native CUDA would probably make a
significant difference here. If this is an important recommendation,
the Gromacs installation guide should probably link to
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html,
which clarifies that even the latest release of CUDA does not come
close to being compatible with the latest version of GCC.
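For anyone stuck on the same build failure, the workaround I plan to
try is pointing the CUDA build at an older host compiler (an untested
sketch; /path/to/older/gcc is a placeholder for whatever older GCC your
system provides, since CUDA 8 rejects GCC 7):

    cmake .. -DGMX_GPU=on \
             -DCUDA_TOOLKIT_ROOT_DIR=/apps/lib-osver/cuda/8.0.61 \
             -DCUDA_HOST_COMPILER=/path/to/older/gcc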
-Albert Mao
On Tue, Mar 27, 2018 at 4:43 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
> Hi,
>
> This is an issue I noticed recently, but I thought it was only
> affecting some use cases (or some runtimes). However, it seems to be
> a broader problem. It is under investigation, but for now you can
> eliminate it (or strongly diminish its effects) by turning off
> GPU-side task timing. You can do that by setting the
> GMX_DISABLE_GPU_TIMING environment variable.
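> For example, with a bash shell (just a sketch; adapt it to your LSF
> job script):
>
>     export GMX_DISABLE_GPU_TIMING=1
>     gmx mdrun -v -deffnm blah
>
> The variable only needs to be visible in the environment mdrun runs
> in, so exporting it from the submission script should be enough.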
>
> Note that this is a workaround that may turn out not to be a complete
> solution, so please report back if you've done longer runs.
>
> Regarding the thread count: the MPI and CUDA runtimes can spawn
> threads of their own; GROMACS itself certainly used 3 x 4 threads in
> your case. Note that you will likely get better performance with
> 6 ranks x 2 threads (both because that avoids ranks spanning across
> sockets and because it allows GPU task/transfer overlap).
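> As an untested sketch (check gmx mdrun -h for the exact option names
> in your version), that would be something like:
>
>     gmx mdrun -v -deffnm blah -ntmpi 6 -ntomp 2 -gputasks 001122
>
> where the -gputasks string maps the six PP ranks onto your three
> GPUs, two ranks per GPU.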
>
> Cheers,
> --
> Szilárd
>
>
> On Tue, Mar 27, 2018 at 4:09 PM, Albert Mao <albert.mao at gmail.com> wrote:
>> Hello!
>>
>> I'm trying to run molecular dynamics on a fairly large system
>> containing approximately 250000 atoms. The simulation runs well for
>> about 100000 steps and then gets killed by the queueing engine due to
>> exceeding the swap space usage limit. The compute node I'm using has
>> 12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
>> GROMACS 2018 and allowing mdrun to delegate the workload
>> automatically, resulting in three thread-MPI ranks each with one GPU
>> and four OpenMP threads. The queueing engine reports the following
>> usage:
>>
>> TERM_SWAP: job killed after reaching LSF swap usage limit.
>> Exited with exit code 131.
>> Resource usage summary:
>> CPU time : 50123.00 sec.
>> Max Memory : 4671 MB
>> Max Swap : 30020 MB
>> Max Processes : 5
>> Max Threads : 35
>>
>> Even though it's a large system, by my rough estimate, the simulation
>> should not need much more than 0.5 gigabytes of memory; 4.6 GB seems
>> like too much and 30 GB is completely ridiculous.
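>> (Roughly: positions, velocities, and forces for 250000 atoms in
>> single precision come to 3 arrays x 250000 atoms x 3 components x 4
>> bytes, about 9 MB; even with pair lists, PME grids, and per-rank
>> buffers on top of that, the total should stay in the low hundreds of
>> megabytes.)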
>> Indeed, running the same system on a similar node without GPUs works
>> well (though slowly), consuming about 0.65 GB of memory and 2 GB of swap.
>>
>> I also don't understand why 35 threads got created.
>>
>> Could there be a memory leak somewhere in the OpenCL code? Any
>> suggestions on preventing this memory usage expansion would be greatly
>> appreciated.
>>
>> I've included relevant output from mdrun with system and configuration
>> information at the end of this message. I'm using OpenCL despite
>> having Nvidia GPUs because of a sad problem where building with CUDA
>> support fails due to the C compiler being "too new".
>>
>> Thanks!
>> -Albert Mao
>>
>> GROMACS: gmx mdrun, version 2018
>> Executable: /data/albertmaolab/software/gromacs/bin/gmx
>> Data prefix: /data/albertmaolab/software/gromacs
>> Command line:
>>
>> gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
>>
>> GROMACS version: 2018
>> Precision: single
>> Memory model: 64 bit
>> MPI library: thread_mpi
>> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
>> GPU support: OpenCL
>> SIMD instructions: SSE4.1
>> FFT library: fftw-3.2.1
>> RDTSCP usage: disabled
>> TNG support: enabled
>> Hwloc support: hwloc-1.5.0
>> Tracing support: disabled
>> Built on: 2018-02-22 07:25:43
>> Built by: ahm17 at eris1pm01.research.partners.org [CMAKE]
>> Build OS/arch: Linux 2.6.32-431.29.2.el6.x86_64 x86_64
>> Build CPU vendor: Intel
>> Build CPU brand: Common KVM processor
>> Build CPU family: 15 Model: 6 Stepping: 1
>> Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
>> nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
>> ssse3
>> C compiler: /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
>> C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
>> -fexcess-precision=fast
>> C++ compiler: /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
>> C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
>> -funroll-all-loops -fexcess-precision=fast
>> OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
>> OpenCL library: /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
>> OpenCL version: 1.2
>>
>> Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
>> Hardware detected:
>> CPU info:
>> Vendor: Intel
>> Brand: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
>> Family: 6 Model: 44 Stepping: 2
>> Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
>> sse4.1 sse4.2 ssse3
>> Hardware topology: Full, with devices
>> Sockets, cores, and logical processors:
>> Socket 0: [ 0] [ 2] [ 4] [ 6] [ 8] [ 10]
>> Socket 1: [ 1] [ 3] [ 5] [ 7] [ 9] [ 11]
>> Numa nodes:
>> Node 0 (25759080448 bytes mem): 0 2 4 6 8 10
>> Node 1 (25769799680 bytes mem): 1 3 5 7 9 11
>> Latency:
>> 0 1
>> 0 1.00 2.00
>> 1 2.00 1.00
>> Caches:
>> L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>> L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>> L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
>> PCI devices:
>> 0000:04:00.0 Id: 8086:10c9 Class: 0x0200 Numa: -1
>> 0000:04:00.1 Id: 8086:10c9 Class: 0x0200 Numa: -1
>> 0000:05:00.0 Id: 15b3:6746 Class: 0x0280 Numa: -1
>> 0000:06:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> 0000:01:03.0 Id: 1002:515e Class: 0x0300 Numa: -1
>> 0000:00:1f.2 Id: 8086:3a20 Class: 0x0101 Numa: -1
>> 0000:00:1f.5 Id: 8086:3a26 Class: 0x0101 Numa: -1
>> 0000:14:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> 0000:11:00.0 Id: 10de:06d2 Class: 0x0302 Numa: -1
>> GPU info:
>> Number of GPUs detected: 3
>> #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>> #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
>> OpenCL 1.1 CUDA, stat: compatible
>>
>> (later)
>>
>> Using 3 MPI threads
>> Using 4 OpenMP threads per tMPI thread
>> On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
>> Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
>> PP:0,PP:1,PP:2
>> Pinning threads with an auto-selected logical core stride of 1
>> System total charge: 0.000
>> Will do PME sum in reciprocal space for electrostatic interactions.