[gmx-users] Excessive and gradually increasing memory usage with OpenCL

Szilárd Páll pall.szilard at gmail.com
Tue Mar 27 22:43:28 CEST 2018


Hi,

This is an issue I noticed recently, but I thought it was only
affecting some use-cases (or some runtimes). However, it seems to be a
broader problem. It is under investigation, but for now you can
eliminate it (or strongly diminish its effects) by turning off
GPU-side task timing. You can do that by setting the
GMX_DISABLE_GPU_TIMING environment variable.
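
For example, in a bash-like shell, reusing your own mdrun invocation
from below (any value should work; mdrun only checks that the variable
is set):

  export GMX_DISABLE_GPU_TIMING=1
  gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt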

Note that this is a workaround that may turn out not to be a complete
solution, so please report back if you've done longer runs.

Regarding the thread count: the MPI and CUDA runtimes can spawn
threads of their own, and GROMACS itself certainly used 3 x 4 threads
in your case. Note that you will likely get better performance with 6
ranks x 2 threads (both because this avoids ranks spanning sockets and
because it allows GPU task/transfer overlap).
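
A minimal sketch of such a launch (a suggestion only, assuming mdrun
2018's -ntmpi/-ntomp/-gputasks options; the "001122" string maps two
PP ranks onto each of your three GPUs):

  gmx mdrun -ntmpi 6 -ntomp 2 -gputasks 001122 -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt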

Cheers,
--
Szilárd


On Tue, Mar 27, 2018 at 4:09 PM, Albert Mao <albert.mao at gmail.com> wrote:
> Hello!
>
> I'm trying to run molecular dynamics on a fairly large system
> containing approximately 250000 atoms. The simulation runs well for
> about 100000 steps and then gets killed by the queueing engine due to
> exceeding the swap space usage limit. The compute node I'm using has
> 12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
> GROMACS 2018 and allowing mdrun to delegate the workload
> automatically, resulting in three thread-MPI ranks each with one GPU
> and four OpenMP threads. The queueing engine reports the following
> usage:
>
> TERM_SWAP: job killed after reaching LSF swap usage limit.
> Exited with exit code 131.
> Resource usage summary:
>     CPU time   :  50123.00 sec.
>     Max Memory :      4671 MB
>     Max Swap   :     30020 MB
>     Max Processes  :         5
>     Max Threads    :        35
>
> Even though it's a large system, by my rough estimate the simulation
> should not need much more than 0.5 gigabytes of memory (each
> single-precision array of 250000 3-vectors is only about 3 MB, so
> even with pair lists and PME grids the total should stay in the
> hundreds of megabytes); 4.6 GB seems like too much, and 30 GB is
> completely ridiculous.
> Indeed, the system runs well (but slowly) on a similar node without
> GPUs, consuming about 0.65 GB of memory and 2 GB of swap.
>
> I also don't understand why 35 threads got created.
>
> Could there be a memory leak somewhere in the OpenCL code? Any
> suggestions on preventing this growth in memory usage would be
> greatly appreciated.
>
> I've included relevant output from mdrun with system and configuration
> information at the end of this message. I'm using OpenCL despite
> having Nvidia GPUs because of a sad problem where building with CUDA
> support fails due to the C compiler being "too new".
>
> Thanks!
> -Albert Mao
>
> GROMACS:      gmx mdrun, version 2018
> Executable:   /data/albertmaolab/software/gromacs/bin/gmx
> Data prefix:  /data/albertmaolab/software/gromacs
> Command line:
>
>   gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
>
> GROMACS version:    2018
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        OpenCL
> SIMD instructions:  SSE4.1
> FFT library:        fftw-3.2.1
> RDTSCP usage:       disabled
> TNG support:        enabled
> Hwloc support:      hwloc-1.5.0
> Tracing support:    disabled
> Built on:           2018-02-22 07:25:43
> Built by:           ahm17 at eris1pm01.research.partners.org [CMAKE]
> Build OS/arch:      Linux 2.6.32-431.29.2.el6.x86_64 x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Common KVM processor
> Build CPU family:   15   Model: 6   Stepping: 1
> Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
> nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
> ssse3
> C compiler:         /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
> C compiler flags:    -msse4.1     -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler:       /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
> C++ compiler flags:  -msse4.1    -std=c++11   -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
> OpenCL library:     /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
> OpenCL version:     1.2
>
> Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
>     Family: 6   Model: 44   Stepping: 2
>     Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> sse4.1 sse4.2 ssse3
>   Hardware topology: Full, with devices
>     Sockets, cores, and logical processors:
>       Socket  0: [   0] [   2] [   4] [   6] [   8] [  10]
>       Socket  1: [   1] [   3] [   5] [   7] [   9] [  11]
>     Numa nodes:
>       Node  0 (25759080448 bytes mem):   0   2   4   6   8  10
>       Node  1 (25769799680 bytes mem):   1   3   5   7   9  11
>       Latency:
>                0     1
>          0  1.00  2.00
>          1  2.00  1.00
>     Caches:
>       L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>       L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
>       L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
>     PCI devices:
>       0000:04:00.0  Id: 8086:10c9  Class: 0x0200  Numa: -1
>       0000:04:00.1  Id: 8086:10c9  Class: 0x0200  Numa: -1
>       0000:05:00.0  Id: 15b3:6746  Class: 0x0280  Numa: -1
>       0000:06:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
>       0000:01:03.0  Id: 1002:515e  Class: 0x0300  Numa: -1
>       0000:00:1f.2  Id: 8086:3a20  Class: 0x0101  Numa: -1
>       0000:00:1f.5  Id: 8086:3a26  Class: 0x0101  Numa: -1
>       0000:14:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
>       0000:11:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
>   GPU info:
>     Number of GPUs detected: 3
>     #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> OpenCL 1.1 CUDA, stat: compatible
>     #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> OpenCL 1.1 CUDA, stat: compatible
>     #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> OpenCL 1.1 CUDA, stat: compatible
>
> (later)
>
> Using 3 MPI threads
> Using 4 OpenMP threads per tMPI thread
> On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
> Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
>   PP:0,PP:1,PP:2
> Pinning threads with an auto-selected logical core stride of 1
> System total charge: 0.000
> Will do PME sum in reciprocal space for electrostatic interactions.