[gmx-users] cudaMallocHost failed: unknown error

Christopher Neale chris.neale at alum.utoronto.ca
Fri Mar 23 21:26:49 CET 2018


Hello,

I am running GROMACS 5.1.2 on single nodes, where each run is set to use 32 cores and 4 GPUs. The run command is:

mpirun -np 32 gmx_mpi mdrun -deffnm MD -maxh $maxh -dd 4 4 2 -npme 0 -gpu_id 00000000111111112222222233333333 -ntomp 1 -notunepme

Some of my runs die with this error:
cudaMallocHost of size 1024128 bytes failed: unknown error

Below is the relevant part of the .log file. Searching the internet didn't turn up any solutions. I'll contact the sysadmins if you think this is likely a hardware problem or rogue jobs. In my testing, 6 out of a collection of 24 jobs died with this same error message (including the "1024128 bytes" and "pmalloc_cuda.cu, line: 70"), all on different nodes, and all of those nodes subsequently took repeat jobs that ran fine. When the error occurred, it was always right at the start of the run.
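
In case it helps with triage, here is a minimal standalone sketch (my own, not the GROMACS source) of the kind of pinned-host allocation that pmalloc_cuda.cu performs. The file name and the reuse of the 1024128-byte size are just illustrative choices; running something like this directly on a suspect node might show whether cudaMallocHost fails outside of GROMACS as well.

// diag_pinned_alloc.cu -- hypothetical diagnostic sketch, not GROMACS code.
// Attempts a pinned-host (page-locked) allocation of the size reported in
// the failing runs and prints the CUDA runtime's error string.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t nbytes = 1024128;   // size reported in the failing jobs
    void        *ptr    = nullptr;

    cudaError_t stat = cudaMallocHost(&ptr, nbytes);
    if (stat != cudaSuccess)
    {
        fprintf(stderr, "cudaMallocHost of %zu bytes failed: %s\n",
                nbytes, cudaGetErrorString(stat));
        return 1;
    }

    printf("cudaMallocHost of %zu bytes succeeded\n", nbytes);
    cudaFreeHost(ptr);
    return 0;
}

Compiling this with something like "nvcc diag_pinned_alloc.cu -o diag_pinned_alloc" and running it on a node shortly after a failure might help the sysadmins distinguish a driver or hardware issue from something GROMACS-specific.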


Thank you for your help,
Chris.



Command line:
  gmx_mpi mdrun -deffnm MD -maxh 0.9 -dd 4 4 2 -npme 0 -gpu_id 00000000111111112222222233333333 -ntomp 1 -notunepme


Number of logical cores detected (72) does not match the number reported by OpenMP (2).
Consider setting the launch configuration manually!

Running on 1 node with total 36 cores, 72 logical cores, 4 compatible GPUs
Hardware detected on host ko026.localdomain (the node of MPI rank 0):
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 4
    #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
    #1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
    #2: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
    #3: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible

Reading file MD.tpr, VERSION 5.1.2 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 32 MPI processes
Using 1 OpenMP thread per MPI process

On host ko026.localdomain 4 GPUs user-selected for this run.
Mapping of GPU IDs to the 32 PP ranks in this node: 0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3

NOTE: You assigned GPUs to multiple MPI processes.

NOTE: Your choice of number of MPI ranks and amount of resources results in using 1 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.


NOTE: GROMACS was configured without NVML support hence it can not exploit
      application clocks of the detected Tesla P100-PCIE-16GB GPU to improve performance.
      Recompile with the NVML library (compatible with the driver used) or set application clocks manually.


-------------------------------------------------------
Program gmx mdrun, VERSION 5.1.2
Source code file: /net/scratch3/cneale/exe/KODIAK/GROMACS/source/gromacs-5.1.2/src/gromacs/gmxlib/cuda_tools/pmalloc_cuda.cu, line: 70

Fatal error:
cudaMallocHost of size 1024128 bytes failed: unknown error

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Halting parallel program gmx mdrun on rank 31 out of 32
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31



