[gmx-users] frequent stop of GPU jobs at specific simulation steps.

Zhongjin He hzj1000 at 163.com
Mon Sep 28 05:34:48 CEST 2015


Dear GMX users,


Recently I encountered a strange problem using GROMACS 5.0.6 with GPU acceleration. The GPU node has 24 CPU cores and 1 NVIDIA Tesla K40m card; the details of the node are below:
1 x GPU node (Supermicro 2U server equivalent or better)
- 2 x 12-Core Intel E5-2670v3 Processor
- 64GB RAM (4x16GB DDR4-2133 REG ECC)
- 2 x 1TB 7200RPM SATA 6Gbps Enterprise HDD
- Integrated 1GbE ports
- 1 x NVIDIA Tesla K40m 12 GB GDDR5 GPU card


GROMACS 5.0.6 with GPU support was installed by an engineer from the company that sold us this GPU node. This is how the cmake configuration was done during compilation:
sudo cmake .. \
  -DCMAKE_C_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicc \
  -DCMAKE_CXX_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicxx \
  -DGMX_MPI=on \
  -DGMX_GPU=on \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
  -DCMAKE_PREFIX_PATH=/data/apps/fftw-3.3.4-intel-sp/ \
  -DGMX_FFT_LIBRARY=fftw3 \
  -DCMAKE_INSTALL_PREFIX=/data/apps/gromacs-5.0.6-gpu-sp
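For completeness, this is roughly how the installed binary can be checked (just a sketch; the module name matches the install prefix above):
module load gromacs-5.0.6-gpu-sp
mdrun_mpi -version
which prints the build configuration, including whether GPU support was compiled in and which CUDA and compiler versions were used.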


I run the job with OpenMP threads rather than mpirun, because it runs faster that way. I use 12 CPU cores and 1 GPU card per job:
module load gromacs-5.0.6-gpu-sp
export OMP_NUM_THREADS=12
mdrun_mpi -deffnm md -pin on -pinoffset 0
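To confirm that a run really uses the GPU and the 12 OpenMP threads, I look at the hardware detection lines near the top of md.log (a quick sketch, assuming the default log name from -deffnm md):
grep -i "gpu" md.log | head
grep -i "openmp" md.log | head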


I have found that such GPU jobs run for about 10+ hours and then stop at a specific simulation step. The symptoms are similar to those described in http://permalink.gmane.org/gmane.science.biology.gromacs.user/77958, where the problem was described as:

> 1. My simulation stops every 10+ hours. Specifically, the job is still
> “running” in the queue but md.log/traj.trr stop updating.

and the reply there was that this suggests the filesystem has gone AWOL, or filled up, or hit a 2 GB file size limit, or something similar.
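Following that suggestion, these are the checks I can run in the run directory when a job hangs (a sketch; the filesystem path is only an example):
df -h /home
ls -lh md.log traj.trr md.cpt
ulimit -a
to see whether the filesystem is full, whether any output file is approaching a 2 GB limit, and whether a per-process file size limit is set.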


qstat reports the state of this GPU job as running. The job runs partly on these 12 CPU cores and partly on the GPU card. But when I log in to the GPU node, the CPU part of the job is missing, while nvidia-smi still lists the mdrun_mpi process on the GPU, with GPU-Util at 0%:
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
| N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     28651    C   mdrun_mpi                                       82MiB |
+-----------------------------------------------------------------------------+
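To check the CPU side of the same job on the node, I also look at the mdrun_mpi process and its threads (a sketch; PID 28651 is taken from the nvidia-smi output above):
ps -eLf | grep [m]drun_mpi
grep -E 'State|Threads' /proc/28651/status
When the job is healthy I would expect to see roughly 12 busy threads; after the stop the CPU threads show no activity, even though nvidia-smi still lists the process.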





On this GPU node I have tested two systems, and both stopped at specific simulation steps.
The first system is in /home/he/C1C2C3/C3_New. I have tested it 3 times: the first run is named run1; after that job died, I resubmitted (NOT restarted) it as run2, and after run2 died, I resubmitted the same job as run3. These 3 tests of the same simulation system show that, after 10+ hours of simulation, they all died at the same simulation step. I used the command
mdrun_mpi -deffnm md -pin on -pinoffset 12
or
mdrun_mpi -deffnm md
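(By "resubmit (NOT restart)" I mean that each run was started again from step 0 from the same md.tpr, not continued from a checkpoint; a continuation would instead look roughly like
mdrun_mpi -deffnm md -cpi md.cpt -pin on -pinoffset 12
using the standard -cpi flag.)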
Run1
tail -n 10 /home/he/C1C2C3/C3_New/run1/md.log
           Step           Time         Lambda
       62245000   124490.00000        0.00000

   Energies (kJ/mol)
          Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.99377e+02    3.60666e+04   -2.97791e+03   -2.49639e+05    8.68339e+02
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -2.15283e+05    2.59468e+04   -1.89336e+05    2.49125e+02   -3.11737e+02
 Pressure (bar)   Constr. rmsd
    5.47750e+02    1.02327e-06


Run2
tail -n 10 /home/he/C1C2C3/C3_New/run2/md.log
           Step           Time         Lambda
       62245000   124490.00000        0.00000

   Energies (kJ/mol)
          Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    4.24556e+02    3.59301e+04   -2.98718e+03   -2.50283e+05    8.91892e+02
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -2.16023e+05    2.57250e+04   -1.90298e+05    2.46995e+02   -3.13679e+02
 Pressure (bar)   Constr. rmsd
    5.07704e+02    9.80011e-07


Run3
tail -n 10 /home/he/C1C2C3/C3_New/md.log
           Step           Time         Lambda
       62245000   124490.00000        0.00000

   Energies (kJ/mol)
          Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    4.58207e+02    3.55527e+04   -3.00135e+03   -2.48904e+05    8.72284e+02
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -2.15022e+05    2.60553e+04   -1.88967e+05    2.50167e+02   -3.16659e+02
 Pressure (bar)   Constr. rmsd
    6.59141e+02    1.04033e-06


A similar problem was also found in the second simulation system, in /home/he/C1C2C3/PureCO2TraPPE. I used the command
mdrun_mpi -deffnm md -pin on -pinoffset 0
Run1
tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/run1/md.log
           Step           Time         Lambda
       65470000   130940.00000        0.00000

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    2.80567e+04   -1.99950e+03   -2.05836e+05    1.06692e+03   -1.78712e+05
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    2.09106e+04   -1.57801e+05    2.48747e+02   -2.91211e+02    4.97676e+02
   Constr. rmsd
    6.64528e-07
Run2
tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/md.log
           Step           Time         Lambda
       65470000   130940.00000        0.00000

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    2.78408e+04   -2.00737e+03   -2.05654e+05    1.07244e+03   -1.78748e+05
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    2.09921e+04   -1.57756e+05    2.49716e+02   -2.93505e+02    5.29104e+02
   Constr. rmsd
    7.43034e-07


Dear GMX users, has anybody encountered such a problem with GPU GROMACS? Please give me some advice on how to solve it. Thanks!


Best regards,


Zhongjin HE




 

