[gmx-users] frequent stop of GPU jobs at specific simulation steps.
Zhongjin He
hzj1000 at 163.com
Mon Sep 28 05:34:48 CEST 2015
Dear GMX users,
Recently I encountered a strange problem using GPU GROMACS 5.0.6. The GPU node has 24 CPU cores and 1 Nvidia Tesla K40M card; the details of the node are below:
1 x GPU Node (Supermicro 2U Server equivalent or better)
- 2 x 12-Core Intel E5-2670v3 Processor
- 64GB RAM (4x16GB DDR4-2133 REG ECC)
- 2 x 1TB 7200RPM SATA 6Gbps Enterprise HDD
- Integrated 1GbE ports
- 1 x Nvidia Tesla K40M 12GB GDDR5 GPU Card
GPU GROMACS 5.0.6 was installed by an engineer from the company that sold us this GPU node. This is how CMake was configured during compilation:
sudo cmake .. \
  -DCMAKE_C_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicc \
  -DCMAKE_CXX_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicxx \
  -DGMX_MPI=on \
  -DGMX_GPU=on \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
  -DCMAKE_PREFIX_PATH=/data/apps/fftw-3.3.4-intel-sp/ \
  -DGMX_FFT_LIBRARY=fftw3 \
  -DCMAKE_INSTALL_PREFIX=/data/apps/gromacs-5.0.6-gpu-sp
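For what it is worth, this is roughly how I check that the installed binary was really built with MPI, GPU and FFTW support (a sketch; -version simply prints the build information):

module load gromacs-5.0.6-gpu-sp
mdrun_mpi -version | grep -iE 'gpu|cuda|fft|mpi'    # look for "GPU support: enabled" etc.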
I use OpenMP threads rather than mpirun to run a GPU job, as it runs faster. I use 12 CPU cores and 1 GPU card per job:
module load gromacs-5.0.6-gpu-sp
export OMP_NUM_THREADS=12
mdrun_mpi -deffnm md -pin on -pinoffset 0
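For reference, this is a minimal sketch of the batch script I wrap these commands in (I assume a PBS/Torque-style queue here because I check the job with qstat; the job name, resource request and walltime are placeholders):

#!/bin/bash
#PBS -N md_gpu                  # placeholder job name
#PBS -l nodes=1:ppn=12          # placeholder resource request: 12 cores on one node
#PBS -l walltime=48:00:00       # placeholder walltime
cd $PBS_O_WORKDIR

module load gromacs-5.0.6-gpu-sp
export OMP_NUM_THREADS=12
mdrun_mpi -deffnm md -pin on -pinoffset 0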
I have found that such a GPU job runs for about 10+ hours and then stops at a specific simulation step. The symptoms of this problem are similar to those described in http://permalink.gmane.org/gmane.science.biology.gromacs.user/77958:
> and here are the descriptions of my problems:
> 1. My simulation stops every 10+ hours. Specifically, the job is still
> “running” in the queue but md.log/traj.trr stop updating.

and the reply in that thread was:

> That suggests the filesystem has gone AWOL, or filled, or got to a 2GB file
> size limit, or such.
Using qstat to check the state of this GPU job shows that it is still running. The job runs partly on these 12 CPU cores and partly on the GPU card. But when I log in to the GPU node, the CPU part of the job is gone, while nvidia-smi still shows the mdrun_mpi process on the GPU, with GPU-Util at 0% (see the output below and the checks I run after it):
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
| N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     28651     C  mdrun_mpi                                       82MiB |
+-----------------------------------------------------------------------------+
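When a job hangs like this, these are the kinds of checks I run on the node (a sketch; the paths are from my setup, and the file-size checks follow the 2GB-limit suggestion quoted above):

ps aux | grep mdrun_mpi                    # CPU-side process: gone when the job hangs
df -h /home                                # is the filesystem full or unreachable?
ulimit -f                                  # per-process file size limit in this shell
ls -lh /home/he/C1C2C3/C3_New/run1/md.*    # have md.log / md.trr hit a size limit?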
On this GPU node I have actually tested two systems, and both stopped at specific simulation steps.
The first system is in /home/he/C1C2C3/C3_New. I have tested it 3 times; the first run is named run1. After this job died, I resubmitted (NOT restarted) it as run2, and after run2 died I resubmitted the same job as run3. These 3 tests of the same simulation system show that, after 10+ hours of simulation, they all died at the same simulation step (a quick comparison is sketched after the three log tails below). I use the command mdrun_mpi -deffnm md -pin on -pinoffset 12 or mdrun_mpi -deffnm md.
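To be clear, by "resubmit (NOT restart)" I mean that each run starts again from the same md.tpr rather than continuing from a checkpoint. A true continuation would look roughly like this (a sketch of the standard -cpi option, not what I did here):

mdrun_mpi -deffnm md -cpi md.cpt -pin on -pinoffset 0    # continue from the checkpoint instead of starting over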
Run1
tail -n 10 /home/he/C1C2C3/C3_New/run1/md.log
           Step           Time         Lambda
       62245000   124490.00000        0.00000

   Energies (kJ/mol)
          Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.99377e+02    3.60666e+04   -2.97791e+03   -2.49639e+05    8.68339e+02
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -2.15283e+05    2.59468e+04   -1.89336e+05    2.49125e+02   -3.11737e+02
 Pressure (bar)   Constr. rmsd
    5.47750e+02    1.02327e-06
Run2
tail -n 10 /home/he/C1C2C3/C3_New/run2/md.log
           Step           Time         Lambda
       62245000   124490.00000        0.00000

   Energies (kJ/mol)
          Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    4.24556e+02    3.59301e+04   -2.98718e+03   -2.50283e+05    8.91892e+02
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -2.16023e+05    2.57250e+04   -1.90298e+05    2.46995e+02   -3.13679e+02
 Pressure (bar)   Constr. rmsd
    5.07704e+02    9.80011e-07
Run3
tail -n 10 /home/he/C1C2C3/C3_New/md.log
           Step           Time         Lambda
       62245000   124490.00000        0.00000

   Energies (kJ/mol)
          Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    4.58207e+02    3.55527e+04   -3.00135e+03   -2.48904e+05    8.72284e+02
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -2.15022e+05    2.60553e+04   -1.88967e+05    2.50167e+02   -3.16659e+02
 Pressure (bar)   Constr. rmsd
    6.59141e+02    1.04033e-06
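To confirm that the three runs really stopped at the same step, I compare the last energy block written in each md.log roughly like this (a sketch; run3's log sits in the top directory itself):

cd /home/he/C1C2C3/C3_New
for d in run1 run2 .; do
    echo "== $d =="
    grep -A1 "Step.*Time.*Lambda" "$d/md.log" | tail -n 2    # last Step header and its value line
done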
A similar problem was also found for the second simulation system, in /home/he/C1C2C3/PureCO2TraPPE. I use the command mdrun_mpi -deffnm md -pin on -pinoffset 0.
Run1
tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/run1/md.log
           Step           Time         Lambda
       65470000   130940.00000        0.00000

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    2.80567e+04   -1.99950e+03   -2.05836e+05    1.06692e+03   -1.78712e+05
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    2.09106e+04   -1.57801e+05    2.48747e+02   -2.91211e+02    4.97676e+02
   Constr. rmsd
    6.64528e-07
Run2
tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/md.log
           Step           Time         Lambda
       65470000   130940.00000        0.00000

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    2.78408e+04   -2.00737e+03   -2.05654e+05    1.07244e+03   -1.78748e+05
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    2.09921e+04   -1.57756e+05    2.49716e+02   -2.93505e+02    5.29104e+02
   Constr. rmsd
    7.43034e-07
Hi GMX users, has anybody encountered such a problem with GPU GROMACS? Please give me some advice on how to solve it. Thanks!
Best regards,
Zhongjin HE