[gmx-users] Two strange problems when using GPU on multiple nodes

Tue Jun 30 22:38:29 CEST 2015

Dear all,
The cluster I am currently using has 24 CPU cores and 4 GPUs per node, and I am trying to run my simulation on 2 nodes. My sbatch script runs without errors, but I have found two strange problems that confuse me. Here is my sbatch script:

#!/bin/bash
#SBATCH --job-name="zangtw"
#SBATCH --output="zangtw.%j.%N.out"
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --export=ALL
#SBATCH -t 47:30:00

ibrun --npernode 4 mdrun -ntomp 6 -s run.tpr -pin on -gpu_id 0123

and here are the descriptions of my problems:
1. My simulation stalls after 10+ hours every time. Specifically, the job is still listed as “running” in the queue, but md.log and traj.trr stop updating. Could this be caused by running out of memory? I have never seen this with pure MPI, pure OpenMP, or even hybrid MPI/OpenMP (without GPU) jobs.

2. I am trying to use 8 GPUs, but only 4 GPUs are listed in my log file (as follows):

Using 8 MPI processes
Using 6 OpenMP threads per MPI process
4 GPUs detected on host comet-30-06.sdsc.edu:
#0: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
#1: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
#2: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
#3: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible

Does this mean that only the 4 GPUs on the first node are used while the 4 GPUs on the second node sit idle, or are all 8 GPUs in use but only 4 of them shown in the log file? In the second case, how can I get the information for the other 4 GPUs?
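
For the second problem, one check I could run inside the same allocation is sketched below. It only tells me which GPUs each node can see, not whether mdrun actually uses them. It assumes nvidia-smi is on the PATH of the compute nodes; count_gpus is a helper name I made up, not part of GROMACS or Slurm.

```shell
#!/bin/bash
# count_gpus: hypothetical helper. Counts "GPU N: ..." lines read from stdin,
# which is the line format printed by `nvidia-smi -L`.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Report this node only; one copy has to run on each node of the allocation.
# Assumption: nvidia-smi is installed on the compute nodes.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "$(hostname): $(nvidia-smi -L | count_gpus) GPUs visible"
else
    echo "$(hostname): nvidia-smi not found" >&2
fi
```

Saved under some name such as gpu_check.sh (hypothetical), this would have to be started once per node from within the job, for example with `srun --ntasks-per-node=1 ./gpu_check.sh` if srun is usable on this cluster; I normally launch with ibrun, so the exact launcher line is a guess.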

Thank you guys so much!

Regards,
Mark 
