[gmx-users] nvidia tesla p100

Irem Altan irem.altan at duke.edu
Mon Oct 31 22:09:38 CET 2016


Hi,


So that would be 32 total cores, which with hyperthreading might be 64
threads?

Yes, but hyperthreading is probably off, as it detects 32 logical cores.
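
(If it helps, I could double-check whether hyperthreading is really off by running something like

lscpu | grep -E 'Thread|Core|Socket'

on the node; that should show the threads per core, cores per socket, and socket count.)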

Ignore this, it reports something meaningful, but that thing is not what it
says. It's been removed in 2016 until someone works out a good way to say
something useful. But it probably means you're getting the layout I
suggested you should prefer.

Oh, OK, so this is not a problem and GROMACS is in fact utilizing all the cores?



Running on 1 node with total 32 logical cores, 2 compatible GPUs
Hardware detected on host gpu047.pvt.bridges.psc.edu (the node of MPI
rank 0):
 CPU info:
   Vendor: GenuineIntel
   Brand:  Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
   SIMD instructions most likely to fit this hardware: AVX2_256
   SIMD instructions selected at GROMACS compile time: AVX2_256
 GPU info:
   Number of GPUs detected: 2
   #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible
   #1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
compatible

Reading file npt.tpr, VERSION 5.1.2 (single precision)
Changing nstlist from 20 to 40, rlist from 1.017 to 1.073

Using 2 MPI processes
Using 2 OpenMP threads per MPI process


This can't be from mpirun -np $SLURM_NPROCS gmx_mpi mdrun with the value
32. (Unless you're actually running a multi-simulation and we don't know
about that).

Should it have said “Using 32 MPI processes”?

What is a multi-simulation? I’m running a single NPT simulation to prepare for umbrella sampling. The following is my submission script; maybe it will provide more insight:

#!/bin/bash
#SBATCH -N 1 --tasks-per-node=2
#SBATCH -t 00:30:00
#SBATCH -p GPU_100-debug --gres=gpu:2

# Setup the module command
set echo
set -x

module load gromacs/5.1.2

cd $SLURM_SUBMIT_DIR
echo "$SLURM_NPROCS=" $SLURM_NPROCS

mpirun -np $SLURM_NPROCS gmx_mpi mdrun -ntomp 2 -v -deffnm npt

Judging by the result of the echo command, it seems to detect two processors:

+ echo 2= 2
2= 2

I mean, the node does have two CPUs with 16 cores each, so maybe that’s the problem?
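
(If I understand the SLURM variables correctly, SLURM_NPROCS is just the total number of tasks requested, i.e. nodes x tasks-per-node = 1 x 2 = 2 here, so it reflects my request rather than the number of cores on the node; asking for, say, --tasks-per-node=8 should presumably give SLURM_NPROCS=8.)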

To check, I’ve tried the following command:

mpirun -np 32 gmx_mpi mdrun -ntomp 2 -v -deffnm npt

which results in:

Using 32 MPI processes
Using 2 OpenMP threads per MPI process


WARNING: Oversubscribing the available 32 logical CPU cores with 64 threads.
         This will cause considerable performance loss!
On host gpu047.pvt.bridges.psc.edu 2 compatible GPUs are present, with IDs 0,1
On host gpu047.pvt.bridges.psc.edu 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 32 PP ranks in this node: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


With this setup the simulation slows down about 10-fold.
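
(If I understand the warning correctly, this is simply because 32 MPI ranks x 2 OpenMP threads each = 64 threads competing for the 32 logical cores.)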

If, instead, I do

mpirun -np 32 gmx_mpi mdrun -ntomp 1 -v -deffnm npt

I get:

Using 32 MPI processes
Using 1 OpenMP thread per MPI process

On host gpu047.pvt.bridges.psc.edu 2 compatible GPUs are present, with IDs 0,1
On host gpu047.pvt.bridges.psc.edu 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 32 PP ranks in this node: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


NOTE: Your choice of number of MPI ranks and amount of resources results in using 1 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.

and the simulation slows down by ~9-fold instead.
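
Based on that note, would something along these lines be the right way to use all 32 cores with the two GPUs? This is just a guess on my part, assuming 8 ranks with 4 OpenMP threads each is a sensible split and that I am reading the -gpu_id option correctly:

#SBATCH -N 1 --tasks-per-node=8
# (rest of the script as before)
mpirun -np $SLURM_NPROCS gmx_mpi mdrun -ntomp 4 -gpu_id 00001111 -v -deffnm npt

That would give 8 ranks x 4 OpenMP threads = 32 threads, with the first four PP ranks sharing GPU 0 and the last four sharing GPU 1.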

Best,
Irem

