[gmx-users] nvidia tesla p100

Mark Abraham mark.j.abraham at gmail.com
Mon Oct 31 22:36:14 CET 2016


Hi,

On Mon, Oct 31, 2016 at 10:09 PM Irem Altan <irem.altan at duke.edu> wrote:

> Hi,
>
>
> So that would be 32 total cores, which with hyperthreading might be 64
> threads?
>
> Yes, but hyperthreading is probably off, as it detects 32 logical cores.
>

OK, but do check your docs. Hyperthreading is probably useful for GROMACS,
but YMMV and we've not tested such setups extensively yet.
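If you'd rather check the node itself than the docs, something like this
(assuming lscpu is available there, which it usually is) will tell you:

# If "Thread(s) per core" reports 2, hyperthreading is on;
# if it reports 1, only the physical cores are exposed.
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'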


> Ignore this, it reports something meaningful, but that thing is not what it
> says. It's been removed in 2016 until someone works out a good way to say
> something useful. But it probably means you're getting the layout I
> suggested you should prefer.
>
> Oh, ok, so this is not a problem and Gromacs is in fact utilizing all the
> cores?
>

Well, GROMACS can't have been using all the cores if you were running 2
ranks with two threads each, since that occupies only 4 of the 32 logical
cores. But that error message is based on data that is actually reporting
something else entirely.

Running on 1 node with total 32 logical cores, 2 compatible GPUs
> Hardware detected on host gpu047.pvt.bridges.psc.edu (the node of MPI
> rank 0):
>  CPU info:
>    Vendor: GenuineIntel
>    Brand:  Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
>    SIMD instructions most likely to fit this hardware: AVX2_256
>    SIMD instructions selected at GROMACS compile time: AVX2_256
>  GPU info:
>    Number of GPUs detected: 2
>    #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
>    #1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> compatible
>
> Reading file npt.tpr, VERSION 5.1.2 (single precision)
> Changing nstlist from 20 to 40, rlist from 1.017 to 1.073
>
> Using 2 MPI processes
> Using 2 OpenMP threads per MPI process
>
>
> This can't be from mpirun -np $SLURM_NPROCS gmx_mpi mdrun with the value
> 32. (Unless you're actually running a multi-simulation and we don't know
> about that).
>
> Should it have said “Using 32 MPI processes”?
>

It usually just reports the value that was given to mpirun -np, except in
the case I mentioned.

What is a multi-simulation?


mdrun -multi lets you run multiple .tpr files at the same time, which can
help maximize throughput (or permit fancy algorithms that leverage the
fact). Never mind!
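For completeness, a minimal sketch of what that looks like (if I remember
the naming right, -multi appends the per-simulation index to file names, so
this expects npt0.tpr through npt3.tpr):

# Four independent simulations in one job, two MPI ranks each
mpirun -np 8 gmx_mpi mdrun -multi 4 -deffnm npt -v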

I’m running a single NPT simulation to prepare for umbrella sampling. The
> following is my submission script, maybe it will provide more insight:
>
> #!/bin/bash
> #SBATCH -N 1 --tasks-per-node=2
> #SBATCH -t 00:30:00
> #SBATCH -p GPU_100-debug --gres=gpu:2
>
> # Setup the module command
> set echo
> set -x
>
> module load gromacs/5.1.2
>
> cd $SLURM_SUBMIT_DIR
> echo "$SLURM_NPROCS=" $SLURM_NPROCS
>
> mpirun -np $SLURM_NPROCS gmx_mpi mdrun -ntomp 2 -v -deffnm npt
>
> Judging by the result of the echo command, it seems to detect two
> processors:
>
> + echo 2= 2
> 2= 2
>
> I mean, it does have two CPUs, with 16 cores each, so maybe that’s the
> problem?
>

There's no detection in any of this. You chose a single node and two tasks
per node, so you're getting exactly what you asked for. It's just probably
not a good thing to ask for.
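For example (untested here, but --cpus-per-task and SLURM_CPUS_PER_TASK are
standard SLURM), a request along these lines would match the layout I
suggested, one rank per GPU with the remaining cores used for OpenMP:

# Ask for 2 tasks (one per GPU), each with 16 cores for OpenMP threads
#SBATCH -N 1 --tasks-per-node=2 --cpus-per-task=16

# Then launch 2 ranks with 16 OpenMP threads each, using all 32 cores
mpirun -np $SLURM_NPROCS gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -v -deffnm npt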

To check, I’ve tried the following command:
>
> mpirun -np 32 gmx_mpi mdrun -ntomp 2 -v -deffnm npt
>
> which results in:
>
> Using 32 MPI processes
> Using 2 OpenMP threads per MPI process
>
>
> WARNING: Oversubscribing the available 32 logical CPU cores with 64
> threads.
>          This will cause considerable performance loss!
>

Yep. Now you're time-slicing two software threads onto each hardware
thread, which is much worse than giving each thread its own core.

On host gpu047.pvt.bridges.psc.edu 2
> compatible GPUs are present, with IDs 0,1
> On host gpu047.pvt.bridges.psc.edu 2
> GPUs auto-selected for this run.
> Mapping of GPU IDs to the 32 PP ranks in this node:
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
>
>
> With this setup the simulation slows down by 10-fold.
>
> If, instead, I do
>
> mpirun -np 32 gmx_mpi mdrun -ntomp 1 -v -deffnm npt
>
> I get:
>
> Using 32 MPI processes
> Using 1 OpenMP thread per MPI process
>
On host gpu047.pvt.bridges.psc.edu 2
> compatible GPUs are present, with IDs 0,1
> On host gpu047.pvt.bridges.psc.edu 2
> GPUs auto-selected for this run.
> Mapping of GPU IDs to the 32 PP ranks in this node:
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
>
>
> NOTE: Your choice of number of MPI ranks and amount of resources results
> in using 1 OpenMP threads per rank, which is most likely inefficient. The
> optimum is usually between 2 and 6 threads per rank.
>
> and the simulation slows down by ~9-fold instead.
>

So that's probably related to what that error message is actually
reporting, which is the range of hardware cores on which each thread is
allowed to run. See the background at
http://manual.gromacs.org/documentation/2016.1/user-guide/mdrun-performance.html.
If threads are allowed to move all over the place, then the memory caches
get thrashed. Since MPI libraries tend to set affinity masks in response to
job schedulers and users, mdrun respects such masks by default when they
are set. A quick test is

mpirun -np 32 gmx_mpi mdrun -ntomp 1 -v -deffnm npt -pin on

which directs mdrun to do something we think is good, rather than you
having to work out how to do it with SLURM+MPI. That should be a dramatic
improvement, but the hint about using fewer ranks to get more threads per
rank is probably better still.
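For instance (just a sketch, not something we've benchmarked on your
machine), eight ranks of four threads each, sharing the two GPUs, would be

# 8 PP ranks x 4 OpenMP threads = 32 cores; the -gpu_id string maps the
# first four ranks on the node to GPU 0 and the rest to GPU 1
mpirun -np 8 gmx_mpi mdrun -ntomp 4 -gpu_id 00001111 -pin on -v -deffnm npt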

Mark


> Best,
> Irem

