[gmx-users] Running gmx-4.6.x over multiple homogeneous nodes with GPU acceleration

Mark Abraham mark.j.abraham at gmail.com
Wed Jun 5 15:33:56 CEST 2013


On Wed, Jun 5, 2013 at 2:53 PM, João Henriques <
joao.henriques.32353 at gmail.com> wrote:

> Sorry to keep bugging you guys, but even after considering all you
> suggested and reading the bugzilla thread Mark pointed out, I'm still
> unable to make the simulation run over multiple nodes.
> *Here is a template of a simple submission over 2 nodes:*
>
> --- START ---
> #!/bin/sh
> #
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> #
> # Job name
> #SBATCH -J md
> #
> # No. of nodes and no. of processors per node
> #SBATCH -N 2
> #SBATCH --exclusive
> #
> # Time needed to complete the job
> #SBATCH -t 48:00:00
> #
> # Add modules
> module load gcc/4.6.3
> module load openmpi/1.6.3/gcc/4.6.3
> module load cuda/5.0
> module load gromacs/4.6
> #
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> #
> grompp -f md.mdp -c npt.gro -t npt.cpt -p topol -o md.tpr
> mpirun -np 4 mdrun_mpi -gpu_id 01 -deffnm md -v
> #
> # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> --- END ---
>
> *Here is an extract of the md.log:*
>
> --- START ---
> Using 4 MPI processes
> Using 4 OpenMP threads per MPI process
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> Family:  6  Model: 45  Stepping:  7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
> pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 2 GPUs detected on host en001:
>   #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6
> Source code file:
> /lunarc/sw/erik/src/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c,
> line: 322
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and
> GPUs per node.
>

"per node" is critical here.


> mdrun_mpi was started with 4 PP MPI processes per node, but you provided 2
> GPUs.
>

...and here. As far as mdrun_mpi can tell from the MPI system, there are
only MPI ranks on this one node.
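
(A quick way to check where the launcher is actually putting your ranks,
independent of GROMACS, is to run something trivial first, e.g.

mpirun -np 4 hostname

If all four lines print the same hostname, the MPI side is placing every
rank on the first node, and no mdrun option will change that.)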

> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
> --- END ---
>
> As you can see, gmx is having trouble understanding that there is a second
> node available. Note that since I did not specify -ntomp, it assigned 4
> threads to each of the 4 MPI processes (filling all 16 available CPUs *on
> one node*).
> For the exact same submission, if I do set "-ntomp 8" (since I have 4 MPI
> procs * 8 OpenMP threads = 32 CPUs total on the 2 nodes), I get a warning
> telling me that I'm hyperthreading, which can only mean that *gmx is
> assigning all processes to the first node once again.*
> Am I doing something wrong, or is there some problem with gmx-4.6? I guess
> it can only be my fault, since I've never seen anyone else complaining
> about the same issue here.
>

Assigning MPI processes to nodes is a matter of configuring your MPI
launcher. GROMACS just follows the process placement it gets from MPI -
hence the oversubscription. If you assign two MPI processes to each node
(one per GPU), then things should work.
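
For example, with SLURM and Open MPI, something along these lines should
place two PP ranks on each node, one per K20m, with 8 OpenMP threads each
(a minimal sketch - the exact options depend on your scheduler and MPI
setup, so check your site's documentation):

#SBATCH -N 2
#SBATCH --ntasks-per-node=2   # 2 MPI ranks per node, one per GPU
...
mpirun -np 4 -npernode 2 mdrun_mpi -gpu_id 01 -ntomp 8 -deffnm md -v

With that layout, each node runs 2 PP ranks and has 2 compatible GPUs, so
the -gpu_id 01 mapping matches what mdrun expects.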

Mark


