[gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs

Szilárd Páll szilard.pall at cbr.su.se
Thu Apr 25 16:59:53 CEST 2013


Hi,

You should really check out the documentation on how to use mdrun 4.6:
http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Running_simulations

Brief summary: when running on GPUs, every domain is assigned to a set
of CPU cores and a GPU, hence you need to start as many PP MPI ranks
per node as there are GPUs (or pass a PP-GPU mapping manually with -gpu_id).
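
Concretely, with 3 GPUs per node that means 3 PP ranks per node. As a
minimal sketch (assuming the MPI-enabled binary is called mdrun_mpi, as
in your log), a 2-node launch following this rule could look like:

# 2 nodes x 3 GPUs: 6 PP MPI ranks in total, i.e. 3 per node, 5 OpenMP threads each
# -gpu_id 012 just spells out the default per-node rank-to-GPU mapping
mpirun -np 6 mdrun_mpi -ntomp 5 -gpu_id 012 -deffnm md3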


Now, there are some slight complications due to the inconvenient
hardware setup of the machines you are using. When the number of cores
is not divisible by the number of GPUs, you end up wasting cores; in
your case only 3*5=15 of the 16 cores per compute node will be used.
To make things even worse, unless you use "-pin on" (which is the
default behavior *only* if you use all cores in a node), mdrun will not
lock threads to cores and will let the OS move them around, which can
cause severe performance degradation.
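
In other words, with the 3x5 layout you want to request pinning
explicitly; a sketch for a single node (same assumption about the
mdrun_mpi binary name as above):

# 3 PP ranks x 5 OpenMP threads = 15 of 16 cores, so pinning is NOT on by default
mpirun -np 3 mdrun_mpi -ntomp 5 -pin on -deffnm md3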

However, you can actually work around these issues and get good
performance by using separate PME ranks. You can try using 3 PP +
1 PME rank per compute node with four OpenMP threads each, i.e.:
mpirun -np 4*Nnodes mdrun_mpi -npme Nnodes -ntomp 4
where Nnodes is the number of compute nodes (note that -npme takes the
total number of separate PME ranks, hence one per node here). If you
are lucky with the PP/PME load this should work well, and even if you
get some PP-PME imbalance, it should hurt performance far less than the
inconvenient 3x5-thread setup.
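
A sketch of that layout for, say, two of your nodes (the exact -np and
-npme values simply scale with the node count):

# 2 nodes: (3 PP + 1 PME) ranks per node = 8 MPI ranks, 4 OpenMP threads each,
# which uses all 16 cores of every node, so pinning should be on by default;
# being explicit about it does not hurt
mpirun -np 8 mdrun_mpi -npme 2 -ntomp 4 -pin on -deffnm md3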

Cheers,
--
Szilárd


On Wed, Apr 24, 2013 at 7:08 PM, Christopher Neale
<chris.neale at mail.utoronto.ca> wrote:
> Dear Users:
>
> I am having trouble getting any speedup by using more than one node,
> where each node has two 8-core CPUs and 3 GPUs. I am using GROMACS 4.6.1.
>
> I saw this post, indicating that the .log file output about number of gpus used might not be accurate:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> Still, I'm getting 21.2 ns/day on 1 node, 21.2 ns/day on 2 nodes, and 20.5 ns/day on 3 nodes.
> Somehow I think I have not configured the mpirun -np and mdrun -ntomp correctly
> (although I have tried numerous combinations).
>
> On 1 node, I can just run mdrun without mpirun like this:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> For that run, the top of the .log file is:
> Log file opened on Wed Apr 24 11:36:53 2013
> Host: kfs179  pid: 59561  nodeid: 0  nnodes:  1
> Gromacs version:    VERSION 4.6.1
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled
> GPU support:        enabled
> invsqrt routine:    gmx_software_invsqrt(x)
> CPU acceleration:   AVX_256
> FFT library:        fftw-3.3.3-sse2
> Large file support: enabled
> RDTSCP usage:       enabled
> Built on:           Tue Apr 23 12:59:48 EDT 2013
> Built by:           cneale at kfslogin2.nics.utk.edu [CMAKE]
> Build OS/arch:      Linux 2.6.32-220.4.1.el6.x86_64 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Build CPU family:   6   Model: 45   Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler:         /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc Intel icc (ICC) 12.1.5 20120612
> C compiler flags:   -mavx   -std=gnu99 -Wall   -ip -funroll-all-loops  -O3 -DNDEBUG
> C++ compiler:       /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc Intel icpc (ICC) 12.1.5 20120612
> C++ compiler flags: -mavx   -Wall   -ip -funroll-all-loops  -O3 -DNDEBUG
> CUDA compiler:      nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2012 NVIDIA Corporation;Built on Thu_Apr__5_00:24:31_PDT_2012;Cuda compilation tools, release 4.2, V0.2.1221
> CUDA driver:        5.0
> CUDA runtime:       4.20
> ...
> <snip>
> ...
> Initializing Domain Decomposition on 3 nodes
> Dynamic load balancing: yes
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>     two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
>   multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
> Minimum cell size due to bonded interactions: 0.475 nm
> Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
> Estimated maximum distance required for P-LINCS: 1.175 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME nodes, per user request
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 3 cells with a minimum initial size of 1.469 nm
> The maximum allowed number of cells is: X 5 Y 5 Z 6
> Domain decomposition grid 3 x 1 x 1, separate PME nodes 0
> PME domain decomposition: 3 x 1 x 1
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using 3 MPI threads
> Using 5 OpenMP threads per tMPI thread
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family:  6  Model: 45  Stepping:  7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
> 3 GPUs auto-selected for this run: #0, #1, #2
>
> Will do PME sum in reciprocal space.
> ...
> <snip>
> ...
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
> -----------------------------------------------------------------------------
>  Domain decomp.         3    5       4380      23.714      922.574     6.7
>  DD comm. load          3    5       4379       0.054        2.114     0.0
>  DD comm. bounds        3    5       4381       0.056        2.193     0.0
>  Neighbor search        3    5       4380      11.325      440.581     3.2
>  Launch GPU ops.        3    5      87582       3.970      154.455     1.1
>  Comm. coord.           3    5      39411       2.522       98.132     0.7
>  Force                  3    5      43791      55.351     2153.409    15.5
>  Wait + Comm. F         3    5      43791       2.800      108.930     0.8
>  PME mesh               3    5      43791      97.377     3788.427    27.3
>  Wait GPU nonlocal      3    5      43791       0.027        1.046     0.0
>  Wait GPU local         3    5      43791       0.009        0.347     0.0
>  NB X/F buffer ops.     3    5     166404       3.426      133.276     1.0
>  Write traj.            3    5          2       0.028        1.087     0.0
>  Update                 3    5      43791      73.140     2845.491    20.5
>  Constraints            3    5      87582      65.339     2541.981    18.3
>  Comm. energies         3    5       4380       0.102        3.955     0.0
>  Rest                   3                      17.332      674.286     4.9
> -----------------------------------------------------------------------------
>  Total                  3                     356.572    13872.284   100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
>  PME redist. X/F        3    5      87582      10.668      415.017     3.0
>  PME spread/gather      3    5      87582      44.767     1741.641    12.6
>  PME 3D-FFT             3    5      87582      26.979     1049.617     7.6
>  PME 3D-FFT Comm.       3    5      87582      11.085      431.273     3.1
>  PME solve              3    5      43791       3.705      144.139     1.0
> -----------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:     5341.770      356.572     1498.1
>                  (ns/day)    (hour/ns)
> Performance:       21.222        1.131
> Finished mdrun on node 0 Wed Apr 24 11:42:50 2013
>
>
>
> ###########################################################################################
>
> For my MPI run, I ran on a single node like this:
>
> mpirun -np 1 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi -notunepme -deffnm md3 -dlb yes -npme -1 -cpt 60 -maxh 0.1 -cpi md3.cpt
>
> And the top of the log is the same, except:
> MPI library:        MPI
> ...
> <snip>
> ...
> Using 1 MPI process
> Using 16 OpenMP threads
> ...
>
> To run on 2 nodes, I got errors if I did not specify mpirun -np:
>
> Using 24 MPI processes
> Using 1 OpenMP thread per MPI process
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family:  6  Model: 45  Stepping:  7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected on host kfs179:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.1
> Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
> mdrun_mpi was started with 12 PP MPI processes per node, but only 3 GPUs were detected.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Thanx for Using GROMACS - Have a Nice Day
>
>
> #######################
>
> So I tried lots of different mpirun -np options, but only -np 2 and -np 3 worked; i.e., it worked when gromacs did:
>
> Using 2 MPI processes
> Using 8 OpenMP threads per MPI process
>
> or
>
> Using 3 MPI processes
> Using 5 OpenMP threads per MPI process
>
> but -np 4, 6, and 32 all failed.
>
> For example, when I use mpirun -np 32, I get
>
> Using 32 MPI processes
> Using 1 OpenMP thread per MPI process
>
> WARNING: On node 0: oversubscribing the available 16 logical CPU cores per node with 20 MPI processes.
>          This will cause considerable performance loss!
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family:  6  Model: 45  Stepping:  7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected on host kfs179:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.1
> Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
> mdrun_mpi was started with 20 PP MPI processes per node, but only 3 GPUs were detected.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
>
> ###########
>
> All of this makes me think that only 1 node is being picked up. I suppose it is possibly my fault with the
> submission, etc., since this is a new cluster for me, but my PBS script asks for 2 nodes and showq reports that
> 2 nodes were allocated when the job is running.
>
> #PBS -l walltime=00:10:00,nodes=2:ppn=12:gpus=3:shared
>
> $ showq |grep cneale
> 288686               cneale    Running    32    00:09:53  Wed Apr 24 12:42:10
>
>
> Thank you,
> Chris.


