[gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
Berk Hess
gmx3 at hotmail.com
Thu Apr 25 10:02:52 CEST 2013
Hi,
You're using thread-MPI, which only runs within a single node; to use multiple nodes you need to compile with a real MPI library. Then start as many MPI processes in total as the total number of GPUs.
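For example, with 2 nodes that each have 16 cores and 3 GPUs, that means 6 MPI ranks in total, 3 per node, each with its own GPU and roughly 5 OpenMP threads. A rough sketch only, untested here; -npernode is Open MPI syntax (other MPI libraries use e.g. -ppn), and -gpu_id should not even be needed when the rank count per node matches the GPU count:

cmake .. -DGMX_MPI=ON -DGMX_GPU=ON    # build mdrun_mpi against a real MPI library
mpirun -np 6 -npernode 3 mdrun_mpi -ntomp 5 -gpu_id 012 -deffnm md3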
Cheers,
Berk
> From: chris.neale at mail.utoronto.ca
> To: gmx-users at gromacs.org
> Date: Wed, 24 Apr 2013 17:08:28 +0000
> Subject: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
>
> Dear Users:
>
> I am having trouble getting any speedup by using more than one node,
> where each node has two 8-core CPUs and 3 GPUs. I am using GROMACS 4.6.1.
>
> I saw this post, indicating that the .log file output about the number of GPUs used might not be accurate:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> Still, I'm getting 21.2 ns/day on 1 node, 21.2 ns/day on 2 nodes, and 20.5 ns/day on 3 nodes.
> I suspect that I have not configured the mpirun -np and mdrun -ntomp options correctly
> (although I have tried numerous combinations).
>
> On 1 node, I can just run mdrun without mpirun like this:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> For that run, the top of the .log file is:
> Log file opened on Wed Apr 24 11:36:53 2013
> Host: kfs179 pid: 59561 nodeid: 0 nnodes: 1
> Gromacs version: VERSION 4.6.1
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled
> GPU support: enabled
> invsqrt routine: gmx_software_invsqrt(x)
> CPU acceleration: AVX_256
> FFT library: fftw-3.3.3-sse2
> Large file support: enabled
> RDTSCP usage: enabled
> Built on: Tue Apr 23 12:59:48 EDT 2013
> Built by: cneale at kfslogin2.nics.utk.edu [CMAKE]
> Build OS/arch: Linux 2.6.32-220.4.1.el6.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Build CPU family: 6 Model: 45 Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc Intel icc (ICC) 12.1.5 20120612
> C compiler flags: -mavx -std=gnu99 -Wall -ip -funroll-all-loops -O3 -DNDEBUG
> C++ compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc Intel icpc (ICC) 12.1.5 20120612
> C++ compiler flags: -mavx -Wall -ip -funroll-all-loops -O3 -DNDEBUG
> CUDA compiler: nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2012 NVIDIA Corporation;Built on Thu_Apr__5_00:24:31_PDT_2012;Cuda compilation tools, release 4.2, V0.2.1221
> CUDA driver: 5.0
> CUDA runtime: 4.20
> ...
> <snip>
> ...
> Initializing Domain Decomposition on 3 nodes
> Dynamic load balancing: yes
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
> two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
> multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
> Minimum cell size due to bonded interactions: 0.475 nm
> Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
> Estimated maximum distance required for P-LINCS: 1.175 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME nodes, per user request
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 3 cells with a minimum initial size of 1.469 nm
> The maximum allowed number of cells is: X 5 Y 5 Z 6
> Domain decomposition grid 3 x 1 x 1, separate PME nodes 0
> PME domain decomposition: 3 x 1 x 1
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using 3 MPI threads
> Using 5 OpenMP threads per tMPI thread
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family: 6 Model: 45 Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected:
> #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
> 3 GPUs auto-selected for this run: #0, #1, #2
>
> Will do PME sum in reciprocal space.
> ...
> <snip>
> ...
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
> -----------------------------------------------------------------------------
> Domain decomp. 3 5 4380 23.714 922.574 6.7
> DD comm. load 3 5 4379 0.054 2.114 0.0
> DD comm. bounds 3 5 4381 0.056 2.193 0.0
> Neighbor search 3 5 4380 11.325 440.581 3.2
> Launch GPU ops. 3 5 87582 3.970 154.455 1.1
> Comm. coord. 3 5 39411 2.522 98.132 0.7
> Force 3 5 43791 55.351 2153.409 15.5
> Wait + Comm. F 3 5 43791 2.800 108.930 0.8
> PME mesh 3 5 43791 97.377 3788.427 27.3
> Wait GPU nonlocal 3 5 43791 0.027 1.046 0.0
> Wait GPU local 3 5 43791 0.009 0.347 0.0
> NB X/F buffer ops. 3 5 166404 3.426 133.276 1.0
> Write traj. 3 5 2 0.028 1.087 0.0
> Update 3 5 43791 73.140 2845.491 20.5
> Constraints 3 5 87582 65.339 2541.981 18.3
> Comm. energies 3 5 4380 0.102 3.955 0.0
> Rest 3 17.332 674.286 4.9
> -----------------------------------------------------------------------------
> Total 3 356.572 13872.284 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> PME redist. X/F 3 5 87582 10.668 415.017 3.0
> PME spread/gather 3 5 87582 44.767 1741.641 12.6
> PME 3D-FFT 3 5 87582 26.979 1049.617 7.6
> PME 3D-FFT Comm. 3 5 87582 11.085 431.273 3.1
> PME solve 3 5 43791 3.705 144.139 1.0
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 5341.770 356.572 1498.1
> (ns/day) (hour/ns)
> Performance: 21.222 1.131
> Finished mdrun on node 0 Wed Apr 24 11:42:50 2013
>
>
>
> ###########################################################################################
>
> For my MPI run, I ran on a single node like this:
>
> mpirun -np 1 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi -notunepme -deffnm md3 -dlb yes -npme -1 -cpt 60 -maxh 0.1 -cpi md3.cpt
>
> And the top of the log is the same, except:
> MPI library: MPI
> ...
> <snip>
> ...
> Using 1 MPI process
> Using 16 OpenMP threads
> ...
>
> When I tried to run on 2 nodes, I got errors if I did not specify mpirun -np:
>
> Using 24 MPI processes
> Using 1 OpenMP thread per MPI process
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family: 6 Model: 45 Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected on host kfs179:
> #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.1
> Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
> mdrun_mpi was started with 12 PP MPI processes per node, but only 3 GPUs were detected.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Thanx for Using GROMACS - Have a Nice Day
>
>
> #######################
>
> So I tried lots of different mpirun -np options, but only -np 2 and -np 3 worked; i.e., it worked when GROMACS reported:
>
> Using 2 MPI processes
> Using 8 OpenMP threads per MPI process
>
> or
>
> Using 3 MPI processes
> Using 5 OpenMP threads per MPI process
>
> but -np 4, 6, and 32 all failed.
>
> For example, when I used mpirun -np 32, I got:
>
> Using 32 MPI processes
> Using 1 OpenMP thread per MPI process
>
> WARNING: On node 0: oversubscribing the available 16 logical CPU cores per node with 20 MPI processes.
> This will cause considerable performance loss!
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family: 6 Model: 45 Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected on host kfs179:
> #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.1
> Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
> mdrun_mpi was started with 20 PP MPI processes per node, but only 3 GPUs were detected.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
>
> ###########
>
> All of this makes me think that only 1 node is being used. I suppose it could be my fault with the
> job submission, since this cluster is new to me, but my PBS script asks for 2 nodes and showq
> reports that 2 nodes were allocated while the job is running.
>
> #PBS -l walltime=00:10:00,nodes=2:ppn=12:gpus=3:shared
>
> $ showq |grep cneale
> 288686 cneale Running 32 00:09:53 Wed Apr 24 12:42:10
>
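> For reference, a minimal sketch of the kind of job-script body this allocation suggests (hypothetical;
> -npernode is Open MPI syntax, and the OpenMP thread count per rank would need tuning to the cores
> PBS actually hands out):
>
> #PBS -l walltime=00:10:00,nodes=2:ppn=12:gpus=3:shared
> cd $PBS_O_WORKDIR
> # one PP rank per GPU: 3 ranks per node, 6 in total, 4 OpenMP threads each (12 slots / 3 ranks)
> mpirun -np 6 -npernode 3 mdrun_mpi -ntomp 4 -deffnm md3 -dlb yes -npme -1 -cpt 60 -maxh 0.1 -cpi md3.cpt
>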
>
> Thank you,
> Chris.