[gmx-users] gpu cluster explanation
Richard Broadbent
richard.broadbent09 at imperial.ac.uk
Fri Jul 12 17:41:27 CEST 2013
On 12/07/13 13:26, Francesco wrote:
> Hi all,
> I'm working with a 200K-atom system (protein + explicit water) and
> after a while using a CPU cluster I had to switch to a GPU cluster.
> I read both the "Acceleration and parallelization" and the GROMACS-GPU
> documentation pages
> (http://www.gromacs.org/Documentation/Acceleration_and_parallelization
> and
> http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
> but it's a bit confusing and I need help to check whether I have
> understood it correctly. :)
> I have 2 types of nodes:
> 3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon
> E5649 @ 2.53 GHz)
> 8 GPUs and 2 CPUs (6 cores each)
>
> 1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can
> have 3 MPI ranks max.
> 2) because I have 12 cores I can open 4 OpenMP threads per MPI rank,
> because 4 x 3 = 12
>
> Now if I have a node with 8 GPUs, I can only use 4 GPUs:
> 4 MPI ranks with 3 OpenMP threads each.
> Is that right?
> Is it possible to use 8 GPUs and only 8 cores?
You could set -ntomp 0 and set up MPI/thread-MPI so that only 8 cores are
used. However, a system that unbalanced (a huge amount of GPU power against
comparatively little CPU power) is unlikely to get great performance.
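
For example, something along these lines maps ranks to GPUs explicitly (just
a sketch, untested on your machine, and assuming the single-node thread-MPI
build of mdrun 4.6; the .tpr and -deffnm names are taken from your command):

   # 3-GPU node, 12 cores: 3 thread-MPI ranks with 4 OpenMP threads each
   mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v

   # 8-GPU node, using only 8 of the 12 cores: 8 ranks with 1 OpenMP thread each
   mdrun -ntmpi 8 -ntomp 1 -gpu_id 01234567 -dlb yes -s input_50.tpr -deffnm 306s_50 -v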
>
> Using gromacs 4.6.2 and 144 cpu cores I reach 35 ns/day, while with 3
> gpu and 12 cores I get 9-11 ns/day.
>
That slowdown is in line with what I got when I tried a similar CPU-GPU
setup. That said, others might have some advice that will improve your
performance.
> the command that I use is:
> mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
> with the number of GPUs set via the batch script:
> #BSUB -n 3
>
> I also tried to set -npme / -nt / -ntmpi / -ntomp, but nothing changes.
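
If you do pass those options, they need to be consistent with each other and
with the number of GPUs on the node. A minimal LSF script along these lines
would make the rank/thread mapping explicit (again only a sketch: the job
name and output file are made up, and how GPUs are requested is
site-specific, so check your cluster's documentation):

   #!/bin/bash
   #BSUB -n 12                # 12 cores on a 3-GPU node
   #BSUB -J 306s_50           # hypothetical job name
   #BSUB -o 306s_50.%J.out    # hypothetical stdout file
   mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v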
>
> The mdp file and some statistics follow:
>
> -------- START MDP --------
>
> title = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD
>
> ; Run parameters
> integrator             = md          ; Algorithm options
> nsteps                 = 25000000    ; maximum number of steps to perform [50 ns]
> dt                     = 0.002       ; 2 fs = 0.002 ps
>
> ; Output control
> nstxout                = 10000       ; [steps] freq to write coordinates to trajectory, the last coordinates are always written
> nstvout                = 10000       ; [steps] freq to write velocities to trajectory, the last velocities are always written
> nstlog                 = 10000       ; [steps] freq to write energies to log file, the last energies are always written
> nstenergy              = 10000       ; [steps] write energies to disk every nstenergy steps
> nstxtcout              = 10000       ; [steps] freq to write coordinates to xtc trajectory
> xtc_precision          = 1000        ; precision to write to xtc trajectory (1000 = default)
> xtc_grps               = system      ; which coordinate group(s) to write to disk
> energygrps             = system      ; or System / which energy group(s) to write
>
> ; Bond parameters
> continuation           = yes         ; restarting from npt
> constraints            = all-bonds   ; bond types to replace by constraints
> constraint_algorithm   = lincs       ; holonomic constraints
> lincs_iter             = 1           ; accuracy of LINCS
> lincs_order            = 4           ; also related to accuracy
> lincs_warnangle        = 30          ; [degrees] maximum angle that a bond can rotate before LINCS will complain
>
That seems a little loose for the constraints, but setting that up and
checking that it conserves energy and preserves bond lengths is something
you'll have to do yourself.
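
For instance, you could plot the conserved quantity with g_energy and watch
for drift (just a sketch; the exact energy-term name depends on your
settings, and the .edr name is taken from your -deffnm):

   echo "Conserved-En." | g_energy -f 306s_50.edr -o conserved.xvg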
Richard
> ; Neighborsearching
> ns_type                = grid        ; method of updating neighbor list
> cutoff-scheme          = Verlet
> nstlist                = 10          ; [steps] frequency to update neighbor list (10)
> rlist                  = 1.0         ; [nm] cut-off distance for the short-range neighbor list (1 default)
> rcoulomb               = 1.0         ; [nm] long range electrostatic cut-off
> rvdw                   = 1.0         ; [nm] long range Van der Waals cut-off
>
> ; Electrostatics
> coulombtype            = PME         ; treatment of long range electrostatic interactions
> vdwtype                = cut-off     ; treatment of Van der Waals interactions
>
> ; Periodic boundary conditions
> pbc = xyz
>
> ; Dispersion correction
> DispCorr               = EnerPres    ; applying long range dispersion corrections
>
> ; Ewald
> fourierspacing         = 0.12        ; grid spacing for FFT - controls the highest magnitude of wave vectors (0.12)
> pme_order              = 4           ; interpolation order for PME, 4 = cubic
> ewald_rtol             = 1e-5        ; relative strength of Ewald-shifted potential at rcoulomb
>
> ; Temperature coupling
> tcoupl                 = nose-hoover          ; temperature coupling with Nose-Hoover ensemble
> tc_grps                = Protein Non-Protein
> tau_t                  = 0.4     0.4          ; [ps] time constant
> ref_t                  = 310     310          ; [K] reference temperature for coupling (310 K = 37 °C)
>
> ; Pressure coupling
> pcoupl                 = parrinello-rahman
> pcoupltype             = isotropic   ; uniform scaling of box vectors
> tau_p                  = 2.0         ; [ps] time constant
> ref_p                  = 1.0         ; [bar] reference pressure for coupling
> compressibility        = 4.5e-5      ; [bar^-1] isothermal compressibility of water
> refcoord_scaling       = com         ; have a look at GROMACS documentation, chapter 7
>
> ; Velocity generation
> gen_vel                = no          ; generate velocities in grompp according to a Maxwell distribution
>
> -------- END MDP --------
>
> -------- START STATISTICS --------
>
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
>               particle-particle              PME
>               rcoulomb  rlist     grid        spacing   1/beta
>    initial    1.000 nm  1.155 nm  100 128  96 0.120 nm  0.320 nm
>    final      1.201 nm  1.356 nm   96 100  80 0.144 nm  0.385 nm
>    cost-ratio           1.62                   0.62
> (note that these numbers concern only part of the total PP and PME load)
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 54749.0
> av. #atoms communicated per step for LINCS: 2 x 5418.4
>
> Average load imbalance: 12.8 %
> Part of the total run time spent waiting due to load imbalance: 1.4 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 0 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
>  Computing:          Nodes  Th.     Count   Wall t (s)      G-Cycles     %
> -----------------------------------------------------------------------------
>  Domain decomp.          3   4     625000    10388.307    315806.805   2.3
>  DD comm. load           3   4     625000      129.908      3949.232   0.0
>  DD comm. bounds         3   4     625001      267.204      8123.069   0.1
>  Neighbor search         3   4     625001     7756.651    235803.900   1.7
>  Launch GPU ops.         3   4   50000002     3376.764    102654.354   0.8
>  Comm. coord.            3   4   24375000    10651.213    323799.209   2.4
>  Force                   3   4   25000001    35732.146   1086265.102   8.0
>  Wait + Comm. F          3   4   25000001    19866.403    603943.033   4.5
>  PME mesh                3   4   25000001   235964.754   7173380.387  53.0
>  Wait GPU nonlocal       3   4   25000001    12055.970    366504.140   2.7
>  Wait GPU local          3   4   25000001      106.179      3227.866   0.0
>  NB X/F buffer ops.      3   4   98750002    10256.750    311807.459   2.3
>  Write traj.             3   4       2994      249.770      7593.073   0.1
>  Update                  3   4   25000001    33108.852   1006516.379   7.4
>  Constraints             3   4   25000001    51671.482   1570824.423  11.6
>  Comm. energies          3   4    2500001      463.135     14079.404   0.1
>  Rest                    3                   13290.037    404020.040   3.0
> -----------------------------------------------------------------------------
>  Total                   3                  445335.526  13538297.876 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
>  PME redist. X/F         3   4   50000002    40747.165   1238722.760   9.1
>  PME spread/gather       3   4   50000002   122026.128   3709621.109  27.4
>  PME 3D-FFT              3   4   50000002    46613.023   1417046.140  10.5
>  PME 3D-FFT Comm.        3   4   50000002    20934.134    636402.285   4.7
>  PME solve               3   4   25000001     5465.690    166158.163   1.2
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 5317976.200 445335.526 1194.2
> 5d03h42:15
> (ns/day) (hour/ns)
> Performance: 9.701 2.474
>
> -------- END STATISTICS --------
>
> thank you very much for the help.
> cheers,
> Fra
>