[gmx-users] gpu cluster explanation
Francesco
fracarb at myopera.com
Fri Jul 12 14:26:14 CEST 2013
Hi all,
I'm working with a 200K-atom system (protein + explicit water) and, after a while on a CPU cluster, I had to switch to a GPU cluster.
I read both the "Acceleration and parallelization" and the GROMACS-GPU documentation pages
(http://www.gromacs.org/Documentation/Acceleration_and_parallelization
and
http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
but they are a bit confusing, and I would like to check whether I have understood them correctly. :)
I have 2 types of nodes:
- 3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon E5649 @ 2.53 GHz)
- 8 GPUs and 2 CPUs (6 cores each)
1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can have at most 3 MPI ranks.
2) Because I have 12 cores, I can open 4 OpenMP threads per MPI rank, since 4 x 3 = 12.
Now, if I have a node with 8 GPUs, I can only use 4 of them: 4 MPI ranks with 3 OpenMP threads each.
Is that right?
Is it also possible to use 8 GPUs and only 8 cores? (A possible launch line for both cases is sketched below.)
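If I have understood the documentation, the explicit launch lines for the two cases would look roughly like this (assuming a thread-MPI build of GROMACS 4.6, so the -ntmpi/-ntomp/-gpu_id flags; please correct me if the GPU-id mapping is wrong):

# 3-GPU node, 12 cores: 3 thread-MPI ranks, 4 OpenMP threads each, one GPU per rank
mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -s input_50.tpr -deffnm 306s_50 -v

# 8-GPU node, using only 8 of the 12 cores: 8 ranks with 1 thread each
mdrun -ntmpi 8 -ntomp 1 -gpu_id 01234567 -s input_50.tpr -deffnm 306s_50 -v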
Using GROMACS 4.6.2 on 144 CPU cores I reach 35 ns/day, while with 3 GPUs and 12 cores I only get 9-11 ns/day.
The command that I use is:
mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
with the number of GPUs set via the submission script:
#BSUB -n 3
I also tried setting -npme / -nt / -ntmpi / -ntomp, but nothing changed.
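For reference, a full submission script with explicit settings would look something like this (just a sketch; the GPU resource-request line is site-specific, so I have left it as a placeholder comment):

#!/bin/bash
#BSUB -n 12              # request all 12 cores of the 3-GPU node
#BSUB -J 306s_50         # job name
# (site-specific GPU resource request goes here)
mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v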
The mdp file and some statistics follow:
-------- START MDP --------
title = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD
; Run parameters
integrator = md ; Algorithm options
nsteps = 25000000 ; maximum number of steps to perform [50 ns]
dt = 0.002 ; 2 fs = 0.002 ps
; Output control
nstxout = 10000 ; [steps] freq to write coordinates to trajectory, the last coordinates are always written
nstvout = 10000 ; [steps] freq to write velocities to trajectory, the last velocities are always written
nstlog = 10000 ; [steps] freq to write energies to log file, the last energies are always written
nstenergy = 10000 ; [steps] write energies to disk every nstenergy steps
nstxtcout = 10000 ; [steps] freq to write coordinates to xtc trajectory
xtc_precision = 1000 ; precision to write to xtc trajectory (1000 = default)
xtc_grps = system ; which coordinate group(s) to write to disk
energygrps = system ; or System / which energy group(s) to write
; Bond parameters
continuation = yes ; restarting from npt
constraints = all-bonds ; Bond types to replace by constraints
constraint_algorithm = lincs ; holonomic constraints
lincs_iter = 1 ; accuracy of LINCS
lincs_order = 4 ; also related to accuracy
lincs_warnangle = 30 ; [degrees] maximum angle that a bond can rotate before LINCS will complain
; Neighborsearching
ns_type = grid ; method of updating neighbor list
cutoff-scheme = Verlet
nstlist = 10 ; [steps] frequency to update the neighbor list (10)
rlist = 1.0 ; [nm] cut-off distance for the short-range neighbor list (1 default)
rcoulomb = 1.0 ; [nm] long range electrostatic cut-off
rvdw = 1.0 ; [nm] long range Van der Waals cut-off
; Electrostatics
coulombtype = PME ; treatment of long range electrostatic interactions
vdwtype = cut-off ; treatment of Van der Waals interactions
; Periodic boundary conditions
pbc = xyz
; Dispersion correction
DispCorr = EnerPres ; applying long range dispersion corrections
; Ewald
fourierspacing = 0.12 ; grid spacing for FFT - controls the highest magnitude of wave vectors (0.12)
pme_order = 4 ; interpolation order for PME, 4 = cubic
ewald_rtol = 1e-5 ; relative strength of Ewald-shifted potential at rcoulomb
; Temperature coupling
tcoupl = nose-hoover ; temperature coupling with Nose-Hoover ensemble
tc_grps = Protein Non-Protein
tau_t = 0.4 0.4 ; [ps] time constant
ref_t = 310 310 ; [K] reference temperature for coupling [310 K ~ 37 °C]
; Pressure coupling
pcoupl = parrinello-rahman
pcoupltype = isotropic ; uniform scaling of box vectors
tau_p = 2.0 ; [ps] time constant
ref_p = 1.0 ; [bar] reference pressure for coupling
compressibility = 4.5e-5 ; [bar^-1] isothermal compressibility of water
refcoord_scaling = com ; see chapter 7 of the GROMACS documentation
; Velocity generation
gen_vel = no ; generate velocities in grompp according to a Maxwell distribution
-------- END MDP --------
-------- START STATISTICS --------
P P - P M E L O A D B A L A N C I N G
PP/PME load balancing changed the cut-off and PME settings:
              particle-particle            PME
              rcoulomb  rlist       grid         spacing  1/beta
   initial    1.000 nm  1.155 nm    100 128  96  0.120 nm 0.320 nm
   final      1.201 nm  1.356 nm     96 100  80  0.144 nm 0.385 nm
   cost-ratio           1.62                     0.62
(note that these numbers concern only part of the total PP and PME load)
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 54749.0
av. #atoms communicated per step for LINCS: 2 x 5418.4
Average load imbalance: 12.8 %
Part of the total run time spent waiting due to load imbalance: 1.4 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 0 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
 Computing:          Nodes  Th.      Count   Wall t (s)      G-Cycles     %
-----------------------------------------------------------------------------
 Domain decomp.          3    4     625000    10388.307    315806.805   2.3
 DD comm. load           3    4     625000      129.908      3949.232   0.0
 DD comm. bounds         3    4     625001      267.204      8123.069   0.1
 Neighbor search         3    4     625001     7756.651    235803.900   1.7
 Launch GPU ops.         3    4   50000002     3376.764    102654.354   0.8
 Comm. coord.            3    4   24375000    10651.213    323799.209   2.4
 Force                   3    4   25000001    35732.146   1086265.102   8.0
 Wait + Comm. F          3    4   25000001    19866.403    603943.033   4.5
 PME mesh                3    4   25000001   235964.754   7173380.387  53.0
 Wait GPU nonlocal       3    4   25000001    12055.970    366504.140   2.7
 Wait GPU local          3    4   25000001      106.179      3227.866   0.0
 NB X/F buffer ops.      3    4   98750002    10256.750    311807.459   2.3
 Write traj.             3    4       2994      249.770      7593.073   0.1
 Update                  3    4   25000001    33108.852   1006516.379   7.4
 Constraints             3    4   25000001    51671.482   1570824.423  11.6
 Comm. energies          3    4    2500001      463.135     14079.404   0.1
 Rest                    3                    13290.037    404020.040   3.0
-----------------------------------------------------------------------------
 Total                   3                   445335.526  13538297.876 100.0
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
 PME redist. X/F         3    4   50000002    40747.165   1238722.760   9.1
 PME spread/gather       3    4   50000002   122026.128   3709621.109  27.4
 PME 3D-FFT              3    4   50000002    46613.023   1417046.140  10.5
 PME 3D-FFT Comm.        3    4   50000002    20934.134    636402.285   4.7
 PME solve               3    4   25000001     5465.690    166158.163   1.2
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)      (%)
       Time:  5317976.200   445335.526   1194.2
                            5d03h42:15
                 (ns/day)    (hour/ns)
Performance:        9.701        2.474
-------- END STATISTICS --------
Thank you very much for the help.
cheers,
Fra
--
Francesco Carbone
PhD student
Institute of Structural and Molecular Biology
UCL, London
fra.carbone.12 at ucl.ac.uk