[gmx-users] gpu cluster explanation
Francesco
fracarb at myopera.com
Fri Jul 12 14:26:14 CEST 2013
Hi all,
I'm working with a 200K-atom system (protein + explicit water) and, after a while on a CPU cluster, I had to switch to a GPU cluster.
I read both the "Acceleration and parallelization" and the GROMACS-GPU documentation pages
(http://www.gromacs.org/Documentation/Acceleration_and_parallelization
and
http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
but they are a bit confusing, and I would like to check whether I have understood them correctly. :)
I have 2 types of nodes:
- 3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon E5649 @ 2.53 GHz)
- 8 GPUs and 2 CPUs (6 cores each)
1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can have at most 3 MPI ranks.
2) Because I have 12 cores, I can open 4 OpenMP threads per MPI rank, since 4 x 3 = 12.
Now, if I have a node with 8 GPUs, I can only use 4 of them: 4 MPI ranks with 3 OpenMP threads each.
Is that right?
Is it also possible to use 8 GPUs and only 8 cores? (A possible launch line for both cases is sketched below.)
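If I have understood the documentation, the explicit launch lines for the two cases would look roughly like this (assuming a thread-MPI build of GROMACS 4.6, so the -ntmpi/-ntomp/-gpu_id flags; please correct me if the GPU-id mapping is wrong):

# 3-GPU node, 12 cores: 3 thread-MPI ranks, 4 OpenMP threads each, one GPU per rank
mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -s input_50.tpr -deffnm 306s_50 -v

# 8-GPU node, using only 8 of the 12 cores: 8 ranks with 1 thread each
mdrun -ntmpi 8 -ntomp 1 -gpu_id 01234567 -s input_50.tpr -deffnm 306s_50 -v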
Using GROMACS 4.6.2 on 144 CPU cores I reach 35 ns/day, while with 3 GPUs and 12 cores I only get 9-11 ns/day.
The command that I use is:
mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
with the number of GPUs set via the submission script:
#BSUB -n 3
I also tried setting -npme / -nt / -ntmpi / -ntomp, but nothing changed.
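For reference, a full submission script with explicit settings would look something like this (just a sketch; the GPU resource-request line is site-specific, so I have left it as a placeholder comment):

#!/bin/bash
#BSUB -n 12              # request all 12 cores of the 3-GPU node
#BSUB -J 306s_50         # job name
# (site-specific GPU resource request goes here)
mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v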
The mdp file and some statistics follow:
-------- START MDP --------
title = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD
; Run parameters
integrator = md ; Algorithm options
nsteps = 25000000 ; maximum number of steps to perform [50 ns]
dt = 0.002 ; 2 fs = 0.002 ps
; Output control
nstxout = 10000 ; [steps] freq to write coordinates to trajectory, the last coordinates are always written
nstvout = 10000 ; [steps] freq to write velocities to trajectory, the last velocities are always written
nstlog = 10000 ; [steps] freq to write energies to log file, the last energies are always written
nstenergy = 10000 ; [steps] write energies to disk every nstenergy steps
nstxtcout = 10000 ; [steps] freq to write coordinates to xtc trajectory
xtc_precision = 1000 ; precision to write to xtc trajectory (1000 = default)
xtc_grps = system ; which coordinate group(s) to write to disk
energygrps = system ; or System / which energy group(s) to write
; Bond parameters
continuation = yes ; restarting from npt
constraints = all-bonds ; Bond types to replace by constraints
constraint_algorithm = lincs ; holonomic constraints
lincs_iter = 1 ; accuracy of LINCS
lincs_order = 4 ; also related to accuracy
lincs_warnangle = 30 ; [degrees] maximum angle that a bond can rotate before LINCS will complain
; Neighborsearching
ns_type = grid ; method of updating neighbor list
cutoff-scheme = Verlet
nstlist = 10 ; [steps] frequency to update the neighbor list (10)
rlist = 1.0 ; [nm] cut-off distance for the short-range neighbor list (1 default)
rcoulomb = 1.0 ; [nm] long range electrostatic cut-off
rvdw = 1.0 ; [nm] long range Van der Waals cut-off
; Electrostatics
coulombtype = PME ; treatment of long range electrostatic interactions
vdwtype = cut-off ; treatment of Van der Waals interactions
; Periodic boundary conditions
pbc = xyz
; Dispersion correction
DispCorr = EnerPres ; applying long range dispersion corrections
; Ewald
fourierspacing = 0.12 ; grid spacing for FFT - controls the highest magnitude of wave vectors (0.12)
pme_order = 4 ; interpolation order for PME, 4 = cubic
ewald_rtol = 1e-5 ; relative strength of Ewald-shifted potential at rcoulomb
; Temperature coupling
tcoupl = nose-hoover ; temperature coupling with Nose-Hoover ensemble
tc_grps = Protein Non-Protein
tau_t = 0.4 0.4 ; [ps] time constant
ref_t = 310 310 ; [K] reference temperature for coupling [310 K ~ 37 °C]
; Pressure coupling
pcoupl = parrinello-rahman
pcoupltype = isotropic ; uniform scaling of box vectors
tau_p = 2.0 ; [ps] time constant
ref_p = 1.0 ; [bar] reference pressure for coupling
compressibility = 4.5e-5 ; [bar^-1] isothermal compressibility of water
refcoord_scaling = com ; see chapter 7 of the GROMACS documentation
; Velocity generation
gen_vel = no ; generate velocities in grompp according to a Maxwell distribution
-------- END MDP --------
-------- START STATISTICS --------
P P - P M E L O A D B A L A N C I N G
PP/PME load balancing changed the cut-off and PME settings:
              particle-particle            PME
              rcoulomb  rlist       grid         spacing  1/beta
   initial    1.000 nm  1.155 nm    100 128  96  0.120 nm 0.320 nm
   final      1.201 nm  1.356 nm     96 100  80  0.144 nm 0.385 nm
   cost-ratio           1.62                     0.62
(note that these numbers concern only part of the total PP and PME load)
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 54749.0
av. #atoms communicated per step for LINCS: 2 x 5418.4
Average load imbalance: 12.8 %
Part of the total run time spent waiting due to load imbalance: 1.4 %
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 0 %
R E A L C Y C L E A N D T I M E A C C O U N T I N G
 Computing:          Nodes  Th.      Count   Wall t (s)      G-Cycles     %
-----------------------------------------------------------------------------
 Domain decomp.          3    4     625000    10388.307    315806.805   2.3
 DD comm. load           3    4     625000      129.908      3949.232   0.0
 DD comm. bounds         3    4     625001      267.204      8123.069   0.1
 Neighbor search         3    4     625001     7756.651    235803.900   1.7
 Launch GPU ops.         3    4   50000002     3376.764    102654.354   0.8
 Comm. coord.            3    4   24375000    10651.213    323799.209   2.4
 Force                   3    4   25000001    35732.146   1086265.102   8.0
 Wait + Comm. F          3    4   25000001    19866.403    603943.033   4.5
 PME mesh                3    4   25000001   235964.754   7173380.387  53.0
 Wait GPU nonlocal       3    4   25000001    12055.970    366504.140   2.7
 Wait GPU local          3    4   25000001      106.179      3227.866   0.0
 NB X/F buffer ops.      3    4   98750002    10256.750    311807.459   2.3
 Write traj.             3    4       2994      249.770      7593.073   0.1
 Update                  3    4   25000001    33108.852   1006516.379   7.4
 Constraints             3    4   25000001    51671.482   1570824.423  11.6
 Comm. energies          3    4    2500001      463.135     14079.404   0.1
 Rest                    3                    13290.037    404020.040   3.0
-----------------------------------------------------------------------------
 Total                   3                   445335.526  13538297.876 100.0
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
 PME redist. X/F         3    4   50000002    40747.165   1238722.760   9.1
 PME spread/gather       3    4   50000002   122026.128   3709621.109  27.4
 PME 3D-FFT              3    4   50000002    46613.023   1417046.140  10.5
 PME 3D-FFT Comm.        3    4   50000002    20934.134    636402.285   4.7
 PME solve               3    4   25000001     5465.690    166158.163   1.2
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)      (%)
       Time:  5317976.200   445335.526   1194.2
                            5d03h42:15
                 (ns/day)    (hour/ns)
Performance:        9.701        2.474
-------- END STATISTICS --------
Thank you very much for the help.
cheers,
Fra
--
Francesco Carbone
PhD student
Institute of Structural and Molecular Biology
UCL, London
fra.carbone.12 at ucl.ac.uk