[gmx-users] gpu cluster explanation
Richard Broadbent
richard.broadbent09 at imperial.ac.uk
Fri Jul 12 17:41:27 CEST 2013
On 12/07/13 13:26, Francesco wrote:
> Hi all,
> I'm working with a 200K-atom system (protein + explicit water) and
> after a while using a CPU cluster I had to switch to a GPU cluster.
> I read both the "Acceleration and parallelization" and the GROMACS-GPU
> documentation pages
> (http://www.gromacs.org/Documentation/Acceleration_and_parallelization
> and
> http://www.gromacs.org/Documentation/Installation_Instructions_4.5/GROMACS-OpenMM)
> but it's a bit confusing and I need help to check whether I have
> understood it correctly. :)
> I have 2 types of nodes:
> 3 GPUs (NVIDIA Tesla M2090) and 2 CPUs with 6 cores each (Intel Xeon
> E5649 @ 2.53 GHz)
> 8 GPUs and 2 CPUs (6 cores each)
>
> 1) I can only have 1 MPI rank per GPU, meaning that with 3 GPUs I can
> have 3 MPI ranks max.
> 2) because I have 12 cores I can open 4 OpenMP threads per MPI rank,
> because 4 x 3 = 12
>
> Now if I have a node with 8 GPUs, I can only use 4 GPUs:
> 4 MPI ranks with 3 OpenMP threads each.
> Is that right?
> Is it possible to use 8 GPUs and only 8 cores?
You could set -ntomp 0 and set up MPI/thread-MPI so that only 8 cores are
used. However, a system that unbalanced (a huge amount of GPU power against
comparatively little CPU power) is unlikely to get great performance.
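
For example, something along these lines maps ranks to GPUs explicitly (just
a sketch, untested on your machine, and assuming the single-node thread-MPI
build of mdrun 4.6; the .tpr and -deffnm names are taken from your command):

   # 3-GPU node, 12 cores: 3 thread-MPI ranks with 4 OpenMP threads each
   mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v

   # 8-GPU node, using only 8 of the 12 cores: 8 ranks with 1 OpenMP thread each
   mdrun -ntmpi 8 -ntomp 1 -gpu_id 01234567 -dlb yes -s input_50.tpr -deffnm 306s_50 -v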
>
> Using gromacs 4.6.2 and 144 cpu cores I reach 35 ns/day, while with 3
> gpu and 12 cores I get 9-11 ns/day.
>
That slowdown is in line with what I got when I tried a similar CPU-GPU
setup. That said, others might have some advice that will improve your
performance.
> the command that I use is:
> mdrun -dlb yes -s input_50.tpr -deffnm 306s_50 -v
> with the number of GPUs set via the batch script:
> #BSUB -n 3
>
> I also tried to set -npme / -nt / -ntmpi / -ntomp, but nothing changes.
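
If you do pass those options, they need to be consistent with each other and
with the number of GPUs on the node. A minimal LSF script along these lines
would make the rank/thread mapping explicit (again only a sketch: the job
name and output file are made up, and how GPUs are requested is
site-specific, so check your cluster's documentation):

   #!/bin/bash
   #BSUB -n 12                # 12 cores on a 3-GPU node
   #BSUB -J 306s_50           # hypothetical job name
   #BSUB -o 306s_50.%J.out    # hypothetical stdout file
   mdrun -ntmpi 3 -ntomp 4 -gpu_id 012 -dlb yes -s input_50.tpr -deffnm 306s_50 -v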
>
> The mdp file and some statistics follow:
>
> -------- START MDP --------
>
> title = G6PD wt molecular dynamics (2bhl.pdb) - NPT MD
>
> ; Run parameters
> integrator             = md          ; Algorithm options
> nsteps                 = 25000000    ; maximum number of steps to perform [50 ns]
> dt                     = 0.002       ; 2 fs = 0.002 ps
>
> ; Output control
> nstxout                = 10000       ; [steps] freq to write coordinates to trajectory, the last coordinates are always written
> nstvout                = 10000       ; [steps] freq to write velocities to trajectory, the last velocities are always written
> nstlog                 = 10000       ; [steps] freq to write energies to log file, the last energies are always written
> nstenergy              = 10000       ; [steps] write energies to disk every nstenergy steps
> nstxtcout              = 10000       ; [steps] freq to write coordinates to xtc trajectory
> xtc_precision          = 1000        ; precision to write to xtc trajectory (1000 = default)
> xtc_grps               = system      ; which coordinate group(s) to write to disk
> energygrps             = system      ; or System / which energy group(s) to write
>
> ; Bond parameters
> continuation           = yes         ; restarting from npt
> constraints            = all-bonds   ; bond types to replace by constraints
> constraint_algorithm   = lincs       ; holonomic constraints
> lincs_iter             = 1           ; accuracy of LINCS
> lincs_order            = 4           ; also related to accuracy
> lincs_warnangle        = 30          ; [degrees] maximum angle that a bond can rotate before LINCS will complain
>
That seems a little loose for the constraints, but setting that up and
checking that it conserves energy and preserves bond lengths is something
you'll have to do yourself.
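
For instance, you could plot the conserved quantity with g_energy and watch
for drift (just a sketch; the exact energy-term name depends on your
settings, and the .edr name is taken from your -deffnm):

   echo "Conserved-En." | g_energy -f 306s_50.edr -o conserved.xvg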
Richard
> ; Neighborsearching
> ns_type                = grid        ; method of updating neighbor list
> cutoff-scheme          = Verlet
> nstlist                = 10          ; [steps] frequency to update neighbor list (10)
> rlist                  = 1.0         ; [nm] cut-off distance for the short-range neighbor list (1 default)
> rcoulomb               = 1.0         ; [nm] long range electrostatic cut-off
> rvdw                   = 1.0         ; [nm] long range Van der Waals cut-off
>
> ; Electrostatics
> coulombtype            = PME         ; treatment of long range electrostatic interactions
> vdwtype                = cut-off     ; treatment of Van der Waals interactions
>
> ; Periodic boundary conditions
> pbc = xyz
>
> ; Dispersion correction
> DispCorr               = EnerPres    ; applying long range dispersion corrections
>
> ; Ewald
> fourierspacing         = 0.12        ; grid spacing for FFT - controls the highest magnitude of wave vectors (0.12)
> pme_order              = 4           ; interpolation order for PME, 4 = cubic
> ewald_rtol             = 1e-5        ; relative strength of Ewald-shifted potential at rcoulomb
>
> ; Temperature coupling
> tcoupl                 = nose-hoover          ; temperature coupling with Nose-Hoover ensemble
> tc_grps                = Protein Non-Protein
> tau_t                  = 0.4     0.4          ; [ps] time constant
> ref_t                  = 310     310          ; [K] reference temperature for coupling (310 K = 37 °C)
>
> ; Pressure coupling
> pcoupl                 = parrinello-rahman
> pcoupltype             = isotropic   ; uniform scaling of box vectors
> tau_p                  = 2.0         ; [ps] time constant
> ref_p                  = 1.0         ; [bar] reference pressure for coupling
> compressibility        = 4.5e-5      ; [bar^-1] isothermal compressibility of water
> refcoord_scaling       = com         ; have a look at GROMACS documentation, chapter 7
>
> ; Velocity generation
> gen_vel                = no          ; generate velocities in grompp according to a Maxwell distribution
>
> -------- END MDP --------
>
> -------- START STATISTICS --------
>
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
>               particle-particle              PME
>               rcoulomb  rlist     grid        spacing   1/beta
>    initial    1.000 nm  1.155 nm  100 128  96 0.120 nm  0.320 nm
>    final      1.201 nm  1.356 nm   96 100  80 0.144 nm  0.385 nm
>    cost-ratio           1.62                   0.62
> (note that these numbers concern only part of the total PP and PME load)
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 54749.0
> av. #atoms communicated per step for LINCS: 2 x 5418.4
>
> Average load imbalance: 12.8 %
> Part of the total run time spent waiting due to load imbalance: 1.4 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 0 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
>  Computing:          Nodes  Th.     Count   Wall t (s)      G-Cycles     %
> -----------------------------------------------------------------------------
>  Domain decomp.          3   4     625000    10388.307    315806.805   2.3
>  DD comm. load           3   4     625000      129.908      3949.232   0.0
>  DD comm. bounds         3   4     625001      267.204      8123.069   0.1
>  Neighbor search         3   4     625001     7756.651    235803.900   1.7
>  Launch GPU ops.         3   4   50000002     3376.764    102654.354   0.8
>  Comm. coord.            3   4   24375000    10651.213    323799.209   2.4
>  Force                   3   4   25000001    35732.146   1086265.102   8.0
>  Wait + Comm. F          3   4   25000001    19866.403    603943.033   4.5
>  PME mesh                3   4   25000001   235964.754   7173380.387  53.0
>  Wait GPU nonlocal       3   4   25000001    12055.970    366504.140   2.7
>  Wait GPU local          3   4   25000001      106.179      3227.866   0.0
>  NB X/F buffer ops.      3   4   98750002    10256.750    311807.459   2.3
>  Write traj.             3   4       2994      249.770      7593.073   0.1
>  Update                  3   4   25000001    33108.852   1006516.379   7.4
>  Constraints             3   4   25000001    51671.482   1570824.423  11.6
>  Comm. energies          3   4    2500001      463.135     14079.404   0.1
>  Rest                    3                   13290.037    404020.040   3.0
> -----------------------------------------------------------------------------
>  Total                   3                  445335.526  13538297.876 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
>  PME redist. X/F         3   4   50000002    40747.165   1238722.760   9.1
>  PME spread/gather       3   4   50000002   122026.128   3709621.109  27.4
>  PME 3D-FFT              3   4   50000002    46613.023   1417046.140  10.5
>  PME 3D-FFT Comm.        3   4   50000002    20934.134    636402.285   4.7
>  PME solve               3   4   25000001     5465.690    166158.163   1.2
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 5317976.200 445335.526 1194.2
> 5d03h42:15
> (ns/day) (hour/ns)
> Performance: 9.701 2.474
>
> -------- END STATISTICS --------
>
> thank you very much for the help.
> cheers,
> Fra
>