[gmx-users] low performance 2 GTX 980+ Intel CPU Core i7-5930K 3.5 GHz (2011-3)
Carlos Navarro Retamal
carlos.navarro87 at gmail.com
Tue Dec 30 17:50:55 CET 2014
Dear GROMACS users,
I am now adding the log files of the most important tests (2 GPUs -> 1 job, and 1 GPU -> 1 job).
Both GPUs:
command:
mdrun -v
file: http://cl.ly/3w1C2k1X2J2W
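If I understand the mdrun defaults correctly, this implicit launch should be roughly equivalent to the explicit form below (just my sketch of what mdrun chose, based on the log reporting 2 thread-MPI ranks with 6 OpenMP threads each):
mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -pin on -v
i.e. one GTX 980 per PP rank, with the threads pinned to cores.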
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
> particle-particle PME
> rcoulomb rlist grid spacing 1/beta
> initial 0.800 nm 0.861 nm 96 42 44 0.157 nm 0.256 nm
> final 1.304 nm 1.365 nm 60 25 28 0.261 nm 0.418 nm
> cost-ratio 3.99 0.24
> (note that these numbers concern only part of the total PP and PME load)
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 16236.7
> av. #atoms communicated per step for vsites: 3 x 282.4
> av. #atoms communicated per step for LINCS: 2 x 876.5
>
> Average load imbalance: 4.4 %
> Part of the total run time spent waiting due to load imbalance: 0.5 %
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 2 MPI ranks, each using 6 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Domain decomp. 2 6 2000 11.978 503.055 4.8
> DD comm. load 2 6 472 0.003 0.141 0.0
> Vsite constr. 2 6 50001 7.975 334.960 3.2
> Neighbor search 2 6 2001 11.470 481.752 4.6
> Launch GPU ops. 2 6 100002 5.066 212.761 2.0
> Comm. coord. 2 6 48000 3.355 140.897 1.4
> Force 2 6 50001 23.603 991.346 9.5
> Wait + Comm. F 2 6 50001 3.747 157.374 1.5
> PME mesh 2 6 50001 103.250 4336.481 41.6
> Wait GPU nonlocal 2 6 50001 2.200 92.408 0.9
> Wait GPU local 2 6 50001 0.119 5.015 0.0
> NB X/F buffer ops. 2 6 196002 10.840 455.295 4.4
> Vsite spread 2 6 52002 13.895 583.601 5.6
> Write traj. 2 6 6 0.098 4.137 0.0
> Update 2 6 50001 17.625 740.244 7.1
> Constraints 2 6 50001 22.554 947.251 9.1
> Comm. energies 2 6 2001 0.025 1.035 0.0
> Rest 10.193 428.120 4.1
> -----------------------------------------------------------------------------
> Total 247.997 10415.874 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F 2 6 100002 19.102 802.277 7.7
> PME spread/gather 2 6 100002 75.116 3154.875 30.3
> PME 3D-FFT 2 6 100002 6.673 280.281 2.7
> PME 3D-FFT Comm. 2 6 100002 1.437 60.365 0.6
> PME solve Elec 2 6 50001 0.627 26.314 0.3
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 2951.251 247.997 1190.0
> (ns/day) (hour/ns)
> Performance: 34.840 0.689
> Finished mdrun on rank 0 Tue Dec 30 10:49:37 2014
1 GPU:
command:
mdrun -gpu_id 0 -v
file: http://cl.ly/1y3V0j2f263U
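Again, my understanding (just a sketch) is that this run corresponds to the explicit form:
mdrun -ntmpi 1 -ntomp 12 -gpu_id 0 -pin on -v
i.e. a single rank using all 12 hardware threads (6 cores with hyper-threading) and driving GPU 0.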
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
> particle-particle PME
> rcoulomb rlist grid spacing 1/beta
> initial 0.800 nm 0.861 nm 96 42 44 0.157 nm 0.256 nm
> final 0.906 nm 0.967 nm 84 36 40 0.181 nm 0.290 nm
> cost-ratio 1.42 0.68
> (note that these numbers concern only part of the total PP and PME load)
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 12 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Vsite constr. 1 12 50001 5.023 210.973 2.4
> Neighbor search 1 12 2001 10.913 458.333 5.1
> Launch GPU ops. 1 12 50001 3.921 164.665 1.8
> Force 1 12 50001 18.450 774.912 8.7
> PME mesh 1 12 50001 97.768 4106.267 46.0
> Wait GPU local 1 12 50001 4.037 169.543 1.9
> NB X/F buffer ops. 1 12 98001 9.479 398.121 4.5
> Vsite spread 1 12 52002 6.739 283.049 3.2
> Write traj. 1 12 6 0.182 7.663 0.1
> Update 1 12 50001 18.434 774.236 8.7
> Constraints 1 12 50001 20.438 858.380 9.6
> Rest 17.040 715.679 8.0
> -----------------------------------------------------------------------------
> Total 212.424 8921.821 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME spread/gather 1 12 100002 80.808 3393.931 38.0
> PME 3D-FFT 1 12 100002 15.250 640.517 7.2
> PME solve Elec 1 12 50001 1.270 53.334 0.6
> -----------------------------------------------------------------------------
>
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 2001 1.362 0.681 1.2
> X / q H2D 50001 12.740 0.255 10.9
> Nonbonded F kernel 48000 87.147 1.816 74.6
> Nonbonded F+ene+prune k. 2001 6.513 3.255 5.6
> F D2H 50001 9.063 0.181 7.8
> -----------------------------------------------------------------------------
> Total 116.825 2.336 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 2.336 ms/2.324 ms = 1.005
> For optimal performance this ratio should be close to 1!
>
> Core t (s) Wall t (s) (%)
> Time: 2524.140 212.424 1188.3
> (ns/day) (hour/ns)
> Performance: 40.674 0.590
> Finished mdrun on rank 0 Tue Dec 30 10:43:59 2014
Kind regards,
Carlos
--
Carlos Navarro Retamal
Bioinformatic engineer
Ph.D(c) in Applied Science, Universidad de Talca, Chile
Center of Bioinformatics and Molecular Simulations (CBSM)
Universidad de Talca
2 Norte 685, Casilla 721, Talca - Chile
Phone: 56-71-201 798,
Fax: 56-71-201 561
Email: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
On Monday, December 29, 2014 at 9:43 PM, Carlos Navarro Retamal wrote:
> Dear GROMACS users,
> I recently bought a workstation that has two GTX 980s plus an i7 (Intel Core i7-5930K 3.5 GHz (2011-3)).
> In order to test it, I ran an MD simulation of a system containing ~90k atoms.
> These are the performance results:
>
> 2 GPUs (1 job):
> 34 ns/day (each card was working at about 40% load)
>
> 1 GPU (Nº1) (1 job):
> 37 ns/day (~65% load)
>
> 1 GPU (Nº2) (1 job):
> 36 ns/day (~65% load)
>
> 2 GPUs (2 jobs simultaneously):
> 16 ns/day and 16 ns/day, respectively (~20% load each)
>
> With respect to the last test, the .log file shows the following message:
>
> Force evaluation time GPU/CPU: 3.177 ms/5.804 ms = 0.547
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss.
>
>
> So probably, since the CPU is being split between the two jobs, the GPU/CPU ratio gets even worse.
>
> Is there a way I can solve this issue? It is kind of sad that I am getting better performance with one GPU than with two, since I have seen that when I add a third or even a fourth one the performance starts to decrease.
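>
> Would it help to give each simultaneous job its own GPU and its own half of the physical cores? Something like the sketch below is what I have in mind (assuming I am reading mdrun's -pin/-pinoffset/-pinstride options correctly; the exact offsets may need adjusting for this CPU):
>
> mdrun -ntomp 6 -gpu_id 0 -pin on -pinoffset 0 -pinstride 1 -v
> mdrun -ntomp 6 -gpu_id 1 -pin on -pinoffset 6 -pinstride 1 -v
>
> so that the two runs do not compete for the same cores.
>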
> Here’s my .mdp file:
>
> > title = Protein-ligand complex MD simulation
> > ; Run parameters
> > integrator = md ; leap-frog integrator
> > nsteps = 15000000 ; 15000000 * 0.002 ps = 30000 ps (30 ns)
> > dt = 0.002 ; 2 fs
> > ; Output control
> > nstxout = 0 ; suppress coordinate output to .trr
> > nstvout = 0 ; suppress velocity output to .trr
> > nstenergy = 10000 ; save energies every 20 ps
> > nstlog = 10000 ; update log file every 20 ps
> > nstxtcout = 15000 ; write .xtc trajectory every 30 ps
> > energygrps = Protein non-Protein
> > ; Bond parameters
> > continuation = yes ; continuing from a previous run
> > constraint_algorithm = lincs ; holonomic constraints
> > constraints = all-bonds ; all bonds (even heavy atom-H bonds) constrained
> > lincs_iter = 1 ; accuracy of LINCS
> > lincs_order = 4 ; also related to accuracy
> > ; Neighborsearching
> > ns_type = grid ; search neighboring grid cells
> > nstlist = 10 ; 20 fs
> > cutoff-scheme = Verlet
> > rlist = 1.0 ; short-range neighborlist cutoff (in nm)
> > rcoulomb = 1.0 ; short-range electrostatic cutoff (in nm)
> > rvdw = 1.0 ; short-range van der Waals cutoff (in nm)
> > ; Electrostatics
> > coulombtype = PME ; Particle Mesh Ewald for long-range electrostatics
> > pme_order = 4 ; cubic interpolation
> > fourierspacing = 0.16 ; grid spacing for FFT
> > ; Temperature coupling
> > tcoupl = V-rescale ; modified Berendsen thermostat
> > tc-grps = Protein non-Protein ; two coupling groups - more accurate
> > tau_t = 0.1 0.1 ; time constant, in ps
> > ref_t = 300 300 ; reference temperature, one for each group, in K
> > ; Pressure coupling
> > pcoupl = Parrinello-Rahman ; pressure coupling is on
> > pcoupltype = isotropic ; uniform scaling of box vectors
> > tau_p = 2.0 ; time constant, in ps
> > ref_p = 1.0 ; reference pressure, in bar
> > compressibility = 4.5e-5 ; isothermal compressibility of water, bar^-1
> > ; Periodic boundary conditions
> > pbc = xyz ; 3-D PBC
> > ; Dispersion correction
> > DispCorr = EnerPres ; account for cut-off vdW scheme
> > ; Velocity generation
> > gen_vel = no ; do not generate velocities (continuation run)
> >
>
>
> Kind regards,
> Carlos
> --
> Carlos Navarro Retamal
> Bioinformatic engineer
> Ph.D(c) in Applied Science, Universidad de Talca, Chile
> Center of Bioinformatics and Molecular Simulations (CBSM)
> Universidad de Talca
> 2 Norte 685, Casilla 721, Talca - Chile
> Phone: 56-71-201 798,
> Fax: 56-71-201 561
> Email: carlos.navarro87 at gmail.com (mailto:carlos.navarro87 at gmail.com) or cnavarro at utalca.cl (mailto:cnavarro at utalca.cl)
>