[gmx-users] The question of performance of GPU acceleration
Szilárd Páll
pall.szilard at gmail.com
Wed Jul 6 13:58:08 CEST 2016
Have you tested different ways to launch the run (different numbers of
ranks and threads)? With 12 ranks you seem to be getting quite a lot of
load imbalance, though that may or may not matter. Why pme_order=6?
Please share the full log files; they contain much more information
than just the bit you pasted here.
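For instance, instead of 12 single-threaded ranks you could try fewer
ranks with several OpenMP threads each. Something along these lines
(untested, and assuming the two K10 boards show up as four GPU devices
numbered 0-3):

  # 4 thread-MPI ranks, one per GPU, 3 OpenMP threads each
  gmx mdrun -ntmpi 4 -ntomp 3 -gpu_id 0123 -pin on -deffnm md

  # 12 thread-MPI ranks, 3 ranks sharing each GPU (roughly your current setup)
  gmx mdrun -ntmpi 12 -ntomp 1 -gpu_id 000111222333 -pin on -deffnm md

Timing a few thousand steps with each setup (e.g. -nsteps 5000
-resetstep 2500) is usually enough to see which launch configuration
performs best on your hardware.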
--
Szilárd
On Wed, Jul 6, 2016 at 5:56 AM, DeChang Li <li.dc06 at gmail.com> wrote:
> Dear all,
>
> I am using GPU acceleration in Gromacs-5.0.4 and would like to know
> whether the acceleration performance I am getting is reasonable.
>
>
> Here is my hardware:
>
> 2 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 12 physical cores in total.
> 2 NVIDIA Tesla K10 boards, 6144 GPU cores in total.
> 32 GB DDR4-2133 memory
>
>
> My simulation system contains about 480,000 atoms. I use PME with a
> Fourier grid spacing of 0.16 nm, pme_order = 6, a non-bonded cut-off of
> 1 nm, and nstlist = 40 (approximate .mdp lines below).
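>
> For reference, the corresponding .mdp settings should look roughly like
> this (assuming the Verlet cut-off scheme, since the non-bonded kernels
> run on the GPU):
>
>   cutoff-scheme   = Verlet
>   nstlist         = 40
>   rcoulomb        = 1.0
>   rvdw            = 1.0
>   coulombtype     = PME
>   fourierspacing  = 0.16
>   pme-order       = 6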
>
> The following is the performance report from the log file:
>
> M E G A - F L O P S A C C O U N T I N G
>
> NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
> RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
> W3=SPC/TIP3p W4=TIP4p (single or pairs)
> V&F=Potential and force V=Potential only F=Force only
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------------
> NB VdW [V&F] 30048.030048 30048.030 0.0
> Pair Search distance check 1643106.927152 14787962.344 0.1
> NxN Ewald Elec. + LJ [F] 422678896.374912 27896807160.744 95.8
> NxN Ewald Elec. + LJ [V&F] 4269950.969088 456884753.692 1.6
> 1,4 nonbonded interactions 43309.043309 3897813.898 0.0
> Calc Weights 1426624.426623 51358479.358 0.2
> Spread Q Bspline 30434654.434624 60869308.869 0.2
> Gather F Bspline 30434654.434624 182607926.608 0.6
> 3D-FFT 44406686.579502 355253492.636 1.2
> Solve PME 138238.986240 8847295.119 0.0
> Reset In Box 11888.525000 35665.575 0.0
> CG-CoM 11889.000541 35667.002 0.0
> Propers 36440.036440 8344768.345 0.0
> Impropers 2746.002746 571168.571 0.0
> Virial 19043.716081 342786.889 0.0
> Stop-CM 4755.885541 47558.855 0.0
> Calc-Ekin 95109.151082 2567947.079 0.0
> Lincs 27924.896602 1675493.796 0.0
> Lincs-Mat 809415.182240 3237660.729 0.0
> Constraint-V 539622.857861 4316982.863 0.0
> Constraint-Vir 20468.413005 491241.912 0.0
> Settle 161257.688219 52086233.295 0.2
> (null) 1060.001060 0.000 0.0
> -----------------------------------------------------------------------------
> Total 29105097416.211 100.0
> -----------------------------------------------------------------------------
>
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 341414.5
> av. #atoms communicated per step for LINCS: 2 x 36121.5
>
> Average load imbalance: 75.3 %
> Part of the total run time spent waiting due to load imbalance: 4.1 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Y 2 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 12 MPI ranks
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Domain decomp. 12 1 25000 1011.294 29057.964 2.9
> DD comm. load 12 1 25000 85.817 2465.806 0.2
> DD comm. bounds 12 1 25000 124.974 3590.946 0.4
> Neighbor search 12 1 25001 483.458 13891.413 1.4
> Launch GPU ops. 12 1 2000002 124.094 3565.638 0.4
> Comm. coord. 12 1 975000 1905.608 54754.706 5.5
> Force 12 1 1000001 1489.965 42811.850 4.3
> Wait + Comm. F 12 1 1000001 435.575 12515.570 1.3
> PME mesh 12 1 1000001 19507.755 560525.285 56.2
> Wait GPU nonlocal 12 1 1000001 17.722 509.211 0.1
> Wait GPU local 12 1 1000001 5.608 161.146 0.0
> NB X/F buffer ops. 12 1 3950002 458.601 13177.195 1.3
> COM pull force 12 1 1000001 640.120 18392.870 1.8
> Write traj. 12 1 539 17.620 506.289 0.1
> Update 12 1 1000001 1912.108 54941.466 5.5
> Constraints 12 1 1000001 5255.916 151020.645 15.1
> Comm. energies 12 1 100001 916.317 26328.958 2.6
> Rest 300.654 8638.816 0.9
> -----------------------------------------------------------------------------
> Total 34693.203 996855.772 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F 12 1 2000002 7236.634 207933.530 20.9
> PME spread/gather 12 1 2000002 7925.435 227725.169 22.8
> PME 3D-FFT 12 1 2000002 2616.866 75191.620 7.5
> PME 3D-FFT Comm. 12 1 2000002 1506.177 43277.664 4.3
> PME solve Elec 12 1 1000001 217.415 6247.077 0.6
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 415256.517 34693.203 1196.9
> 9h38:13
> (ns/day) (hour/ns)
> Performance: 4.981 4.818
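>
> (As a sanity check on these numbers: Core t / Wall t = 415256.5 s /
> 34693.2 s is about 12, consistent with 12 single-threaded ranks, and
> 24 / 4.981 ns/day gives the quoted 4.818 hours/ns.)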