[gmx-users] low performance 2 GTX 980+ Intel CPU Core i7-5930K 3.5 GHz (2011-3)
Carlos Navarro Retamal
carlos.navarro87 at gmail.com
Tue Dec 30 17:50:55 CET 2014
Dear GROMACS users,
I am now adding the log files of the most important tests (2 GPUs -> 1 job, and 1 GPU -> 1 job).
Both GPUs:
command:
mdrun -v
file: http://cl.ly/3w1C2k1X2J2W
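If I understand the mdrun defaults correctly, this implicit launch should be roughly equivalent to the explicit form below (just my sketch of what mdrun chose, based on the log reporting 2 thread-MPI ranks with 6 OpenMP threads each):
mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -pin on -v
i.e. one GTX 980 per PP rank, with the threads pinned to cores.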
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
> particle-particle PME
> rcoulomb rlist grid spacing 1/beta
> initial 0.800 nm 0.861 nm 96 42 44 0.157 nm 0.256 nm
> final 1.304 nm 1.365 nm 60 25 28 0.261 nm 0.418 nm
> cost-ratio 3.99 0.24
> (note that these numbers concern only part of the total PP and PME load)
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 16236.7
> av. #atoms communicated per step for vsites: 3 x 282.4
> av. #atoms communicated per step for LINCS: 2 x 876.5
>
> Average load imbalance: 4.4 %
> Part of the total run time spent waiting due to load imbalance: 0.5 %
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 2 MPI ranks, each using 6 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Domain decomp. 2 6 2000 11.978 503.055 4.8
> DD comm. load 2 6 472 0.003 0.141 0.0
> Vsite constr. 2 6 50001 7.975 334.960 3.2
> Neighbor search 2 6 2001 11.470 481.752 4.6
> Launch GPU ops. 2 6 100002 5.066 212.761 2.0
> Comm. coord. 2 6 48000 3.355 140.897 1.4
> Force 2 6 50001 23.603 991.346 9.5
> Wait + Comm. F 2 6 50001 3.747 157.374 1.5
> PME mesh 2 6 50001 103.250 4336.481 41.6
> Wait GPU nonlocal 2 6 50001 2.200 92.408 0.9
> Wait GPU local 2 6 50001 0.119 5.015 0.0
> NB X/F buffer ops. 2 6 196002 10.840 455.295 4.4
> Vsite spread 2 6 52002 13.895 583.601 5.6
> Write traj. 2 6 6 0.098 4.137 0.0
> Update 2 6 50001 17.625 740.244 7.1
> Constraints 2 6 50001 22.554 947.251 9.1
> Comm. energies 2 6 2001 0.025 1.035 0.0
> Rest 10.193 428.120 4.1
> -----------------------------------------------------------------------------
> Total 247.997 10415.874 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME redist. X/F 2 6 100002 19.102 802.277 7.7
> PME spread/gather 2 6 100002 75.116 3154.875 30.3
> PME 3D-FFT 2 6 100002 6.673 280.281 2.7
> PME 3D-FFT Comm. 2 6 100002 1.437 60.365 0.6
> PME solve Elec 2 6 50001 0.627 26.314 0.3
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 2951.251 247.997 1190.0
> (ns/day) (hour/ns)
> Performance: 34.840 0.689
> Finished mdrun on rank 0 Tue Dec 30 10:49:37 2014
1 GPU:
command:
mdrun -gpu_id 0 -v
file: http://cl.ly/1y3V0j2f263U
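Again, my understanding (just a sketch) is that this run corresponds to the explicit form:
mdrun -ntmpi 1 -ntomp 12 -gpu_id 0 -pin on -v
i.e. a single rank using all 12 hardware threads (6 cores with hyper-threading) and driving GPU 0.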
> P P - P M E L O A D B A L A N C I N G
>
> PP/PME load balancing changed the cut-off and PME settings:
> particle-particle PME
> rcoulomb rlist grid spacing 1/beta
> initial 0.800 nm 0.861 nm 96 42 44 0.157 nm 0.256 nm
> final 0.906 nm 0.967 nm 84 36 40 0.181 nm 0.290 nm
> cost-ratio 1.42 0.68
> (note that these numbers concern only part of the total PP and PME load)
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> On 1 MPI rank, each using 12 OpenMP threads
>
> Computing: Num Num Call Wall time Giga-Cycles
> Ranks Threads Count (s) total sum %
> -----------------------------------------------------------------------------
> Vsite constr. 1 12 50001 5.023 210.973 2.4
> Neighbor search 1 12 2001 10.913 458.333 5.1
> Launch GPU ops. 1 12 50001 3.921 164.665 1.8
> Force 1 12 50001 18.450 774.912 8.7
> PME mesh 1 12 50001 97.768 4106.267 46.0
> Wait GPU local 1 12 50001 4.037 169.543 1.9
> NB X/F buffer ops. 1 12 98001 9.479 398.121 4.5
> Vsite spread 1 12 52002 6.739 283.049 3.2
> Write traj. 1 12 6 0.182 7.663 0.1
> Update 1 12 50001 18.434 774.236 8.7
> Constraints 1 12 50001 20.438 858.380 9.6
> Rest 17.040 715.679 8.0
> -----------------------------------------------------------------------------
> Total 212.424 8921.821 100.0
> -----------------------------------------------------------------------------
> Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
> PME spread/gather 1 12 100002 80.808 3393.931 38.0
> PME 3D-FFT 1 12 100002 15.250 640.517 7.2
> PME solve Elec 1 12 50001 1.270 53.334 0.6
> -----------------------------------------------------------------------------
>
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 2001 1.362 0.681 1.2
> X / q H2D 50001 12.740 0.255 10.9
> Nonbonded F kernel 48000 87.147 1.816 74.6
> Nonbonded F+ene+prune k. 2001 6.513 3.255 5.6
> F D2H 50001 9.063 0.181 7.8
> -----------------------------------------------------------------------------
> Total 116.825 2.336 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 2.336 ms/2.324 ms = 1.005
> For optimal performance this ratio should be close to 1!
>
> Core t (s) Wall t (s) (%)
> Time: 2524.140 212.424 1188.3
> (ns/day) (hour/ns)
> Performance: 40.674 0.590
> Finished mdrun on rank 0 Tue Dec 30 10:43:59 2014
Kind regards,
Carlos
--
Carlos Navarro Retamal
Bioinformatic engineer
Ph.D(c) in Applied Science, Universidad de Talca, Chile
Center of Bioinformatics and Molecular Simulations (CBSM)
Universidad de Talca
2 Norte 685, Casilla 721, Talca - Chile
Phone: 56-71-201 798,
Fax: 56-71-201 561
Email: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
On Monday, December 29, 2014 at 9:43 PM, Carlos Navarro Retamal wrote:
> Dear GROMACS users,
> I recently bought a workstation that has two GTX 980s plus an i7 (Intel Core i7-5930K 3.5 GHz (2011-3)).
> In order to test it, I ran an MD simulation of a system containing ~90k atoms.
> These are the performance results:
>
> 2 GPUs (1 job):
> 34 ns/day (each card was working at about 40% load)
>
> 1 GPU (Nº1) (1 job):
> 37 ns/day (~65% load)
>
> 1 GPU (Nº2) (1 job):
> 36 ns/day (~65% load)
>
> 2 GPUs (2 jobs simultaneously):
> 16 ns/day and 16 ns/day, respectively (~20% load each)
>
> With respect to the last test, the .log file shows the following message:
>
> Force evaluation time GPU/CPU: 3.177 ms/5.804 ms = 0.547
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> performance loss.
>
>
> So probably, since the CPU is being split between the two jobs, the GPU/CPU ratio gets even worse.
>
> Is there a way I can solve this issue? It is kind of sad that I am getting better performance with one GPU than with two, since I have seen that when I add a third or even a fourth one the performance starts to decrease.
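>
> Would it help to give each simultaneous job its own GPU and its own half of the physical cores? Something like the sketch below is what I have in mind (assuming I am reading mdrun's -pin/-pinoffset/-pinstride options correctly; the exact offsets may need adjusting for this CPU):
>
> mdrun -ntomp 6 -gpu_id 0 -pin on -pinoffset 0 -pinstride 1 -v
> mdrun -ntomp 6 -gpu_id 1 -pin on -pinoffset 6 -pinstride 1 -v
>
> so that the two runs do not compete for the same cores.
>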
> Here’s my .mdp file:
>
> > title = Protein-ligand complex MD simulation
> > ; Run parameters
> > integrator = md ; leap-frog integrator
> > nsteps = 15000000 ; 15000000 * 0.002 ps = 30000 ps (30 ns)
> > dt = 0.002 ; 2 fs
> > ; Output control
> > nstxout = 0 ; suppress coordinate output to .trr
> > nstvout = 0 ; suppress velocity output to .trr
> > nstenergy = 10000 ; save energies every 20 ps
> > nstlog = 10000 ; update log file every 20 ps
> > nstxtcout = 15000 ; write .xtc trajectory every 30 ps
> > energygrps = Protein non-Protein
> > ; Bond parameters
> > continuation = yes ; continuing from a previous run
> > constraint_algorithm = lincs ; holonomic constraints
> > constraints = all-bonds ; all bonds (even heavy atom-H bonds) constrained
> > lincs_iter = 1 ; accuracy of LINCS
> > lincs_order = 4 ; also related to accuracy
> > ; Neighborsearching
> > ns_type = grid ; search neighboring grid cells
> > nstlist = 10 ; 20 fs
> > cutoff-scheme = Verlet
> > rlist = 1.0 ; short-range neighborlist cutoff (in nm)
> > rcoulomb = 1.0 ; short-range electrostatic cutoff (in nm)
> > rvdw = 1.0 ; short-range van der Waals cutoff (in nm)
> > ; Electrostatics
> > coulombtype = PME ; Particle Mesh Ewald for long-range electrostatics
> > pme_order = 4 ; cubic interpolation
> > fourierspacing = 0.16 ; grid spacing for FFT
> > ; Temperature coupling
> > tcoupl = V-rescale ; modified Berendsen thermostat
> > tc-grps = Protein non-Protein ; two coupling groups - more accurate
> > tau_t = 0.1 0.1 ; time constant, in ps
> > ref_t = 300 300 ; reference temperature, one for each group, in K
> > ; Pressure coupling
> > pcoupl = Parrinello-Rahman ; pressure coupling is on
> > pcoupltype = isotropic ; uniform scaling of box vectors
> > tau_p = 2.0 ; time constant, in ps
> > ref_p = 1.0 ; reference pressure, in bar
> > compressibility = 4.5e-5 ; isothermal compressibility of water, bar^-1
> > ; Periodic boundary conditions
> > pbc = xyz ; 3-D PBC
> > ; Dispersion correction
> > DispCorr = EnerPres ; account for cut-off vdW scheme
> > ; Velocity generation
> > gen_vel = no ; do not generate velocities (continuation run)
> >
>
>
> Kind regards,
> Carlos
> --
> Carlos Navarro Retamal
> Bioinformatic engineer
> Ph.D(c) in Applied Science, Universidad de Talca, Chile
> Center of Bioinformatics and Molecular Simulations (CBSM)
> Universidad de Talca
> 2 Norte 685, Casilla 721, Talca - Chile
> Phone: 56-71-201 798,
> Fax: 56-71-201 561
> Email: carlos.navarro87 at gmail.com (mailto:carlos.navarro87 at gmail.com) or cnavarro at utalca.cl (mailto:cnavarro at utalca.cl)
>