[gmx-users] low performance with 2 GTX 980 + Intel Core i7-5930K 3.5 GHz (2011-3)

Carlos Navarro Retamal carlos.navarro87 at gmail.com
Wed Dec 31 16:46:41 CET 2014


Dear everyone,
To check whether my workstation can handle bigger systems, I ran an MD simulation of a system of 265,175 atoms, but sadly this was its performance with one GPU:

>        P P   -   P M E   L O A D   B A L A N C I N G
>  
>  PP/PME load balancing changed the cut-off and PME settings:
>            particle-particle                    PME
>             rcoulomb  rlist            grid      spacing   1/beta
>    initial  1.400 nm  1.451 nm      96  96  84   0.156 nm  0.448 nm
>    final    1.464 nm  1.515 nm      84  84  80   0.167 nm  0.469 nm
>  cost-ratio           1.14             0.73
>  (note that these numbers concern only part of the total PP and PME load)
>  
>  
> M E G A - F L O P S   A C C O U N T I N G
>  
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>  
>  Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
>  NB VdW [V&F]                          9330.786612        9330.787     0.0
>  Pair Search distance check           60538.981664      544850.835     0.0
>  NxN Ewald Elec. + LJ [F]          23126654.798080  1526359216.673    96.9
>  NxN Ewald Elec. + LJ [V&F]          234136.147904    25052567.826     1.6
>  1,4 nonbonded interactions           13156.663128     1184099.682     0.1
>  Calc Weights                         39777.045525     1431973.639     0.1
>  Spread Q Bspline                    848576.971200     1697153.942     0.1
>  Gather F Bspline                    848576.971200     5091461.827     0.3
>  3D-FFT                             1079386.516464     8635092.132     0.5
>  Solve PME                              353.070736       22596.527     0.0
>  Shift-X                                331.733925        1990.404     0.0
>  Propers                              13320.966414     3050501.309     0.2
>  Impropers                              340.306806       70783.816     0.0
>  Virial                                1326.365220       23874.574     0.0
>  Stop-CM                                133.117850        1331.178     0.0
>  Calc-Ekin                             2652.280350       71611.569     0.0
>  Lincs                                 4966.549329      297992.960     0.0
>  Lincs-Mat                           111969.439344      447877.757     0.0
>  Constraint-V                         18222.114435      145776.915     0.0
>  Constraint-Vir                        1325.795106       31819.083     0.0
>  Settle                                2763.005259      892450.699     0.1
>  (null)                                 116.802336           0.000     0.0
> -----------------------------------------------------------------------------
>  Total                                              1575064354.133   100.0
> -----------------------------------------------------------------------------
>  
>  
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>  
> On 1 MPI rank, each using 12 OpenMP threads
>  
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>  Neighbor search        1   12       1251      27.117       1138.913   2.4
>  Launch GPU ops.        1   12      50001       5.444        228.653   0.5
>  Force                  1   12      50001     390.693      16409.109  34.0
>  PME mesh               1   12      50001     443.170      18613.138  38.5
>  Wait GPU local         1   12      50001       8.133        341.590   0.7
>  NB X/F buffer ops.     1   12      98751      30.272       1271.429   2.6
>  Write traj.            1   12         12       1.148         48.198   0.1
>  Update                 1   12      50001      63.980       2687.175   5.6
>  Constraints            1   12      50001     124.709       5237.788  10.8
>  Rest                                          55.169       2317.087   4.8
> -----------------------------------------------------------------------------
>  Total                                       1149.836      48293.079 100.0
> -----------------------------------------------------------------------------
>  Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
>  PME spread/gather      1   12     100002     358.298      15048.493  31.2
>  PME 3D-FFT             1   12     100002      78.270       3287.334   6.8
>  PME solve Elec         1   12      50001       6.221        261.268   0.5
> -----------------------------------------------------------------------------
>  
>  GPU timings
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)      ms/step       %
> -----------------------------------------------------------------------------
>  Pair list H2D                       1251       3.975        3.178     0.5
>  X / q H2D                          50001      36.248        0.725     4.6
>  Nonbonded F kernel                 45000     618.354       13.741    78.7
>  Nonbonded F+ene k.                  3750      72.721       19.392     9.3
>  Nonbonded F+ene+prune k.            1251      28.993       23.176     3.7
>  F D2H                              50001      25.267        0.505     3.2
> -----------------------------------------------------------------------------
>  Total                                        785.559       15.711   100.0
> -----------------------------------------------------------------------------
>  
> Force evaluation time GPU/CPU: 15.711 ms/16.677 ms = 0.942
> For optimal performance this ratio should be close to 1!
>  
>                Core t (s)   Wall t (s)        (%)
>        Time:    13663.176     1149.836     1188.3
>                  (ns/day)    (hour/ns)
> Performance:        7.514        3.194
> Finished mdrun on rank 0 Wed Dec 31 01:44:22 2014





I also noticed this at the beginning:

> step   80: timed with pme grid 96 96 84, coulomb cutoff 1.400: 3287.7 M-cycles
> step  160: timed with pme grid 84 84 80, coulomb cutoff 1.464: 3180.2 M-cycles
> step  240: timed with pme grid 72 72 72, coulomb cutoff 1.708: 3948.2 M-cycles
> step  320: timed with pme grid 96 96 84, coulomb cutoff 1.400: 3319.4 M-cycles
> step  400: timed with pme grid 96 96 80, coulomb cutoff 1.435: 3213.8 M-cycles
> step  480: timed with pme grid 84 84 80, coulomb cutoff 1.464: 3194.6 M-cycles
> step  560: timed with pme grid 80 80 80, coulomb cutoff 1.537: 3343.4 M-cycles
> step  640: timed with pme grid 80 80 72, coulomb cutoff 1.594: 3571.9 M-cycles
>               optimal pme grid 84 84 80, coulomb cutoff 1.464
>            Step           Time         Lambda
>            5000       10.00000        0.00000



and when I add the second graphics card, the performance drops further, to about 5-6 ns/day.
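
If it matters, my understanding from the documentation is that a single run can only use both cards when there is one PP rank per GPU, so on this machine a two-GPU launch should look roughly like the line below. The -ntmpi/-ntomp split and the output name are only my guess for the 6-core/12-thread i7-5930K, so please correct me if this is not the right way to map ranks to GPUs:

  mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -deffnm md_2gpu -v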

This performance seems really strange, because I get about ~5 ns/day on a different workstation with a GTX 770 and an i7-4770.
Is there something I'm missing regarding the correct use of a second GPU?
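
For running two independent simulations instead, is the layout below the intended way to keep the two mdrun processes on disjoint sets of cores? The offset of 6 with stride 1 is only my guess for the 6 physical / 12 logical cores of this CPU, and the run names are placeholders:

  mdrun -nt 6 -pin on -pinoffset 0 -pinstride 1 -gpu_id 0 -deffnm run1 &
  mdrun -nt 6 -pin on -pinoffset 6 -pinstride 1 -gpu_id 1 -deffnm run2 &
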
Kind regards and a happy new year to everyone,
Carlos
--  
Carlos Navarro Retamal
Bioinformatic engineer
Ph.D(c) in Applied Science, Universidad de Talca, Chile
Center of Bioinformatics and Molecular Simulations (CBSM)
Universidad de Talca
2 Norte 685, Casilla 721, Talca - Chile   
Teléfono: 56-71-201 798,  
Fax: 56-71-201 561
Email: carlos.navarro87 at gmail.com or cnavarro at utalca.cl


On Tuesday, December 30, 2014 at 4:58 PM, Carlos Navarro Retamal wrote:

> Dear Justin (and everyone)
> I tried using the -pin parameters as follows:
> mdrun -nt 6 -pin on -pinoffset 0 -gpu_id 0 -deffnm test1 &
> mdrun -nt 6 -pin on -pinoffset 7 -gpu_id 1 -deffnm test2 &
>  
> The performance increased a little (from ~17 ns/day to ~22 ns/day), but I got the same warning message:
>  
> > Force evaluation time GPU/CPU: 2.206 ms/4.462 ms = 0.494
> > For optimal performance this ratio should be close to 1!
> >  
> >  
> > NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> >       performance loss.
> >  
> >                Core t (s)   Wall t (s)        (%)
> >        Time:     2301.995      386.857      595.1
> >                  (ns/day)    (hour/ns)
> > Performance:       22.334        1.075
> > Finished mdrun on rank 0 Tue Dec 30 16:49:14 2014
> >  
>  
>  
>  
> Looking at the NVIDIA settings, I also noticed that when I run only one mdrun process, the graphics card handling it shows ~50% utilization, but when I launch a second mdrun instance, the utilization drops to about ~30% on each card (I have set the maximum-performance mode).
> Do you think this may be the problem? And if this is the issue, is there a way to solve it?
> Kind regards,
> Carlos
>  
>  
>  
> On Tuesday, December 30, 2014 at 4:03 PM, Carlos Navarro Retamal wrote:
>  
> > Dear Justin,
> > Thanks a lot for your reply.
> > I tried with another system (~130k atoms) with the same result: 1 GPU outperforms 2 GPUs.
> > In any case, I'll read the documentation you mentioned in your previous reply (and hopefully I'll be able to run two processes simultaneously).
> > Kind regards,
> > Carlos
> >  
> >  
> >  
> >  
> > On Tuesday, December 30, 2014 at 3:40 PM, Justin Lemkul wrote:
> >  
> > >  
> > >  
> > > On 12/30/14 1:38 PM, Carlos Navarro Retamal wrote:
> > > > Dear Justin,
> > > > Thanks a lot for your reply.
> > > >  
> > > > > You can use the multiple cards to run
> > > > > concurrent simulations on each (provided cooling is adequate to do this).
> > > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > > > I tried that. I launched two simulations at the same time, but at the end of each one I got the following warning:
> > > >  
> > > > > Force evaluation time GPU/CPU: 3.177 ms/5.804 ms = 0.547
> > > > > For optimal performance this ratio should be close to 1!
> > > > >  
> > > > >  
> > > > > NOTE: The GPU has >25% less load than the CPU. This imbalance causes
> > > > > performance loss.
> > > > >  
> > > >  
> > > >  
> > > > and I got really low performance (~17 ns/day each)
> > > > using the following commands:
> > > >  
> > > > > mdrun -deffnm test1 -gpu_id 0 -v
> > > > > mdrun -deffnm test2 -gpu_id 1 -v
> > > > >  
> > > >  
> > > >  
> > > >  
> > > >  
> > > > Is there a better way to run multiple MD simulations at the same time?
> > >  
> > > In this case, both GPUs are probably fighting for CPU resources (note how the CPU  
> > > force evaluation is the limiting factor). You'll need to set -pin and  
> > > -pinoffset suitably, IIRC. See the discussion at
> > >  
> > > http://www.gromacs.org/Documentation/Acceleration_and_parallelization
> > >  
> > > -Justin
> > >  
> > > --  
> > > ==================================================
> > >  
> > > Justin A. Lemkul, Ph.D.
> > > Ruth L. Kirschstein NRSA Postdoctoral Fellow
> > >  
> > > Department of Pharmaceutical Sciences
> > > School of Pharmacy
> > > Health Sciences Facility II, Room 629
> > > University of Maryland, Baltimore
> > > 20 Penn St.
> > > Baltimore, MD 21201
> > >  
> > > jalemkul at outerbanks.umaryland.edu (mailto:jalemkul at outerbanks.umaryland.edu) | (410) 706-7441
> > > http://mackerell.umaryland.edu/~jalemkul
> > >  
> > > ==================================================
> > >  
> > >  
> >  
> >  
>  


