[gmx-users] GPU slower than I7

Roland Schulz roland at utk.edu
Thu Oct 21 22:56:02 CEST 2010


On Thu, Oct 21, 2010 at 3:18 PM, Renato Freitas <renatoffs at gmail.com> wrote:

> Hi gromacs users,
>
> I have installed the latest version of gromacs (4.5.1) on an i7 980X
> (6 cores, or 12 with HT on; 3.3 GHz) with 12 GB of RAM and compiled its
> MPI version. I also compiled the GPU-accelerated
> version of gromacs. Then I ran a 2 ns simulation of a small system
> (11042 atoms) to compare the performance of mdrun-gpu vs mdrun_mpi.
> The results that I got are below:
>
> ############################################
> My *.mdp is:
>
> constraints         =  all-bonds
> integrator          =  md
> dt                  =  0.002    ; ps !
> nsteps              =  1000000  ; total 2000 ps.
> nstlist             =  10
> ns_type             =  grid
> coulombtype    = PME
> rvdw                = 0.9
> rlist               = 0.9
> rcoulomb            = 0.9
> fourierspacing      = 0.10
> pme_order           = 4
> ewald_rtol          = 1e-5
> vdwtype             =  cut-off
> pbc                 =  xyz
> epsilon_rf    =  0
> comm_mode           =  linear
> nstxout             =  1000
> nstvout             =  0
> nstfout             =  0
> nstxtcout           =  1000
> nstlog              =  1000
> nstenergy           =  1000
> ; Berendsen temperature coupling is on in four groups
> tcoupl              = berendsen
> tc-grps             = system
> tau-t               = 0.1
> ref-t               = 298
> ; Pressure coupling is on
> Pcoupl = berendsen
> pcoupltype = isotropic
> tau_p = 0.5
> compressibility = 4.5e-5
> ref_p = 1.0
> ; Generate velocities is on at 298 K.
> gen_vel = no
>
> ########################
> RUNNING GROMACS ON GPU
>
> mdrun-gpu -s topol.tpr -v > & out &
>
> Here is a part of the md.log:
>
> Started mdrun on node 0 Wed Oct 20 09:52:09 2010
> .
> .
> .
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes     Number      G-Cycles     Seconds        %
> ------------------------------------------------------------------------
>  Write traj.            1       1021       106.075        31.7      0.2
>  Rest                   1                64125.577     19178.6     99.8
> ------------------------------------------------------------------------
>  Total                  1                64231.652     19210.3    100.0
> ------------------------------------------------------------------------
>
>                 NODE (s)     Real (s)      (%)
>       Time:     6381.840    19210.349     33.2
>                 1h46:21
>                 (Mnbf/s)    (MFlops)    (ns/day)    (hour/ns)
> Performance:       0.000       0.001      27.077        0.886
>
> Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
>
> ########################
> RUNNING GROMACS ON MPI
>
> mpirun -np 6 mdrun_mpi -s topol.tpr -npme 3 -v > & out &
>
> Here is a part of the md.log:
>
> Started mdrun on node 0 Wed Oct 20 18:30:52 2010
>
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:            Nodes     Number      G-Cycles     Seconds        %
> ---------------------------------------------------------------------------
>  Domain decomp.            3     100001      1452.166       434.7      0.6
>  DD comm. load             3      10001         0.745         0.2      0.0
>  Send X to PME             3    1000001       249.003        74.5      0.1
>  Comm. coord.              3    1000001       637.329       190.8      0.3
>  Neighbor search           3     100001      8738.669      2616.0      3.5
>  Force                     3    1000001     99210.202     29699.2     39.2
>  Wait + Comm. F            3    1000001      3361.591      1006.3      1.3
>  PME mesh                  3    1000001     66189.554     19814.2     26.2
>  Wait + Comm. X/F          3                60294.513     18049.5     23.8
>  Wait + Recv. PME F        3    1000001       801.897       240.1      0.3
>  Write traj.               3       1015        33.464        10.0      0.0
>  Update                    3    1000001      3295.820       986.6      1.3
>  Constraints               3    1000001      6317.568      1891.2      2.5
>  Comm. energies            3     100002        70.784        21.2      0.0
>  Rest                      3                 2314.844       693.0      0.9
> ---------------------------------------------------------------------------
>  Total                     6               252968.148     75727.5    100.0
> ---------------------------------------------------------------------------
>
> ---------------------------------------------------------------------------
>  PME redist. X/F           3    2000002      1945.551       582.4      0.8
>  PME spread/gather         3    2000002     37219.607     11141.9     14.7
>  PME 3D-FFT                3    2000002     21453.362      6422.2      8.5
>  PME solve                 3    1000001      5551.056      1661.7      2.2
> ---------------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
>                 NODE (s)      Real (s)      (%)
>       Time:    12621.257     12621.257    100.0
>                 3h30:21
>                 (Mnbf/s)    (GFlops)    (ns/day)    (hour/ns)
> Performance:     388.633      28.773      13.691        1.753
> Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
>
> ######################################
> Comparing the performance values for the two simulations, I saw that in
> "numeric terms" the simulation using the GPU gave (for example) ~27
> ns/day, while with MPI this value is approximately half (13.7
> ns/day).
> However, when I compared the times at which each simulation
> started/finished, the MPI simulation took 211 minutes while the
> GPU simulation took 320 minutes to finish.
>
> My questions are:
>
> 1. Why do the performance values show better results with the GPU?
>
Your CPU version can probably be optimized a bit. You should use HT and run
on 12 threads. Make sure the PME/PP load is balanced and use the best
rlist/fourier_spacing ratio. Also, your PME accuracy is rather high; make sure
you actually need that (a fourier spacing of 0.11 should be accurate enough
for an rlist of 0.9). Your PME nodes spent 23% of their time waiting on the
PP nodes.
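
For instance (just a sketch, not benchmarked on your system; the -npme values
and the 0.11 spacing are starting points you would have to test yourself):

mpirun -np 12 mdrun_mpi -s topol.tpr -npme 4 -v >& out &
mpirun -np 12 mdrun_mpi -s topol.tpr -npme 3 -v >& out &

and in the .mdp, a slightly coarser PME grid with the same cut-offs:

fourierspacing   = 0.11    ; instead of 0.10
rlist            = 0.9
rcoulomb         = 0.9

If your installation includes g_tune_pme (it was added in 4.5), it can do this
kind of scan over -npme for you.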

>
> 2. Why was the simulation running on the GPU 109 min slower than on 6
> cores, given that my video card is a GTX 480 with 480 GPU cores? I was
> expecting that the GPU would greatly accelerate the simulations.
>
The output you posted says the GPU version was faster (running for only
106 min). The CPU cores are much more powerful; I would expect them to be
about as fast as the GPU.
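
Just to spell out where the two ns/day figures come from, using only the
numbers in your logs: the GPU log's 27.077 ns/day is derived from the NODE
time, while the MPI log's 13.691 ns/day is based on the wall clock:

2 ns / (6381.840 s  / 86400 s per day) ~ 27.1 ns/day  (GPU, NODE time, 1h46)
2 ns / (19210.349 s / 86400 s per day) ~  9.0 ns/day  (GPU, Real time, ~5h20)
2 ns / (12621.257 s / 86400 s per day) ~ 13.7 ns/day  (MPI, wall clock, 3h30)

So the two reported rates are not measured against the same timer.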

Roland