[gmx-users] GPU slower than I7

Renato Freitas renatoffs at gmail.com
Fri Oct 22 21:20:55 CEST 2010


Hi Szilárd,

Thanks for your explanation. Do you know whether any improvements to the
PME algorithms are planned that would take full advantage of GPU video
cards?

Do you think the difference between the "NODE" and "Real" times could be
caused by a compilation problem in mdrun-gpu? I'm asking even though I
didn't get any errors during compilation.
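
As a rough check on where the two ns/day figures come from, here is a
minimal Python sketch (not GROMACS code; the helper name ns_per_day is
made up for illustration). It assumes the reported ns/day is simply the
simulated time divided by the elapsed time, and recomputes it from the
NODE and Real times in the md.log excerpts quoted below.

```python
SECONDS_PER_DAY = 86400.0


def ns_per_day(simulated_ns, elapsed_s):
    """Simulated nanoseconds per day of elapsed time (hypothetical helper)."""
    return simulated_ns * SECONDS_PER_DAY / elapsed_s


# nsteps * dt from the .mdp file: 1,000,000 steps * 0.002 ps = 2000 ps = 2 ns
simulated_ns = 1_000_000 * 0.002 / 1000.0

gpu_node_s = 6381.840    # "NODE (s)" in the mdrun-gpu log
gpu_real_s = 19210.349   # "Real (s)" in the mdrun-gpu log
mpi_real_s = 12621.257   # NODE and Real are identical in the mdrun_mpi log

print(f"GPU, NODE time: {ns_per_day(simulated_ns, gpu_node_s):.3f} ns/day")
print(f"GPU, Real time: {ns_per_day(simulated_ns, gpu_real_s):.3f} ns/day")
print(f"MPI, Real time: {ns_per_day(simulated_ns, mpi_real_s):.3f} ns/day")
```

With the numbers from the logs this gives roughly 27.08 ns/day from the
GPU NODE time (the figure mdrun-gpu reports), but only about 9.0 ns/day
from the GPU Real time, against 13.69 ns/day for the MPI run. That would
be consistent with the GPU run finishing later on the wall clock.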

Thanks,

Renato

2010/10/22 Szilárd Páll <szilard.pall at cbr.su.se>:
> Hi Renato,
>
> First of all, what you're seeing is pretty normal, especially given that
> you have a CPU that is crossing the border of insane :) Why is it normal?
> The PME algorithms are simply not well suited to current GPU
> architectures. With an ill-suited algorithm you won't be able to see the
> speedups you can often see in other application areas, even more so
> when you're comparing against Gromacs on an i7 980X. For more info +
> benchmarks see the Gromacs-GPU page:
> http://www.gromacs.org/gpu
>
> However, there is one strange thing that you also pointed out. The fact
> that the "NODE" and "Real" times in your mdrun-gpu timing summary are
> not the same, but differ by a factor of 3, is _very_ unusual. I've run
> mdrun-gpu on quite a wide variety of hardware, but I've never seen
> those two counters deviate. It might be an artifact of the cycle
> counters used internally behaving in an unusual way on your CPU.
>
> One other thing I should point out is that you would be better off
> using the standard mdrun, which in 4.5 has thread support by default
> and therefore will run on a single CPU/node without MPI!
>
> Cheers,
> --
> Szilárd
>
>
>
> On Thu, Oct 21, 2010 at 9:18 PM, Renato Freitas <renatoffs at gmail.com> wrote:
>> Hi gromacs users,
>>
>> I have installed the latest version of gromacs (4.5.1) on an i7 980X
>> (6 cores, or 12 with HT on; 3.3 GHz) with 12 GB of RAM and compiled its
>> MPI version. I also compiled the GPU-accelerated
>> version of gromacs. Then I ran a 2 ns simulation of a small system
>> (11042 atoms) to compare the performance of mdrun-gpu vs mdrun_mpi.
>> The results I got are below:
>>
>> ############################################
>> My *.mdp is:
>>
>> constraints         =  all-bonds
>> integrator          =  md
>> dt                  =  0.002    ; ps !
>> nsteps              =  1000000  ; total 2000 ps.
>> nstlist             =  10
>> ns_type             =  grid
>> coulombtype    = PME
>> rvdw                = 0.9
>> rlist               = 0.9
>> rcoulomb            = 0.9
>> fourierspacing      = 0.10
>> pme_order           = 4
>> ewald_rtol          = 1e-5
>> vdwtype             =  cut-off
>> pbc                 =  xyz
>> epsilon_rf    =  0
>> comm_mode           =  linear
>> nstxout             =  1000
>> nstvout             =  0
>> nstfout             =  0
>> nstxtcout           =  1000
>> nstlog              =  1000
>> nstenergy           =  1000
>> ; Berendsen temperature coupling is on for one group (system)
>> tcoupl              = berendsen
>> tc-grps             = system
>> tau-t               = 0.1
>> ref-t               = 298
>> ; Pressure coupling is on
>> Pcoupl = berendsen
>> pcoupltype = isotropic
>> tau_p = 0.5
>> compressibility = 4.5e-5
>> ref_p = 1.0
>> ; Velocity generation is off
>> gen_vel = no
>>
>> ########################
>> RUNNING GROMACS ON GPU
>>
>> mdrun-gpu -s topol.tpr -v > & out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 09:52:09 2010
>> .
>> .
>> .
>>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>  Computing:          Nodes    Number     G-Cycles    Seconds       %
>> --------------------------------------------------------------------
>>  Write traj.             1      1021      106.075       31.7     0.2
>>  Rest                    1              64125.577    19178.6    99.8
>> --------------------------------------------------------------------
>>  Total                   1              64231.652    19210.3   100.0
>> --------------------------------------------------------------------
>>
>>                NODE (s)   Real (s)      (%)
>>        Time:   6381.840  19210.349     33.2
>>                        1h46:21
>>                (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
>> Performance:      0.000      0.001     27.077      0.886
>>
>> Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
>>
>> ########################
>> RUNNING GROMACS ON MPI
>>
>> mpirun -np 6 mdrun_mpi -s topol.tpr -npme 3 -v > & out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 18:30:52 2010
>>
>>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>  Computing:          Nodes    Number     G-Cycles    Seconds       %
>> --------------------------------------------------------------------
>>  Domain decomp.          3    100001     1452.166      434.7     0.6
>>  DD comm. load           3     10001        0.745        0.2     0.0
>>  Send X to PME           3   1000001      249.003       74.5     0.1
>>  Comm. coord.            3   1000001      637.329      190.8     0.3
>>  Neighbor search         3    100001     8738.669     2616.0     3.5
>>  Force                   3   1000001    99210.202    29699.2    39.2
>>  Wait + Comm. F          3   1000001     3361.591     1006.3     1.3
>>  PME mesh                3   1000001    66189.554    19814.2    26.2
>>  Wait + Comm. X/F        3              60294.513    18049.5    23.8
>>  Wait + Recv. PME F      3   1000001      801.897      240.1     0.3
>>  Write traj.             3      1015       33.464       10.0     0.0
>>  Update                  3   1000001     3295.820      986.6     1.3
>>  Constraints             3   1000001     6317.568     1891.2     2.5
>>  Comm. energies          3    100002       70.784       21.2     0.0
>>  Rest                    3               2314.844      693.0     0.9
>> --------------------------------------------------------------------
>>  Total                   6             252968.148    75727.5   100.0
>> --------------------------------------------------------------------
>> --------------------------------------------------------------------
>>  PME redist. X/F         3   2000002     1945.551      582.4     0.8
>>  PME spread/gather       3   2000002    37219.607    11141.9    14.7
>>  PME 3D-FFT              3   2000002    21453.362     6422.2     8.5
>>  PME solve               3   1000001     5551.056     1661.7     2.2
>> --------------------------------------------------------------------
>>
>> Parallel run - timing based on wallclock.
>>
>>                NODE (s)   Real (s)      (%)
>>        Time:  12621.257  12621.257    100.0
>>                        3h30:21
>>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
>> Performance:    388.633     28.773     13.691      1.753
>> Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
>>
>> ######################################
>> Comparing the performance values for the two simulations, I saw that in
>> "numeric terms" the simulation using the GPU gave (for example) ~27
>> ns/day, while with MPI this value is approximately half (13.7
>> ns/day).
>> However, when I compared the times at which each simulation
>> started/finished, the simulation using MPI took 211 minutes while the
>> GPU simulation took 320 minutes to finish.
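
The wall-clock durations can be recomputed from the "Started mdrun" and
"Finished mdrun" lines of the two logs above. This is a minimal Python
sketch; the helper name elapsed_minutes is made up for illustration, and
it assumes English day and month names in the timestamps.

```python
from datetime import datetime

# Timestamp layout used in the md.log lines, e.g. "Wed Oct 20 09:52:09 2010"
FMT = "%a %b %d %H:%M:%S %Y"


def elapsed_minutes(started, finished):
    """Minutes between two md.log timestamps (hypothetical helper)."""
    delta = datetime.strptime(finished, FMT) - datetime.strptime(started, FMT)
    return delta.total_seconds() / 60.0


gpu_min = elapsed_minutes("Wed Oct 20 09:52:09 2010", "Wed Oct 20 15:12:19 2010")
mpi_min = elapsed_minutes("Wed Oct 20 18:30:52 2010", "Wed Oct 20 22:01:14 2010")

print(f"mdrun-gpu: {gpu_min:.1f} min")
print(f"mdrun_mpi: {mpi_min:.1f} min")
print(f"difference: {gpu_min - mpi_min:.1f} min")
```

This comes out at about 320 and 210 minutes, in line with the "Real (s)"
values of roughly 19210 s and 12621 s in the two timing summaries.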
>>
>> My questions are:
>>
>> 1. Why do the performance values show better results for the GPU?
>>
>> 2. Why was the simulation running on the GPU 109 min slower than on 6
>> cores, given that my video card is a GTX 480 with 480 GPU cores? I was
>> expecting the GPU to greatly accelerate the simulations.
>>
>>
>> Does anyone have some idea?
>>
>> Thanks,
>>
>> Renato
>> --
>> gmx-users mailing list    gmx-users at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>> Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-users-request at gromacs.org.
>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>