[gmx-users] GPU slower than I7
Renato Freitas
renatoffs at gmail.com
Thu Oct 21 23:53:50 CEST 2010
Thanks Roland. I will do a new test with the fourier spacing set to 0.11
(see the P.S. below for exactly what I plan to change). However, regarding
the performance of the GPU versus the CPU (MPI) run, let me try to
explain it better:
The simulation using GROMACS with the GPU started and finished at:
Started mdrun on node 0 Wed Oct 20 09:52:09 2010
Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
Total time = 320 min
The simulation using GROMACS with MPI started and finished at:
Started mdrun on node 0 Wed Oct 20 18:30:52 2010
Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
Total time = 211 min
Based on these numbers, it was the CPU run with MPI that was faster than
the GPU run, by approximately 109 min. But looking at the end of each
output I have:
GPU
NODE (s) Real (s) (%)
Time: 6381.840 19210.349 33.2
1h46:21
(Mnbf/s) (MFlops) (ns/day) (hour/ns)
Performance: 0.000 0.001 27.077 0.886
MPI
NODE (s) Real (s) (%)
Time: 12621.257 12621.257 100.0
3h30:21
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 388.633 28.773 13.691 1.753
Looking above we can see that GROMACS prints in the output that the
simulation is faster when the GPU is used. But this is not what actually
happened: the MPI simulation finished about 109 min before the GPU one.
Does this seem correct to you? As I said before, I was expecting the GPU
to take less time than the 6-core MPI run.
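Just to show where these numbers come from: the run is 1,000,000 steps x
2 fs = 2 ns, and the reported ns/day appears to be computed from the NODE
time rather than the wall-clock ("Real") time. For the GPU run,
2 ns / 6381.8 s * 86400 s/day = 27.1 ns/day, which is what gets printed,
but based on the Real time of 19210.3 s it is only
2 / 19210.3 * 86400 = 9.0 ns/day. For the MPI run the NODE and Real times
are equal (12621.3 s), giving the 13.7 ns/day either way, so on
wall-clock time the MPI run really was faster.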
Thanks,
Renato
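
P.S. For the new test I plan to change only the following, keeping the
rest of the .mdp quoted below the same (the -npme value for a 12-thread
run is just a first guess that I still need to tune):

fourierspacing = 0.11    ; was 0.10

mpirun -np 12 mdrun_mpi -s topol.tpr -npme 4 -v >& out &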
2010/10/21 Roland Schulz <roland at utk.edu>:
>
>
> On Thu, Oct 21, 2010 at 3:18 PM, Renato Freitas <renatoffs at gmail.com> wrote:
>>
>> Hi gromacs users,
>>
>> I have installed the latest version of GROMACS (4.5.1) on an i7 980X
>> (6 cores, or 12 with HT on; 3.3 GHz) with 12 GB of RAM and compiled its
>> MPI version. I also compiled the GPU-accelerated version of GROMACS.
>> Then I ran a 2 ns simulation of a small system (11042 atoms) to compare
>> the performance of mdrun-gpu vs mdrun_mpi. The results I got are below:
>>
>> ############################################
>> My *.mdp is:
>>
>> constraints = all-bonds
>> integrator = md
>> dt = 0.002 ; ps !
>> nsteps = 1000000 ; total 2000 ps.
>> nstlist = 10
>> ns_type = grid
>> coulombtype = PME
>> rvdw = 0.9
>> rlist = 0.9
>> rcoulomb = 0.9
>> fourierspacing = 0.10
>> pme_order = 4
>> ewald_rtol = 1e-5
>> vdwtype = cut-off
>> pbc = xyz
>> epsilon_rf = 0
>> comm_mode = linear
>> nstxout = 1000
>> nstvout = 0
>> nstfout = 0
>> nstxtcout = 1000
>> nstlog = 1000
>> nstenergy = 1000
>> ; Berendsen temperature coupling is on in four groups
>> tcoupl = berendsen
>> tc-grps = system
>> tau-t = 0.1
>> ref-t = 298
>> ; Pressure coupling is on
>> Pcoupl = berendsen
>> pcoupltype = isotropic
>> tau_p = 0.5
>> compressibility = 4.5e-5
>> ref_p = 1.0
>> ; Generate velocities is on at 298 K.
>> gen_vel = no
>>
>> ########################
>> RUNNING GROMACS ON GPU
>>
>> mdrun-gpu -s topol.tpr -v >& out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 09:52:09 2010
>> .
>> .
>> .
>> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>> Computing:         Nodes     Number     G-Cycles    Seconds      %
>> ------------------------------------------------------------------
>>  Write traj.           1       1021      106.075       31.7    0.2
>>  Rest                  1                64125.577    19178.6   99.8
>> ------------------------------------------------------------------
>>  Total                 1                64231.652    19210.3  100.0
>> ------------------------------------------------------------------
>>
>> NODE (s) Real (s) (%)
>> Time: 6381.840 19210.349 33.2
>> 1h46:21
>> (Mnbf/s) (MFlops) (ns/day) (hour/ns)
>> Performance: 0.000 0.001 27.077 0.886
>>
>> Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
>>
>> ########################
>> RUNNING GROMACS ON MPI
>>
>> mpirun -np 6 mdrun_mpi -s topol.tpr -npme 3 -v >& out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 18:30:52 2010
>>
>> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>> Computing:          Nodes     Number     G-Cycles    Seconds      %
>> --------------------------------------------------------------------
>>  Domain decomp.         3     100001     1452.166      434.7     0.6
>>  DD comm. load          3      10001        0.745        0.2     0.0
>>  Send X to PME          3    1000001      249.003       74.5     0.1
>>  Comm. coord.           3    1000001      637.329      190.8     0.3
>>  Neighbor search        3     100001     8738.669     2616.0     3.5
>>  Force                  3    1000001    99210.202    29699.2    39.2
>>  Wait + Comm. F         3    1000001     3361.591     1006.3     1.3
>>  PME mesh               3    1000001    66189.554    19814.2    26.2
>>  Wait + Comm. X/F       3               60294.513     8049.5    23.8
>>  Wait + Recv. PME F     3    1000001      801.897      240.1     0.3
>>  Write traj.            3       1015       33.464       10.0     0.0
>>  Update                 3    1000001     3295.820      986.6     1.3
>>  Constraints            3    1000001     6317.568     1891.2     2.5
>>  Comm. energies         3     100002       70.784       21.2     0.0
>>  Rest                   3                2314.844      693.0     0.9
>> --------------------------------------------------------------------
>>  Total                  6              252968.148    75727.5   100.0
>> --------------------------------------------------------------------
>>
>> --------------------------------------------------------------------
>>  PME redist. X/F        3    2000002     1945.551      582.4     0.8
>>  PME spread/gather      3    2000002    37219.607    11141.9    14.7
>>  PME 3D-FFT             3    2000002    21453.362     6422.2     8.5
>>  PME solve              3    1000001     5551.056     1661.7     2.2
>> --------------------------------------------------------------------
>>
>> Parallel run - timing based on wallclock.
>>
>> NODE (s) Real (s) (%)
>> Time: 12621.257 12621.257 100.0
>> 3h30:21
>> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
>> Performance: 388.633 28.773 13.691 1.753
>> Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
>>
>> ######################################
>> Comparing the performance values for the two simulations, I saw that in
>> "numeric terms" the simulation using the GPU gave (for example) ~27
>> ns/day, while with MPI this value is approximately half (13.7 ns/day).
>> However, when I compared the times at which each simulation started and
>> finished, the MPI simulation took 211 minutes while the GPU simulation
>> took 320 minutes to finish.
>>
>> My questions are:
>>
>> 1. Why do the performance values show better results for the GPU?
>
> Your CPU run can probably be optimized a bit. You should use HT and run
> on 12 threads. Make sure PME/PP is balanced and use the best
> rlist/fourier_spacing ratio. Also, your PME accuracy is rather high; make
> sure you need that (a fourier spacing of 0.11 should be accurate enough
> for an rlist of 0.9). Your PME nodes spent 23% of their time waiting on
> the PP nodes.
>>
>> 2. Why was the simulation running on the GPU 109 min slower than on 6
>> cores, given that my video card is a GTX 480 with 480 GPU cores? I was
>> expecting the GPU to greatly accelerate the simulations.
>
> The output you posted says the GPU version was faster (running for only
> 106 min). The CPU cores are much more powerful; I would expect them to
> be about as fast as the GPU.
> Roland