[gmx-users] GPU slower than I7
Renato Freitas
renatoffs at gmail.com
Thu Oct 21 23:53:50 CEST 2010
Thanks Roland. I will do a new test with the fourier spacing set to 0.11
(see the P.S. below for exactly what I plan to change). However, regarding
the performance of the GPU versus the CPU (MPI) run, let me try to
explain it better:
The simulation using GROMACS with the GPU started and finished at:
Started mdrun on node 0 Wed Oct 20 09:52:09 2010
Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
Total time = 320 min
The simulation using GROMACS with MPI started and finished at:
Started mdrun on node 0 Wed Oct 20 18:30:52 2010
Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
Total time = 211 min
Based on these numbers, it was the CPU run with MPI that was faster than
the GPU run, by approximately 109 min. But looking at the end of each
output I have:
GPU
NODE (s) Real (s) (%)
Time: 6381.840 19210.349 33.2
1h46:21
(Mnbf/s) (MFlops) (ns/day) (hour/ns)
Performance: 0.000 0.001 27.077 0.886
MPI
NODE (s) Real (s) (%)
Time: 12621.257 12621.257 100.0
3h30:21
(Mnbf/s) (GFlops) (ns/day) (hour/ns)
Performance: 388.633 28.773 13.691 1.753
Looking above we can see that GROMACS prints in the output that the
simulation is faster when the GPU is used. But this is not what actually
happened: the MPI simulation finished about 109 min before the GPU one.
Does this seem correct to you? As I said before, I was expecting the GPU
to take less time than the 6-core MPI run.
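Just to show where these numbers come from: the run is 1,000,000 steps x
2 fs = 2 ns, and the reported ns/day appears to be computed from the NODE
time rather than the wall-clock ("Real") time. For the GPU run,
2 ns / 6381.8 s * 86400 s/day = 27.1 ns/day, which is what gets printed,
but based on the Real time of 19210.3 s it is only
2 / 19210.3 * 86400 = 9.0 ns/day. For the MPI run the NODE and Real times
are equal (12621.3 s), giving the 13.7 ns/day either way, so on
wall-clock time the MPI run really was faster.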
Thanks,
Renato
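
P.S. For the new test I plan to change only the following, keeping the
rest of the .mdp quoted below the same (the -npme value for a 12-thread
run is just a first guess that I still need to tune):

fourierspacing = 0.11    ; was 0.10

mpirun -np 12 mdrun_mpi -s topol.tpr -npme 4 -v >& out &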
2010/10/21 Roland Schulz <roland at utk.edu>:
>
>
> On Thu, Oct 21, 2010 at 3:18 PM, Renato Freitas <renatoffs at gmail.com> wrote:
>>
>> Hi gromacs users,
>>
>> I have installed the latest version of GROMACS (4.5.1) on an i7 980X
>> (6 cores, or 12 with HT on; 3.3 GHz) with 12 GB of RAM and compiled its
>> MPI version. I also compiled the GPU-accelerated version of GROMACS.
>> Then I ran a 2 ns simulation of a small system (11042 atoms) to compare
>> the performance of mdrun-gpu vs mdrun_mpi. The results I got are below:
>>
>> ############################################
>> My *.mdp is:
>>
>> constraints = all-bonds
>> integrator = md
>> dt = 0.002 ; ps !
>> nsteps = 1000000 ; total 2000 ps.
>> nstlist = 10
>> ns_type = grid
>> coulombtype = PME
>> rvdw = 0.9
>> rlist = 0.9
>> rcoulomb = 0.9
>> fourierspacing = 0.10
>> pme_order = 4
>> ewald_rtol = 1e-5
>> vdwtype = cut-off
>> pbc = xyz
>> epsilon_rf = 0
>> comm_mode = linear
>> nstxout = 1000
>> nstvout = 0
>> nstfout = 0
>> nstxtcout = 1000
>> nstlog = 1000
>> nstenergy = 1000
>> ; Berendsen temperature coupling is on in four groups
>> tcoupl = berendsen
>> tc-grps = system
>> tau-t = 0.1
>> ref-t = 298
>> ; Pressure coupling is on
>> Pcoupl = berendsen
>> pcoupltype = isotropic
>> tau_p = 0.5
>> compressibility = 4.5e-5
>> ref_p = 1.0
>> ; Generate velocities is on at 298 K.
>> gen_vel = no
>>
>> ########################
>> RUNNING GROMACS ON GPU
>>
>> mdrun-gpu -s topol.tpr -v >& out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 09:52:09 2010
>> .
>> .
>> .
>> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>> Computing:         Nodes     Number     G-Cycles    Seconds      %
>> ------------------------------------------------------------------
>>  Write traj.           1       1021      106.075       31.7    0.2
>>  Rest                  1                64125.577    19178.6   99.8
>> ------------------------------------------------------------------
>>  Total                 1                64231.652    19210.3  100.0
>> ------------------------------------------------------------------
>>
>> NODE (s) Real (s) (%)
>> Time: 6381.840 19210.349 33.2
>> 1h46:21
>> (Mnbf/s) (MFlops) (ns/day) (hour/ns)
>> Performance: 0.000 0.001 27.077 0.886
>>
>> Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
>>
>> ########################
>> RUNNING GROMACS ON MPI
>>
>> mpirun -np 6 mdrun_mpi -s topol.tpr -npme 3 -v >& out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 18:30:52 2010
>>
>> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>> Computing:          Nodes     Number     G-Cycles    Seconds      %
>> --------------------------------------------------------------------
>>  Domain decomp.         3     100001     1452.166      434.7     0.6
>>  DD comm. load          3      10001        0.745        0.2     0.0
>>  Send X to PME          3    1000001      249.003       74.5     0.1
>>  Comm. coord.           3    1000001      637.329      190.8     0.3
>>  Neighbor search        3     100001     8738.669     2616.0     3.5
>>  Force                  3    1000001    99210.202    29699.2    39.2
>>  Wait + Comm. F         3    1000001     3361.591     1006.3     1.3
>>  PME mesh               3    1000001    66189.554    19814.2    26.2
>>  Wait + Comm. X/F       3               60294.513     8049.5    23.8
>>  Wait + Recv. PME F     3    1000001      801.897      240.1     0.3
>>  Write traj.            3       1015       33.464       10.0     0.0
>>  Update                 3    1000001     3295.820      986.6     1.3
>>  Constraints            3    1000001     6317.568     1891.2     2.5
>>  Comm. energies         3     100002       70.784       21.2     0.0
>>  Rest                   3                2314.844      693.0     0.9
>> --------------------------------------------------------------------
>>  Total                  6              252968.148    75727.5   100.0
>> --------------------------------------------------------------------
>>
>> --------------------------------------------------------------------
>>  PME redist. X/F        3    2000002     1945.551      582.4     0.8
>>  PME spread/gather      3    2000002    37219.607    11141.9    14.7
>>  PME 3D-FFT             3    2000002    21453.362     6422.2     8.5
>>  PME solve              3    1000001     5551.056     1661.7     2.2
>> --------------------------------------------------------------------
>>
>> Parallel run - timing based on wallclock.
>>
>> NODE (s) Real (s) (%)
>> Time: 12621.257 12621.257 100.0
>> 3h30:21
>> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
>> Performance: 388.633 28.773 13.691 1.753
>> Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
>>
>> ######################################
>> Comparing the performance values for the two simulations, I saw that in
>> "numeric terms" the simulation using the GPU gave (for example) ~27
>> ns/day, while with MPI this value is approximately half (13.7 ns/day).
>> However, when I compared the times at which each simulation started and
>> finished, the MPI simulation took 211 minutes while the GPU simulation
>> took 320 minutes to finish.
>>
>> My questions are:
>>
>> 1. Why do the performance values show better results for the GPU?
>
> Your CPU run can probably be optimized a bit. You should use HT and run
> on 12 threads. Make sure PME/PP is balanced and use the best
> rlist/fourier_spacing ratio. Also, your PME accuracy is rather high; make
> sure you need that (a fourier spacing of 0.11 should be accurate enough
> for an rlist of 0.9). Your PME nodes spent 23% of their time waiting on
> the PP nodes.
>>
>> 2. Why was the simulation running on the GPU 109 min slower than on 6
>> cores, given that my video card is a GTX 480 with 480 GPU cores? I was
>> expecting the GPU to greatly accelerate the simulations.
>
> The output you posted says the GPU version was faster (running for only
> 106 min). The CPU cores are much more powerful; I would expect them to
> be about as fast as the GPU.
> Roland