[gmx-users] GPU slower than I7
Renato Freitas
renatoffs at gmail.com
Fri Oct 22 21:20:55 CEST 2010
Hi Szilárd,
Thanks for your explanation. Do you know whether there will be further
improvements to the PME algorithms to take full advantage of GPU video
cards?
Do you think the difference between the "NODE" and "Real" times could be
attributed to some compilation problem in mdrun-gpu? I'm asking even
though I didn't get any errors during compilation.
Thanks,
Renato
2010/10/22 Szilárd Páll <szilard.pall at cbr.su.se>:
> Hi Renato,
>
> First of all, what you're seeing is pretty normal, especially given
> that you have a CPU that is borderline insane :) Why is it normal?
> The PME algorithms are simply not well suited for current GPU
> architectures. With an ill-suited algorithm you won't see the speedups
> you often see in other application areas, even more so when you're
> comparing against Gromacs on an i7 980X. For more info and benchmarks,
> see the Gromacs-GPU page:
> http://www.gromacs.org/gpu
>
> However, there is one strange thing you also pointed out: the fact
> that the "NODE" and "Real" times in your mdrun-gpu timing summary are
> not the same, but differ by a factor of 3, is _very_ unusual. I've run
> mdrun-gpu on quite a wide variety of hardware, but I've never seen
> those two counters deviate. It might be an artifact of the cycle
> counters used internally behaving in an unusual way on your CPU.
>
> One other thing I should point out is that you would be better off
> using the standard mdrun, which in 4.5 has thread support by default
> and will therefore run in parallel on a single CPU/node without MPI!
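>
> As a minimal sketch (assuming the threaded build is installed as plain
> "mdrun" and that your build accepts the -nt option for the number of
> threads), something like:
>
>   mdrun -nt 6 -s topol.tpr -v >& out &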
>
> Cheers,
> --
> Szilárd
>
>
>
> On Thu, Oct 21, 2010 at 9:18 PM, Renato Freitas <renatoffs at gmail.com> wrote:
>> Hi gromacs users,
>>
>> I have installed the latest version of Gromacs (4.5.1) on an i7 980X
>> (6 cores, or 12 with HT on; 3.3 GHz) with 12 GB of RAM and compiled
>> its MPI version. I also compiled the GPU-accelerated version of
>> Gromacs. Then I ran a 2 ns simulation of a small system (11042 atoms)
>> to compare the performance of mdrun-gpu vs. mdrun_mpi. The results I
>> got are below:
>>
>> ############################################
>> My *.mdp is:
>>
>> constraints = all-bonds
>> integrator = md
>> dt = 0.002 ; ps !
>> nsteps = 1000000 ; total 2000 ps.
>> nstlist = 10
>> ns_type = grid
>> coulombtype = PME
>> rvdw = 0.9
>> rlist = 0.9
>> rcoulomb = 0.9
>> fourierspacing = 0.10
>> pme_order = 4
>> ewald_rtol = 1e-5
>> vdwtype = cut-off
>> pbc = xyz
>> epsilon_rf = 0
>> comm_mode = linear
>> nstxout = 1000
>> nstvout = 0
>> nstfout = 0
>> nstxtcout = 1000
>> nstlog = 1000
>> nstenergy = 1000
>> ; Berendsen temperature coupling is on in four groups
>> tcoupl = berendsen
>> tc-grps = system
>> tau-t = 0.1
>> ref-t = 298
>> ; Pressure coupling is on
>> Pcoupl = berendsen
>> pcoupltype = isotropic
>> tau_p = 0.5
>> compressibility = 4.5e-5
>> ref_p = 1.0
>> ; Generate velocities is on at 298 K.
>> gen_vel = no
>>
>> ########################
>> RUNNING GROMACS ON GPU
>>
>> mdrun-gpu -s topol.tpr -v >& out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 09:52:09 2010
>> .
>> .
>> .
>>  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>  Computing:          Nodes     Number     G-Cycles    Seconds     %
>> ---------------------------------------------------------------------
>>  Write traj.             1       1021       106.075       31.7    0.2
>>  Rest                    1                64125.577    19178.6   99.8
>> ---------------------------------------------------------------------
>>  Total                   1                64231.652    19210.3  100.0
>> ---------------------------------------------------------------------
>>
>> NODE (s) Real (s) (%)
>> Time: 6381.840 19210.349 33.2
>> 1h46:21
>> (Mnbf/s) (MFlops) (ns/day) (hour/ns)
>> Performance: 0.000 0.001 27.077 0.886
>>
>> Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
>>
>> ########################
>> RUNNING GROMACS ON MPI
>>
>> mpirun -np 6 mdrun_mpi -s topol.tpr -npme 3 -v >& out &
>>
>> Here is a part of the md.log:
>>
>> Started mdrun on node 0 Wed Oct 20 18:30:52 2010
>>
>>  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>  Computing:          Nodes     Number     G-Cycles    Seconds     %
>> ---------------------------------------------------------------------
>>  Domain decomp.          3     100001      1452.166      434.7    0.6
>>  DD comm. load           3      10001         0.745        0.2    0.0
>>  Send X to PME           3    1000001       249.003       74.5    0.1
>>  Comm. coord.            3    1000001       637.329      190.8    0.3
>>  Neighbor search         3     100001      8738.669     2616.0    3.5
>>  Force                   3    1000001     99210.202    29699.2   39.2
>>  Wait + Comm. F          3    1000001      3361.591     1006.3    1.3
>>  PME mesh                3    1000001     66189.554    19814.2   26.2
>>  Wait + Comm. X/F        3               60294.513      8049.5   23.8
>>  Wait + Recv. PME F      3    1000001       801.897      240.1    0.3
>>  Write traj.             3       1015        33.464       10.0    0.0
>>  Update                  3    1000001      3295.820      986.6    1.3
>>  Constraints             3    1000001      6317.568     1891.2    2.5
>>  Comm. energies          3     100002        70.784       21.2    0.0
>>  Rest                    3                 2314.844      693.0    0.9
>> ---------------------------------------------------------------------
>>  Total                   6               252968.148    75727.5  100.0
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>>  PME redist. X/F         3    2000002      1945.551      582.4    0.8
>>  PME spread/gather       3    2000002     37219.607    11141.9   14.7
>>  PME 3D-FFT              3    2000002     21453.362     6422.2    8.5
>>  PME solve               3    1000001      5551.056     1661.7    2.2
>> ---------------------------------------------------------------------
>>
>> Parallel run - timing based on wallclock.
>>
>> NODE (s) Real (s) (%)
>> Time: 12621.257 12621.257 100.0
>> 3h30:21
>> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
>> Performance: 388.633 28.773 13.691 1.753
>> Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
>>
>> ######################################
>> Comparing the performance values for the two simulations, I saw that
>> in "numeric terms" the GPU simulation gave ~27 ns/day, while with MPI
>> this value is approximately half (13.7 ns/day).
>> However, when I compared the times at which each simulation started
>> and finished, the MPI run took 211 minutes while the GPU run took 320
>> minutes to finish.
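>>
>> As a back-of-the-envelope check, assuming the reported ns/day figures
>> are derived from the "NODE" time rather than the wall-clock "Real" time:
>>
>>   GPU:  2 ns / 6381.84 s  * 86400 s/day ~ 27.1 ns/day  (reported)
>>         2 ns / 19210.35 s * 86400 s/day ~  9.0 ns/day  (wall clock)
>>   MPI:  2 ns / 12621.26 s * 86400 s/day ~ 13.7 ns/day  (NODE = Real)
>>
>> which is consistent with the MPI run finishing first on the wall clock.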
>>
>> My questions are:
>>
>> 1. Why do the performance values show better results for the GPU?
>>
>> 2. Why was the GPU simulation 109 minutes slower than the run on 6
>> cores, given that my video card is a GTX 480 with 480 GPU cores? I was
>> expecting the GPU to greatly accelerate the simulations.
>>
>>
>> Does anyone have some idea?
>>
>> Thanks,
>>
>> Renato