[gmx-users] GPU slower than I7
Roland Schulz
roland at utk.edu
Thu Oct 21 22:56:02 CEST 2010
On Thu, Oct 21, 2010 at 3:18 PM, Renato Freitas <renatoffs at gmail.com> wrote:
> Hi gromacs users,
>
> I have installed the latest version of gromacs (4.5.1) on an i7 980X
> (6 cores, or 12 with HT on; 3.3 GHz) with 12 GB of RAM and compiled its
> MPI version. I also compiled the GPU-accelerated
> version of gromacs. Then I did a 2 ns simulation using a small system
> (11042 atoms) to compare the performance of mdrun-gpu vs. mdrun_mpi.
> The results that I got are below:
>
> ############################################
> My *.mdp is:
>
> constraints = all-bonds
> integrator = md
> dt = 0.002 ; ps !
> nsteps = 1000000 ; total 2000 ps.
> nstlist = 10
> ns_type = grid
> coulombtype = PME
> rvdw = 0.9
> rlist = 0.9
> rcoulomb = 0.9
> fourierspacing = 0.10
> pme_order = 4
> ewald_rtol = 1e-5
> vdwtype = cut-off
> pbc = xyz
> epsilon_rf = 0
> comm_mode = linear
> nstxout = 1000
> nstvout = 0
> nstfout = 0
> nstxtcout = 1000
> nstlog = 1000
> nstenergy = 1000
> ; Berendsen temperature coupling is on (one group)
> tcoupl = berendsen
> tc-grps = system
> tau-t = 0.1
> ref-t = 298
> ; Pressure coupling is on
> Pcoupl = berendsen
> pcoupltype = isotropic
> tau_p = 0.5
> compressibility = 4.5e-5
> ref_p = 1.0
> ; Velocity generation is off
> gen_vel = no
>
> ########################
> RUNNING GROMACS ON GPU
>
> mdrun-gpu -s topol.tpr -v >& out &
>
> Here is a part of the md.log:
>
> Started mdrun on node 0 Wed Oct 20 09:52:09 2010
> .
> .
> .
>  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes     Number     G-Cycles    Seconds      %
> -------------------------------------------------------------------
>  Write traj.            1       1021      106.075       31.7     0.2
>  Rest                   1                64125.577    19178.6    99.8
> -------------------------------------------------------------------
>  Total                  1                64231.652    19210.3   100.0
> -------------------------------------------------------------------
>
>                NODE (s)   Real (s)      (%)
>        Time:   6381.840  19210.349     33.2
>                        1h46:21
>                (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
> Performance:      0.000      0.001     27.077      0.886
>
> Finished mdrun on node 0 Wed Oct 20 15:12:19 2010
>
> ########################
> RUNNING GROMACS ON MPI
>
> mpirun -np 6 mdrun_mpi -s topol.tpr -npme 3 -v >& out &
>
> Here is a part of the md.log:
>
> Started mdrun on node 0 Wed Oct 20 18:30:52 2010
>
>  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:           Nodes     Number     G-Cycles    Seconds      %
> ---------------------------------------------------------------------
>  Domain decomp.           3     100001     1452.166      434.7     0.6
>  DD comm. load            3      10001        0.745        0.2     0.0
>  Send X to PME            3    1000001      249.003       74.5     0.1
>  Comm. coord.             3    1000001      637.329      190.8     0.3
>  Neighbor search          3     100001     8738.669     2616.0     3.5
>  Force                    3    1000001    99210.202    29699.2    39.2
>  Wait + Comm. F           3    1000001     3361.591     1006.3     1.3
>  PME mesh                 3    1000001    66189.554    19814.2    26.2
>  Wait + Comm. X/F         3               60294.513     8049.5    23.8
>  Wait + Recv. PME F       3    1000001      801.897      240.1     0.3
>  Write traj.              3       1015       33.464       10.0     0.0
>  Update                   3    1000001     3295.820      986.6     1.3
>  Constraints              3    1000001     6317.568     1891.2     2.5
>  Comm. energies           3     100002       70.784       21.2     0.0
>  Rest                     3                 2314.844      693.0     0.9
> ---------------------------------------------------------------------
>  Total                    6               252968.148    75727.5   100.0
> ---------------------------------------------------------------------
>
> ---------------------------------------------------------------------
>  PME redist. X/F          3    2000002     1945.551      582.4     0.8
>  PME spread/gather        3    2000002    37219.607    11141.9    14.7
>  PME 3D-FFT               3    2000002    21453.362     6422.2     8.5
>  PME solve                3    1000001     5551.056     1661.7     2.2
> ---------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
>                NODE (s)   Real (s)      (%)
>        Time:  12621.257  12621.257    100.0
>                        3h30:21
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:    388.633     28.773     13.691      1.753
> Finished mdrun on node 0 Wed Oct 20 22:01:14 2010
>
> ######################################
> Comparing the performance values for the two simulations, I saw that in
> "numeric terms" the simulation using the GPU gave (for example) ~27
> ns/day, while with MPI this value is approximately half (13.7
> ns/day).
> However, when I compared the times at which each simulation
> started and finished, the MPI simulation took 211 minutes while the
> GPU simulation took 320 minutes to finish.
>
> My questions are:
>
> 1. Why do the performance values show better results with the GPU?
>
Your CPU setup can probably be optimized a bit. You should use HT and run
on 12 threads. Make sure the PME/PP load is balanced and use the best
rlist/fourier_spacing ratio. Also, your PME accuracy is rather high; make sure
you need that (a fourier spacing of 0.11 should be accurate enough for an
rlist of 0.9). Your PME nodes spent 23% of their time waiting on the PP nodes.
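
For example, something along these lines (a sketch only; it assumes the same
topol.tpr workflow as above, and the grompp input file names below are
placeholders for whatever you actually used):

# loosen the PME grid slightly while keeping rlist/rcoulomb at 0.9
# (in the .mdp):  fourierspacing = 0.11
# then regenerate the run input (placeholder file names):
grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr

# use all 12 hardware threads and let mdrun choose the PME/PP split
# itself (-npme -1 means "guess") instead of forcing 3 PME nodes:
mpirun -np 12 mdrun_mpi -s topol.tpr -npme -1 -v >& out &

GROMACS 4.5 also ships g_tune_pme, which can scan the number of PME-only
nodes (and optionally the grid settings) for the best throughput; that may
be worth a try as well.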
>
> 2. Why was the simulation running on the GPU 109 min slower than on 6
> cores, given that my video card is a GTX 480 with 480 GPU cores? I was
> expecting the GPU to greatly accelerate the simulations.
>
The output you posted says the GPU version was faster (its md.log reports a
run time of only 106 min). The CPU cores are much more powerful; I would
expect them to be about as fast as the GPU.
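
For reference, the 27 ns/day figure in the GPU md.log is derived from the
NODE time (2 ns over 6381.8 s is about 27 ns/day); the Real (s) column already
shows the ~19210 s of wall clock you measured. A quick way to pull that
summary out of each log (a sketch, assuming GNU grep and the md.log files
quoted above):

# print the NODE/Real time summary and the Performance line
grep -A 4 "NODE (s)" md.log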
Roland