[gmx-users] Gromacs compilation on AMD multicore
Alexey Shvetsov
alexxy at omrb.pnpi.spb.ru
Wed Jul 6 22:47:35 CEST 2011
Also, for an additional performance benefit you can use the ATLAS math library
and compile gromacs and FFTW with additional CFLAGS like '-mfpmath=sse
-m<sse_version>'. This enables SSE math (a 32-bit gcc otherwise defaults to
generic i386 code with x87 floating point).
PS You can also use CFLAGS="-O2 -pipe -march=native -mfpmath=sse -m<sse_version>"
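For example, a build with such flags could look roughly like the sketch below
(not a recipe from the run described here: the version numbers, install prefix
and -msse2 standing in for <sse_version> are placeholders, adjust to your CPU):

  export CFLAGS="-O2 -pipe -march=native -mfpmath=sse -msse2"
  # FFTW: default (double) precision with SSE2 kernels, to match the
  # double-precision gromacs build below
  cd fftw-3.2.2
  ./configure --enable-sse2 --prefix=$HOME/opt/fftw && make && make install
  # gromacs: double precision with MPI, pointed at the FFTW installed above
  cd ../gromacs-4.5.4
  CPPFLAGS=-I$HOME/opt/fftw/include LDFLAGS=-L$HOME/opt/fftw/lib \
    ./configure --disable-float --enable-mpi --prefix=$HOME/opt/gromacs
  make && make install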
PPS On large NUMA systems CPU affinity also gives a performance benefit. I ran a
simple test with modified d.dppc [1] parameters from gmxbench. The test platform
was 4x AMD Opteron 6174 (12 cores each) with 128G of RAM (32G per socket), i.e.
48 cores in total.
All tests were run with the double-precision version of gromacs and OpenMPI 1.5.3.

MPI run without CPU affinity:
 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:          Nodes    Number     G-Cycles     Seconds      %
-----------------------------------------------------------------------
 Domain decomp.         40      5001      4782.871      2173.6    0.7
 DD comm. load          40      5000       262.645       119.4    0.0
 DD comm. bounds        40      5000      1110.348       504.6    0.2
 Send X to PME          40     50001      3375.583      1534.1    0.5
 Comm. coord.           40     50001     25220.579     11461.8    3.8
 Neighbor search        40      5001     18235.819      8287.5    2.8
 Force                  40     50001    274274.633    124647.7   41.4
 Wait + Comm. F         40     50001     86122.984     39139.7   13.0
 PME mesh                8     50001     76654.251     34836.5   11.6
 Wait + Comm. X/F        8               33769.340     15346.9    5.1
 Wait + Recv. PME F     40     50001     64206.340     29179.4    9.7
 Write traj.            40         7        21.849         9.9    0.0
 Update                 40     50001       596.730       271.2    0.1
 Constraints            40     50001     71372.099     32436.0   10.8
 Comm. energies         40      5002      1065.027       484.0    0.2
 Rest                   40                1471.920       668.9    0.2
-----------------------------------------------------------------------
 Total                  48              662543.017    301101.4  100.0
-----------------------------------------------------------------------
-----------------------------------------------------------------------
 PME redist. X/F         8    100002     24791.282     11266.7    3.7
 PME spread/gather       8    100002     19021.991      8644.8    2.9
 PME 3D-FFT              8    100002     27949.861     12702.2    4.2
 PME solve               8     50001      4885.045      2220.1    0.7
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:   6272.946   6272.946    100.0
                       1h44:32
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    750.030     37.152      1.377     17.425
And the MPI run with CPU affinity set via the hwloc library:
 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:          Nodes    Number     G-Cycles     Seconds      %
-----------------------------------------------------------------------
 Domain decomp.         40      5001      2731.213      1241.3    0.7
 DD comm. load          40      5000        13.200         6.0    0.0
 DD comm. bounds        40      5000       185.423        84.3    0.1
 Send X to PME          40     50001      1034.850       470.3    0.3
 Comm. coord.           40     50001      3810.865      1732.0    1.0
 Neighbor search        40      5001     17518.149      7961.9    4.8
 Force                  40     50001    256996.831    116803.7   70.3
 Wait + Comm. F         40     50001      9129.265      4149.2    2.5
 PME mesh                8     50001     33662.705     15299.5    9.2
 Wait + Comm. X/F        8               27242.617     12381.6    7.5
 Wait + Recv. PME F     40     50001       526.714       239.4    0.1
 Write traj.            40         4         7.425         3.4    0.0
 Update                 40     50001       486.077       220.9    0.1
 Constraints            40     50001     10789.300      4903.7    3.0
 Comm. energies         40      5002       199.327        90.6    0.1
 Rest                   40                1098.793       499.4    0.3
-----------------------------------------------------------------------
 Total                  48              365432.754    166087.3  100.0
-----------------------------------------------------------------------
-----------------------------------------------------------------------
 PME redist. X/F         8    100002      2379.702      1081.6    0.7
 PME spread/gather       8    100002     11659.625      5299.2    3.2
 PME 3D-FFT              8    100002     14864.210      6755.7    4.1
 PME solve               8     50001      4753.059      2160.2    1.3
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:   3460.152   3460.152    100.0
                        57:40
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1360.380     67.390      2.497      9.611
So that's about a 2x performance gain =)
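For completeness, pinning does not have to be done inside the code. Something
along these lines works as well (a sketch only: the binary name, core numbers
and options depend on your build and on the OpenMPI/hwloc versions, and this is
not exactly how the run above was pinned):

  # let OpenMPI bind one rank per core
  mpirun -np 48 --bind-to-core --bycore mdrun_mpi_d -deffnm topol
  # or pin an arbitrary process with hwloc's hwloc-bind utility
  hwloc-bind socket:0.core:0 -- ./some_program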
[1] http://omrb.pnpi.spb.ru/~alexxy/test.tar.bz2
On Wednesday 06 of July 2011 03:33:22 Szilárd Páll wrote:
> Additionally, if you care about a few percent of extra performance, you
> should use gcc 4.5 or 4.6 for compiling Gromacs as well as FFTW
> (unless you have a bleeding-edge OS that was already built with one of
> these latest gcc versions). While you might not see a lot of improvement
> in mdrun performance itself when moving to gcc >= 4.5, as far as I
> remember FFTW gets slightly more of a boost from the new gcc versions.
>
> I can't comment on other compilers; I haven't tried to run binaries
> compiled with the Intel Compiler on AMD lately.
>
> --
> Szilárd
>
--
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru