[gmx-users] Gromacs compilation on AMD multicore
Alexey Shvetsov
alexxy at omrb.pnpi.spb.ru
Wed Jul 6 22:47:35 CEST 2011
Also, for an additional performance benefit you can use the ATLAS math library
and compile gromacs and FFTW with additional CFLAGS like '-mfpmath=sse
-m<sse_version>'. This enables SSE math (a 32-bit gcc otherwise defaults to
generic i386 code with x87 floating point).
PS You can also use CFLAGS="-O2 -pipe -march=native -mfpmath=sse -m<sse_version>"
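For example, a build with such flags could look roughly like the sketch below
(not a recipe from the run described here: the version numbers, install prefix
and -msse2 standing in for <sse_version> are placeholders, adjust to your CPU):

  export CFLAGS="-O2 -pipe -march=native -mfpmath=sse -msse2"
  # FFTW: default (double) precision with SSE2 kernels, to match the
  # double-precision gromacs build below
  cd fftw-3.2.2
  ./configure --enable-sse2 --prefix=$HOME/opt/fftw && make && make install
  # gromacs: double precision with MPI, pointed at the FFTW installed above
  cd ../gromacs-4.5.4
  CPPFLAGS=-I$HOME/opt/fftw/include LDFLAGS=-L$HOME/opt/fftw/lib \
    ./configure --disable-float --enable-mpi --prefix=$HOME/opt/gromacs
  make && make install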
PPS On large NUMA systems CPU affinity also gives a performance benefit. I ran a
simple test with modified d.dppc [1] parameters from gmxbench. The test platform
was 4x AMD Opteron 6174 (12 cores each) with 128G of RAM (32G per socket), i.e.
48 cores in total.
All tests were run with the double-precision version of gromacs and OpenMPI 1.5.3.

MPI run without CPU affinity:
 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:          Nodes    Number     G-Cycles     Seconds      %
-----------------------------------------------------------------------
 Domain decomp.         40      5001      4782.871      2173.6    0.7
 DD comm. load          40      5000       262.645       119.4    0.0
 DD comm. bounds        40      5000      1110.348       504.6    0.2
 Send X to PME          40     50001      3375.583      1534.1    0.5
 Comm. coord.           40     50001     25220.579     11461.8    3.8
 Neighbor search        40      5001     18235.819      8287.5    2.8
 Force                  40     50001    274274.633    124647.7   41.4
 Wait + Comm. F         40     50001     86122.984     39139.7   13.0
 PME mesh                8     50001     76654.251     34836.5   11.6
 Wait + Comm. X/F        8               33769.340     15346.9    5.1
 Wait + Recv. PME F     40     50001     64206.340     29179.4    9.7
 Write traj.            40         7        21.849         9.9    0.0
 Update                 40     50001       596.730       271.2    0.1
 Constraints            40     50001     71372.099     32436.0   10.8
 Comm. energies         40      5002      1065.027       484.0    0.2
 Rest                   40                1471.920       668.9    0.2
-----------------------------------------------------------------------
 Total                  48              662543.017    301101.4  100.0
-----------------------------------------------------------------------
-----------------------------------------------------------------------
 PME redist. X/F         8    100002     24791.282     11266.7    3.7
 PME spread/gather       8    100002     19021.991      8644.8    2.9
 PME 3D-FFT              8    100002     27949.861     12702.2    4.2
 PME solve               8     50001      4885.045      2220.1    0.7
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:   6272.946   6272.946    100.0
                       1h44:32
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    750.030     37.152      1.377     17.425
And the MPI run with CPU affinity set via the hwloc library:
 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:          Nodes    Number     G-Cycles     Seconds      %
-----------------------------------------------------------------------
 Domain decomp.         40      5001      2731.213      1241.3    0.7
 DD comm. load          40      5000        13.200         6.0    0.0
 DD comm. bounds        40      5000       185.423        84.3    0.1
 Send X to PME          40     50001      1034.850       470.3    0.3
 Comm. coord.           40     50001      3810.865      1732.0    1.0
 Neighbor search        40      5001     17518.149      7961.9    4.8
 Force                  40     50001    256996.831    116803.7   70.3
 Wait + Comm. F         40     50001      9129.265      4149.2    2.5
 PME mesh                8     50001     33662.705     15299.5    9.2
 Wait + Comm. X/F        8               27242.617     12381.6    7.5
 Wait + Recv. PME F     40     50001       526.714       239.4    0.1
 Write traj.            40         4         7.425         3.4    0.0
 Update                 40     50001       486.077       220.9    0.1
 Constraints            40     50001     10789.300      4903.7    3.0
 Comm. energies         40      5002       199.327        90.6    0.1
 Rest                   40                1098.793       499.4    0.3
-----------------------------------------------------------------------
 Total                  48              365432.754    166087.3  100.0
-----------------------------------------------------------------------
-----------------------------------------------------------------------
 PME redist. X/F         8    100002      2379.702      1081.6    0.7
 PME spread/gather       8    100002     11659.625      5299.2    3.2
 PME 3D-FFT              8    100002     14864.210      6755.7    4.1
 PME solve               8     50001      4753.059      2160.2    1.3
-----------------------------------------------------------------------

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:   3460.152   3460.152    100.0
                        57:40
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1360.380     67.390      2.497      9.611
So that's about a 2x performance gain =)
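For completeness, pinning does not have to be done inside the code. Something
along these lines works as well (a sketch only: the binary name, core numbers
and options depend on your build and on the OpenMPI/hwloc versions, and this is
not exactly how the run above was pinned):

  # let OpenMPI bind one rank per core
  mpirun -np 48 --bind-to-core --bycore mdrun_mpi_d -deffnm topol
  # or pin an arbitrary process with hwloc's hwloc-bind utility
  hwloc-bind socket:0.core:0 -- ./some_program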
[1] http://omrb.pnpi.spb.ru/~alexxy/test.tar.bz2
On Wednesday 06 of July 2011 03:33:22 Szilárd Páll wrote:
> Additionally, if you care about a few percent of extra performance, you
> should use gcc 4.5 or 4.6 for compiling Gromacs as well as FFTW
> (unless you have a bleeding-edge OS that was already built with one of
> these latest gcc versions). While you might not see a lot of improvement
> in mdrun performance itself when moving to gcc >= 4.5, as far as I
> remember FFTW gets slightly more of a boost from the new gcc versions.
>
> I can't comment on other compilers; I haven't tried to run binaries
> compiled with the Intel Compiler on AMD lately.
>
> --
> Szilárd
>
--
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru