[gmx-developers] Branches w/working OpenCL support

Mirco Wahab mirco.wahab at chemie.tu-freiberg.de
Mon Jun 1 21:26:44 CEST 2015


Hi Szilard,

thanks for your response.

On 31.05.2015 22:15, Szilárd Páll wrote:
>> After fixing two very small problems, I got it compiled
>> smoothly w/VS 2013 (VC12) and linked against ATI's actual
>> AppSDK.
>
> Those could be fixed in the code version that's on gerrit - although
> most changes have been backported to the Streamcomputing github repo,
> AFAIK:
> https://gerrit.gromacs.org/#/c/4314/
>
> If they are not, could you post the fixes so we can get them into the
> pending change?

I used this version now and had to make some small changes.
'listed-forces/bonded.cpp' uses binary literals (0b...), which are
C++14 (afaik) and will be in the MS tool-chain with VS 2015.
In another place, VC requires the inclusion of <algorithm>
when using std::min/max. That's it. It's running great so
far (at my workplace on an Nvidia card w/OpenCL 1.1). I'll
have access to my AMD card only after Thursday, so I used
the NV device here for first tests.
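
For anyone hitting the same two build issues, here is a minimal sketch of
the portable alternatives. The helper names fromBinaryString and
clampToRange are made up for illustration and are not part of GROMACS; the
first one only demonstrates that the binary literals in question equal the
hex constants used in the diff at the end of this mail.

```cpp
#include <algorithm>  // std::min / std::max; VC does not pull this in implicitly
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: converts a string of '0'/'1' characters to an integer,
// so the value of a binary literal can be checked without C++14 support.
std::uint32_t fromBinaryString(const std::string &bits)
{
    assert(bits.size() <= 32);
    std::uint32_t value = 0;
    for (char c : bits)
    {
        assert(c == '0' || c == '1');
        value = (value << 1) | static_cast<std::uint32_t>(c - '0');
    }
    return value;
}

// Hypothetical helper: clamps x to [lo, hi] with std::min/std::max,
// which are declared in <algorithm>.
int clampToRange(int x, int lo, int hi)
{
    return std::max(lo, std::min(x, hi));
}
```

With these, fromBinaryString("0100010001000100") equals 0x4444 and
fromBinaryString("1110111011101110") equals 0xEEEE.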

> I'm curious how do you define "comparable"? :)

As I see now, the OpenCL speedups on this NV test system
(i7/2600K + GTX-780) are in the range of 2x, whereas
the AMD system showed ~ 2.5x speedup. So that's at least
the same order of magnitude ...

rough example (ns/sec):
  adh-cubic-vsites - i7/2600k, GTX-780 (353.06), Cuda 7.0.28, OpenCL 1.1/2

          cpu     cpu+opencl   cpu+cuda
   rf     12.3    24.7         33.5
   pme    7.8     15           19
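
The speedup factors quoted above follow directly from these numbers. As a
small sketch (the function name speedup is only illustrative), the ratio of
an accelerated run to the CPU-only run can be computed like this:

```cpp
#include <cassert>

// Speedup of an accelerated run relative to the CPU-only run.
// Both arguments must be throughput values in the same units.
double speedup(double cpuOnly, double accelerated)
{
    assert(cpuOnly > 0.0);
    return accelerated / cpuOnly;
}
```

For rf this gives roughly 2.0 (24.7 / 12.3) with OpenCL and 2.7 (33.5 / 12.3)
with Cuda; for pme roughly 1.9 (15 / 7.8) and 2.4 (19 / 7.8).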

> Could you please share log files (would be great if you could repeat
> each run 2-3 times)? There are some overheads we observe on Linux
> which may or may not affect the Windows builds.

I have collected many log files for the tests provided on
the gromacs-acceleration page and will post them eventually.
Interestingly, the version downloaded from gerrit (4314) is
consistently 5-10% faster than the streamcomputing version
(I timed both). The run times of repeated runs agree to
within a few percent if the pme autotuner finds
the same grid size (which sometimes can differ slightly).

One puzzle I didn't solve: nVidia provides OpenCL 1.2 DLLs
with newer drivers (after 350.x or so), but Cuda7 only ships
older libraries that don't seem to provide 1.2 functionality
(at least cmake isn't able to detect it), so I'm forced to
use 1.1 although 1.2 is there?
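
One way to see what the installed driver itself reports, independent of
which headers and import libraries cmake finds, is to query
CL_PLATFORM_VERSION through clGetPlatformInfo. The OpenCL specification
gives that string the form "OpenCL <major>.<minor> <platform-specific
info>". The sketch below only parses such a string; the struct and function
names are made up for illustration, and it does not address the link-time
detection problem.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Illustrative type holding a parsed OpenCL version.
struct ClVersion
{
    int majorVer;
    int minorVer;
};

// Parses a string of the form "OpenCL <major>.<minor> <vendor info>".
// Returns {0, 0} if the string does not match that form.
ClVersion parseClVersion(const std::string &s)
{
    ClVersion v = {0, 0};
    int       maj = 0, mnr = 0;
    if (std::sscanf(s.c_str(), "OpenCL %d.%d", &maj, &mnr) == 2)
    {
        v.majorVer = maj;
        v.minorVer = mnr;
    }
    return v;
}

// True if v is at least wantMajor.wantMinor.
bool supportsAtLeast(const ClVersion &v, int wantMajor, int wantMinor)
{
    assert(wantMajor >= 0 && wantMinor >= 0);
    return v.majorVer > wantMajor
           || (v.majorVer == wantMajor && v.minorVer >= wantMinor);
}
```

A platform string that starts with "OpenCL 1.2" would pass
supportsAtLeast(v, 1, 2), while one that starts with "OpenCL 1.1" would not.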

Another problem on Windows: the GMX_OCL_FILE_PATH environment
variable. This is supposed to be the location of the
runtime generated kernels. I didn't manage to provide
any usable "form" of path description on Windows, so I set
it to empty,
    GMX_OCL_FILE_PATH=
which creates the kernels in the current simulation run
directory.
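
The workaround relies on the difference between a variable that is unset
and one that is set but empty. The following sketch shows that distinction
only; it is not the GROMACS implementation, and the function names, the
defaultDir parameter and the fallback for the unset case are assumptions
made for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// envValue:   what std::getenv returned (nullptr if the variable is unset).
// defaultDir: hypothetical built-in default used when the variable is unset.
std::string resolveKernelDir(const char *envValue, const std::string &defaultDir)
{
    assert(!defaultDir.empty());
    if (envValue == nullptr)
    {
        return defaultDir;  // variable not set at all
    }
    if (envValue[0] == '\0')
    {
        return ".";         // set but empty: use the current directory
    }
    return envValue;        // explicit path, used verbatim
}

// Convenience wrapper that reads the real environment variable.
std::string kernelDirFromEnvironment(const std::string &defaultDir)
{
    return resolveKernelDir(std::getenv("GMX_OCL_FILE_PATH"), defaultDir);
}
```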


Thanks & regards

M.

=> diff for https://gerrit.gromacs.org/#/c/4314/

----- 8< ----- [cut here] -----

diff --git a/src/gromacs/gmxana/gmx_wham.cpp b/src/gromacs/gmxana/gmx_wham.cpp
index cdcd4bc..b47555f 100644
--- a/src/gromacs/gmxana/gmx_wham.cpp
+++ b/src/gromacs/gmxana/gmx_wham.cpp
@@ -50,7 +50,7 @@
  #include <string.h>

  #include <sstream>
-
+#include <algorithm> // std::min/max on VC
  #include "gromacs/commandline/pargs.h"
  #include "gromacs/fileio/tpxio.h"
  #include "gromacs/fileio/xvgr.h"
diff --git a/src/gromacs/listed-forces/bonded.cpp b/src/gromacs/listed-forces/bonded.cpp
index 2e2c859..9326e79 100644
--- a/src/gromacs/listed-forces/bonded.cpp
+++ b/src/gromacs/listed-forces/bonded.cpp
@@ -131,15 +131,21 @@ gmx_hack_simd_transpose4_r(gmx_simd_float_t *row0,
                             gmx_simd_float_t *row3)
  {
      __m256 tmp0, tmp1, tmp2, tmp3;
-
      tmp0  = _mm256_unpacklo_ps(*row0, *row1);
      tmp2  = _mm256_unpacklo_ps(*row2, *row3);
      tmp1  = _mm256_unpackhi_ps(*row0, *row1);
      tmp3  = _mm256_unpackhi_ps(*row2, *row3);
+#if (_MSC_VER <= 1800)
+    *row0 = _mm256_shuffle_ps(tmp0, tmp2, uint16_t(0x4444U));
+    *row1 = _mm256_shuffle_ps(tmp0, tmp2, uint16_t(0xEEEEU));
+    *row2 = _mm256_shuffle_ps(tmp1, tmp3, uint16_t(0x4444U));
+    *row3 = _mm256_shuffle_ps(tmp1, tmp3, uint16_t(0xEEEEU));
+#else
      *row0 = _mm256_shuffle_ps(tmp0, tmp2, 0b0100010001000100);
      *row1 = _mm256_shuffle_ps(tmp0, tmp2, 0b1110111011101110);
      *row2 = _mm256_shuffle_ps(tmp1, tmp3, 0b0100010001000100);
      *row3 = _mm256_shuffle_ps(tmp1, tmp3, 0b1110111011101110);
+#endif
  }

  static gmx_inline void gmx_simdcall
diff --git a/src/gromacs/mdlib/clincs.cpp b/src/gromacs/mdlib/clincs.cpp
index a1b9ed8..093af4e 100644
--- a/src/gromacs/mdlib/clincs.cpp
+++ b/src/gromacs/mdlib/clincs.cpp
@@ -155,10 +155,17 @@ gmx_hack_simd_transpose4_r(gmx_simd_float_t *row0,
      tmp2  = _mm256_unpacklo_ps(*row2, *row3);
      tmp1  = _mm256_unpackhi_ps(*row0, *row1);
      tmp3  = _mm256_unpackhi_ps(*row2, *row3);
+#if (_MSC_VER <= 1800)
+    *row0 = _mm256_shuffle_ps(tmp0, tmp2, uint16_t(0x4444U));
+    *row1 = _mm256_shuffle_ps(tmp0, tmp2, uint16_t(0xEEEEU));
+    *row2 = _mm256_shuffle_ps(tmp1, tmp3, uint16_t(0x4444U));
+    *row3 = _mm256_shuffle_ps(tmp1, tmp3, uint16_t(0xEEEEU));
+#else
      *row0 = _mm256_shuffle_ps(tmp0, tmp2, 0b0100010001000100);
      *row1 = _mm256_shuffle_ps(tmp0, tmp2, 0b1110111011101110);
      *row2 = _mm256_shuffle_ps(tmp1, tmp3, 0b0100010001000100);
      *row3 = _mm256_shuffle_ps(tmp1, tmp3, 0b1110111011101110);
+#endif
  }

  static gmx_inline void gmx_simdcall

<= ------------------------------


