[gmx-users] NVIDIA GTX cards in Rackable servers, how do you do it ?

Wed Feb 25 15:14:53 CET 2015

BS"D

Dear David and Szilárd

Here are the results from our tests:

Gromacs 5.0 with AVX2 on E5-2699v3
nodes  cores  cores/socket  ns/day  wall time (s)  scaling  ideal scaling  ideal ns/day
    1     12             6  20.023       4315.029     1.00           1.00        20.023
    1     16             8  24.912       3468.233     1.24           1.33        26.697
    1     20            10  28.567       3024.440     1.43           1.67        33.372
    1     24            12  34.870       2477.786     1.74           2.00        40.046
    1     28            14  40.202       2149.149     2.01           2.33        46.720
    1     32            16  40.514       2132.615     2.02           2.67        53.395
    2     24             6  36.739       2351.722     1.83           2.00        40.046
    2     32             8  41.146       2099.856     2.05           2.67        53.395
    2     40            10  52.974       1630.995     2.65           3.33        66.743
    2     48            12  63.909       1351.929     3.19           4.00        80.092
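
For anyone who wants to re-derive the scaling columns in the table above, here is a minimal Python sketch. It assumes, as the table appears to, that the 1-node, 12-core run is the baseline: scaling is ns/day divided by the baseline ns/day, and ideal scaling is cores divided by 12. The parallel efficiency column (scaling divided by ideal scaling) is not in the table and is added here only for illustration; the names are arbitrary.

```python
# Measured ns/day from the table, keyed by (nodes, total cores).
measured = {
    (1, 12): 20.023,
    (1, 16): 24.912,
    (1, 20): 28.567,
    (1, 24): 34.870,
    (1, 28): 40.202,
    (1, 32): 40.514,
    (2, 24): 36.739,
    (2, 32): 41.146,
    (2, 40): 52.974,
    (2, 48): 63.909,
}

# Assumption: the 1-node, 12-core run is the reference point.
BASE_CORES = 12
BASE_NS_PER_DAY = measured[(1, 12)]


def scaling_row(cores, ns_per_day):
    """Return (scaling, ideal scaling, parallel efficiency) for one run."""
    scaling = ns_per_day / BASE_NS_PER_DAY
    ideal = cores / BASE_CORES
    efficiency = scaling / ideal  # derived here, not a column in the table
    return scaling, ideal, efficiency


rows = {key: scaling_row(key[1], ns) for key, ns in measured.items()}

for (nodes, cores), (scaling, ideal, eff) in rows.items():
    print(f"{nodes:5d} {cores:5d} {scaling:7.2f} {ideal:7.2f} {eff:7.1%}")
```

On these numbers the efficiency on one node falls from about 86% at 28 cores to about 76% at 32 cores, which is the 14 to 16 cores per socket step.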

The job using 18 cores failed.

I have to go and sort out the log files, if you want to see them.

So I did make one mistake: one does get acceleration up to 14 cores per socket, not only 12 as I thought.  The issue is going from 14 to 16 cores per socket.  Pinning was used.
The test case is not a large system, which might play a role here.  If you've had positive experiences with 18 cores, then perhaps the

I have a question about the comment on AVX2 clock slowdown on Haswell:

However, let me add a few notes/warnings:
* The Xeon v3's clock is deceiving (borderline lie from Intel), in AVX
mode those 2699V3-s run at around 1.9 GHz; at that point the
difference between the two CPUs becomes quite likely <=25% and if
you'd take an E5-2697v2 which should be only a couple of 100s more
than the 2695v2 the difference would likely become even less;
* Instead of the E5-2699V3 I think you may be better off with the
E5-2697 v3 - especially if both drop the clock by 400 MHz in AVX mode.

While it's true that the base clock speed for AVX is reduced (for the 2699 from 2.3 GHz to 1.9 GHz, as you mention), the AVX clock speed is still subject to Turbo, and so one gets to 3.3 GHz. That is 0.3 GHz slower than the normal maximum of 3.6 GHz, but it is a smaller percentage change.  Isn't that speed the more relevant metric under load?
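
To put numbers on "a smaller percentage change", here is a small Python sketch of the arithmetic. The clock figures are the ones quoted in this thread, not values checked against Intel's specifications, and whether the cores actually sustain 3.3 GHz under full AVX load is exactly the open question; the code only compares the relative drops.

```python
def relative_drop(normal_ghz, avx_ghz):
    """Fractional clock reduction when going from the normal to the AVX clock."""
    return (normal_ghz - avx_ghz) / normal_ghz


# Clock speeds as quoted in this thread for the E5-2699 v3 (assumed, not verified).
base_drop = relative_drop(2.3, 1.9)   # base clock: 2.3 GHz -> 1.9 GHz
turbo_drop = relative_drop(3.6, 3.3)  # max turbo:  3.6 GHz -> 3.3 GHz

print(f"base clock drop:  {base_drop:.1%}")
print(f"turbo clock drop: {turbo_drop:.1%}")
```

With these figures the base clock drops by roughly 17%, while the maximum turbo clock drops by roughly 8%.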

Thanks

Harry

Please share the details because it is of our interest to understand
and address such issues if they are reproducible.

However, note that I've run on CPUs with up to 18 cores (and up to
36-96 threads per socket) and in most cases the multi-threaded code
scales quite well - as long as not combined with DD/MPI. There are
some known multi-threaded scaling issues that are being addressed for
5.1, but without log files it's hard to know what is the nature of the
"performance penalty" you mention.

Note: HyperThreading and SMT in general change the situation, but
that's a different topic.

-------------------------------------------------------------------------

Harry M. Greenblatt

Associate Staff Scientist

Dept of Structural Biology

Weizmann Institute of Science        Phone:  972-8-934-3625

234 Herzl St.                        Facsimile:   972-8-934-4159

Rehovot, 76100

Israel

Harry.Greenblatt at weizmann.ac.il