[gmx-users] Question about GPU acceleration

Mark Abraham mark.j.abraham at gmail.com
Fri Nov 28 10:00:16 CET 2014


On Fri, Nov 28, 2014 at 2:37 AM, chip <chip at bio.gnu.ac.kr> wrote:

> Greetings,
>
>
>
> I ran simulations using 1 GPU and 2 GPUs (system size: 203,009 atoms).
>
> They show almost the same performance.
>
>
>
> - Simulation 1 : i7-4790K / GTX980 x 1 / 32 GB RAM --> 6.179 ns/day
>
> - Simulation 2 : i7-4790K / GTX980 x 2 / 32 GB RAM --> 6.607 ns/day
>

GROMACS makes heavy use of both the CPU and the GPU, with only limited
ability to shift load between the two, so a setup that is already limited by
one half won't benefit from doubling the resources available to the other
half. In particular, if your CPU has only 4 real cores (i.e. 8 hardware
threads with hyper-threading), then driving a second GPU is basically not a
useful option.
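
For reference, here is a minimal sketch of how your two setups map onto
mdrun's options (the -deffnm names are just placeholders); this is what the
auto-selection reported in your logs did implicitly:

  # one PP rank driving one GPU, all 8 hardware threads on that rank
  gmx mdrun -deffnm run_1gpu -ntmpi 1 -ntomp 8 -gpu_id 0 -pin on

  # two PP ranks, one GPU each, 4 hardware threads per rank
  gmx mdrun -deffnm run_2gpu -ntmpi 2 -ntomp 4 -gpu_id 01 -pin on

Either way, the same four physical cores have to do the PME mesh, bonded
forces, constraints and integration while also feeding the GPU(s), which is
why the second card barely helps.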



>
>
>
> Is this the maximum performance?
>
> Or is it caused by the Maxwell chipset?
>
> The GPU acceleration part of the GROMACS website says "GPUs with Fermi and
> Kepler chips".
>
> Or does it need more optimization at configure time?
>
> Please give me your advice.
>
>
>
> The details of the system settings and calculation times are as follows:
>
> Fedora20 (Linux 3.11.10-301.fc20.x86_64 x86_64)
>
> Gromacs 5.0.2 (single precision) / fftw-3.3.4-sse2 / CUDA 6.5.12
>
> MPI library: thread_mpi
>
> SIMD instructions: AVX_256
>
> RDTSCP usage: enabled
>
> C++11 compilation: disabled
>
> TNG support: enabled
>
> Tracing support: disabled
>
> C compiler: /usr/lib64/ccache/gcc GNU 4.8.3
>
> C++ compiler: /usr/lib64/ccache/g++ GNU 4.8.3
>
> Boost version: 1.55.0 (internal)
>
>
>
>
>
> - Simulation 1 : i7-4790K / GTX980 x 1 / 32 GB RAM
>
> Using 1 MPI thread
>
> Using 8 OpenMP threads
>
> Compiled SIMD instructions: AVX_256
>
>
>
> 1 GPU detected:
>
> #0: NVIDIA GeForce GTX 980, compute cap.: 5.2, ECC: no, stat: compatible
>
>
>
> 1 GPU auto-selected for this run.
>
> Mapping of GPU to the 1 PP rank in this node: #0
>
>
>
> Computing            Num    Num      Call    Wall time    Giga-        %
>                      Ranks  Threads  Count   (s)          Cycles
> Neighbor search        1      8        626      12.239      391.657    1.8
> Launch GPU ops.        1      8      25001       3.163      101.235    0.5
> Force                  1      8      25001     136.154     4357.077   19.5
> PME mesh               1      8      25001     328.789    10521.633   47.0
> Wait GPU local         1      8      25001       4.544      145.429    0.7
> NB X/F buffer ops.     1      8      49376      14.677      469.693    2.1
> Write traj.            1      8         51       0.903       28.910    0.1
> Update                 1      8      25001      35.727     1143.305    5.1
> Constraints            1      8      25001      98.884     3164.399   14.1
> Rest                                             64.045     2049.504    9.2
> Total                                           699.127    22372.840  100.0
>
>
>
> Force evaluation time GPU/CPU: 16.283 ms/18.597 ms = 0.876
>
> For optimal performance this ratio should be close to 1!
>

This simulation already spends more time on the CPU work than on the GPU
work...
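
That comparison comes straight from the table above: per step the CPU spends
roughly

  (Force + PME mesh) / steps = (136.154 s + 328.789 s) / 25001 ~= 18.6 ms

against about 16.3 ms of short-range nonbonded work on the GPU, so the GPU
finishes its share first and then waits for the CPU. mdrun's automated PME
tuning can shift some electrostatics work onto the GPU by scaling the
cut-off and the PME grid, but it cannot create extra CPU cores.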


>
>
> Core t (s) : 5517.331   /   Wall t (s) : 699.127   /   (%) : 789.2
>
> (ns/day) : 6.179   /   (hour/ns) : 3.884
>
>
>
>
>
>
>
> - Simulation 2 : i7-4790K / GTX980 x 2 / 32 GB RAM
>
> Using 2 MPI threads
>
> Using 4 OpenMP threads per tMPI thread
>
> Compiled SIMD instructions: AVX_256
>
>
>
> 2 GPUs detected:
>
> #0: NVIDIA GeForce GTX 980, compute cap.: 5.2, ECC: no, stat: compatible
>
> #1: NVIDIA GeForce GTX 980, compute cap.: 5.2, ECC: no, stat: compatible
>
>
>
> 2 GPUs auto-selected for this run.
>
> Mapping of GPUs to the 2 PP ranks in this node: #0, #1
>
>
>
> av. #atoms communicated per step for force: 2 x 71069.2
>
> av. #atoms communicated per step for LINCS: 2 x 3830.1
>
>
>
> Average load imbalance: 3.8 %
>
> Part of the total run time spent waiting due to load imbalance: 0.9 %
>
>
>
> Computing            Num    Num      Call    Wall time    Giga-        %
>                      Ranks  Threads  Count   (s)          Cycles
> Domain decomp.         2      4        625      13.465      430.878    2.1
> DD comm. load          2      4        122       0.000        0.009    0.0
> Neighbor search        2      4        626      17.728      567.309    2.7
> Launch GPU ops.        2      4      50002       3.738      119.623    0.6
> Comm. coord.           2      4      24375       8.803      281.716    1.4
> Force                  2      4      25001     124.601     3987.320   19.1
> Wait + Comm. F         2      4      25001      16.110      515.532    2.5
> PME mesh               2      4      25001     258.526     8272.989   39.7
> Wait GPU nonlocal      2      4      25001       9.763      312.417    1.5
> Wait GPU local         2      4      25001       0.107        3.436    0.0
> NB X/F buffer ops.     2      4      98752      18.668      597.392    2.9
> Write traj.            2      4         51       0.523       16.736    0.1
> Update                 2      4      25001      34.983     1119.480    5.4
> Constraints            2      4      25001     100.109     3203.546   15.4
> Comm. energies         2      4      25001       0.572       18.318    0.1
> Rest                                             43.410     1389.150    6.7
> Total                                           651.108    20835.850  100.0
>
>
>
> Core t (s) : 5174.321   /   Wall t (s) : 653.845   /   (%) : 791.4
>
> (ns/day) : 6.607   /   (hour/ns) : 3.632
>
>
>
No other problems are revealed here (though please don't make the
pre-formatted log tables hard to read by introducing separators when you
post!), but the diagnostic information is probably earlier in the log, where
the load balancing takes place. Uploading your .log files to a file-sharing
service and sharing the URLs would be more useful.
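
If you want to look at those parts yourself before uploading, a quick sketch
(assuming the log file is called md.log; adjust the name to yours):

  grep -E "Average load imbalance|waiting due to load imbalance" md.log
  grep "Force evaluation time GPU/CPU" md.log
  grep "Mapping of GPU" md.log

But the complete files are still far more useful for diagnosis.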


>
> Additionally, I ran another simulation using 16 GB RAM.
>
> It also showed performance similar to the 32 GB RAM run.
>
>
>
> - Simulation 3 : i7-4790K / GTX980 x 1 / 16 GB RAM --> 6.165 ns/day
>
>
>
> Is this too much RAM for running GROMACS?
>

No, you can't have too much RAM. GROMACS needs very little, though.

Mark


>
>
>
>
> Thank you.
>

