[gmx-users] Question about GPU acceleration

chip chip at bio.gnu.ac.kr
Fri Nov 28 03:27:33 CET 2014


Greetings,

 

I ran simulations with 1 GPU and with 2 GPUs (system size: 203,009 atoms).

They show almost the same performance.

 

- Simulation 1 : i7-4790K / GTX980 x 1 / 32 GB RAM --> 6.179 ns/day

- Simulation 2 : i7-4790K / GTX980 x 2 / 32 GB RAM --> 6.607 ns/day
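
For reference, the gain from the second GPU can be quantified from these two numbers. This is a minimal Python sketch; the variable names are illustrative only and the values are copied from the two lines above.

```python
# Reported performance of the two runs (ns/day), copied from the post.
one_gpu_ns_per_day = 6.179
two_gpu_ns_per_day = 6.607

# Speedup of the 2-GPU run relative to the 1-GPU run.
speedup = two_gpu_ns_per_day / one_gpu_ns_per_day

# Parallel efficiency: fraction of the ideal 2x speedup actually achieved.
efficiency = speedup / 2

print(f"speedup: {speedup:.3f}x, efficiency: {efficiency:.1%}")
```

The second GPU makes the run about 7% faster, which is far from the 2x an ideal scaling would give.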

 

Is this the maximum performance, or is the small gain caused by the Maxwell chipset?

The GPU acceleration part of the GROMACS website says "GPUs with Fermi and Kepler chips".

Or does the build need more optimization at configure time?

Please give me your advice.

 

The details of the system settings and calculation times are as follows:

Fedora20 (Linux 3.11.10-301.fc20.x86_64 x86_64)

Gromacs 5.0.2 (single precision) / fftw-3.3.4-sse2 / CUDA 6.5.12

MPI library: thread_mpi

SIMD instructions: AVX_256

RDTSCP usage: enabled

C++11 compilation: disabled

TNG support: enabled

Tracing support: disabled

C compiler: /usr/lib64/ccache/gcc GNU 4.8.3

C++ compiler: /usr/lib64/ccache/g++ GNU 4.8.3

Boost version: 1.55.0 (internal)

 

 

- Simulation 1 : i7-4790K / GTX980 x 1 / 32 GB RAM

Using 1 MPI thread

Using 8 OpenMP threads

Compiled SIMD instructions: AVX_256

 

1 GPU detected:

#0: NVIDIA GeForce GTX 980, compute cap.: 5.2, ECC: no, stat: compatible

 

1 GPU auto-selected for this run.

Mapping of GPU to the 1 PP rank in this node: #0

 

Computing           Num Ranks  Num Threads  Call Count  Wall time (s)  Giga-Cycles      %
Neighbor search             1            8         626         12.239      391.657    1.8
Launch GPU ops.             1            8       25001          3.163      101.235    0.5
Force                       1            8       25001        136.154     4357.077   19.5
PME mesh                    1            8       25001        328.789    10521.633   47.0
Wait GPU local              1            8       25001          4.544      145.429    0.7
NB X/F buffer ops.          1            8       49376         14.677      469.693    2.1
Write traj.                 1            8          51          0.903       28.910    0.1
Update                      1            8       25001         35.727     1143.305    5.1
Constraints                 1            8       25001         98.884     3164.399   14.1
Rest                                                           64.045     2049.504    9.2
Total                                                         699.127    22372.840  100.0
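
The percentage column of the table above can be reproduced from the wall times, which also makes it easy to rank the rows. This is a minimal Python sketch with the numbers copied from the 1-GPU table; the dictionary and variable names are illustrative only.

```python
# Wall times (s) for the 1-GPU run, copied from the table above.
wall_time_s = {
    "Neighbor search": 12.239,
    "Launch GPU ops.": 3.163,
    "Force": 136.154,
    "PME mesh": 328.789,
    "Wait GPU local": 4.544,
    "NB X/F buffer ops.": 14.677,
    "Write traj.": 0.903,
    "Update": 35.727,
    "Constraints": 98.884,
    "Rest": 64.045,
}
total_s = 699.127  # "Total" row of the same table

# Share of the total wall time per row, largest first.
shares = sorted(
    ((name, 100.0 * t / total_s) for name, t in wall_time_s.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, pct in shares:
    print(f"{name:<20s} {pct:5.1f} %")
```

PME mesh, Force, and Constraints come out on top and together account for roughly 80% of the wall time of this run.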

 

Force evaluation time GPU/CPU: 16.283 ms/18.597 ms = 0.876

For optimal performance this ratio should be close to 1!
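
The ratio in the log line above is simply the GPU time per step divided by the CPU time per step. This is a small Python sketch of that arithmetic with the two values copied from the log; the variable names are illustrative only.

```python
# Force evaluation time per step (ms), copied from the log line above.
gpu_ms = 16.283
cpu_ms = 18.597

# GPU/CPU ratio as reported by the log.
ratio = gpu_ms / cpu_ms

# Time per step by which the CPU side takes longer than the GPU side.
gap_ms = cpu_ms - gpu_ms

print(f"ratio: {ratio:.3f}, gap: {gap_ms:.3f} ms per step")
```

A ratio below 1 means the GPU force evaluation takes less time per step than the CPU one in this run, here by about 2.3 ms.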

 

Core t (s) : 5517.331   /   Wall t (s) : 699.127   /   (%) : 789.2

(ns/day) : 6.179   /   (hour/ns) : 3.884
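
The summary numbers above are related by simple arithmetic, sketched here in Python. The values are copied from the two summary lines; the variable names are illustrative, and the simulated length is back-calculated from ns/day and wall time because the timestep is not given in this post.

```python
# Summary values for the 1-GPU run, copied from the log lines above.
core_t_s = 5517.331   # "Core t (s)": CPU time summed over all threads
wall_t_s = 699.127    # "Wall t (s)": elapsed wall-clock time
ns_per_day = 6.179    # reported performance

# CPU usage as a percentage of one core.
cpu_usage_pct = 100.0 * core_t_s / wall_t_s

# hour/ns is the inverse of ns/day, scaled by 24 hours per day.
hours_per_ns = 24.0 / ns_per_day

# Simulated length implied by the performance and the wall time
# (86400 seconds per day).
simulated_ns = ns_per_day * wall_t_s / 86400.0

print(f"{cpu_usage_pct:.1f} %  {hours_per_ns:.3f} hour/ns  {simulated_ns:.4f} ns")
```

A usage of about 789% corresponds to nearly all 8 OpenMP threads being busy for the whole run, and the implied simulated length is about 0.05 ns.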

 

 

 

- Simulation 2 : i7-4790K / GTX980 x 2 / 32 GB RAM

Using 2 MPI threads

Using 4 OpenMP threads per tMPI thread

Compiled SIMD instructions: AVX_256

 

2 GPUs detected:

#0: NVIDIA GeForce GTX 980, compute cap.: 5.2, ECC: no, stat: compatible

#1: NVIDIA GeForce GTX 980, compute cap.: 5.2, ECC: no, stat: compatible

 

2 GPUs auto-selected for this run.

Mapping of GPUs to the 2 PP ranks in this node: #0, #1

 

av. #atoms communicated per step for force: 2 x 71069.2

av. #atoms communicated per step for LINCS: 2 x 3830.1

 

Average load imbalance: 3.8 %

Part of the total run time spent waiting due to load imbalance: 0.9 %

 

Computing           Num Ranks  Num Threads  Call Count  Wall time (s)  Giga-Cycles      %
Domain decomp.              2            4         625         13.465      430.878    2.1
DD comm. Load               2            4         122          0.000        0.009    0.0
Neighbor search             2            4         626         17.728      567.309    2.7
Launch GPU ops.             2            4       50002          3.738      119.623    0.6
Comm. coord.                2            4       24375          8.803      281.716    1.4
Force                       2            4       25001        124.601     3987.320   19.1
Wait + Comm. F              2            4       25001         16.110      515.532    2.5
PME mesh                    2            4       25001        258.526     8272.989   39.7
Wait GPU nonlocal           2            4       25001          9.763      312.417    1.5
Wait GPU local              2            4       25001          0.107        3.436    0.0
NB X/F buffer ops.          2            4       98752         18.668      597.392    2.9
Write traj.                 2            4          51          0.523       16.736    0.1
Update                      2            4       25001         34.983     1119.480    5.4
Constraints                 2            4       25001        100.109     3203.546   15.4
Comm. energies              2            4       25001          0.572       18.318    0.1
Rest                                                           43.410     1389.150    6.7
Total                                                         651.108    20835.850  100.0
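
To see where the two runs differ, the wall times of the largest rows that appear in both tables can be compared directly. This is a minimal Python sketch with values copied from the two tables in this post; the names `rows` and `deltas` are illustrative only.

```python
# Wall times (s) copied from the two tables: (1-GPU run, 2-GPU run).
rows = {
    "Force": (136.154, 124.601),
    "PME mesh": (328.789, 258.526),
    "Update": (35.727, 34.983),
    "Constraints": (98.884, 100.109),
    "Total": (699.127, 651.108),
}

# Change in wall time when going from 1 GPU to 2 GPUs (negative = faster).
deltas = {name: two - one for name, (one, two) in rows.items()}

for name, delta in deltas.items():
    one, _ = rows[name]
    print(f"{name:<12s} {delta:+9.3f} s ({100.0 * delta / one:+6.1f} %)")
```

PME mesh is about 70 s shorter in the 2-GPU run, while Update and Constraints barely change.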

 

Core t (s) : 5174.321   /   Wall t (s) : 653.845   /   (%) : 791.4

(ns/day) : 6.607   /   (hour/ns) : 3.632

 

 

 

Additionally, I ran another simulation with 16 GB RAM.

It showed performance similar to the 32 GB RAM run.

 

- Simulation 3 : i7-4790K / GTX980 x 1 / 16 GB RAM --> 6.165 ns/day

 

Is 32 GB too much RAM for running GROMACS?

 

 

Thank you.


