[gmx-users] performance 1 gpu

Thu Sep 25 12:50:52 CEST 2014

Hi.

I wonder whether GROMACS 4.6.7 can run faster on xsede.org, because the
log shows the CPU waiting for the GPU.

The node has 16 CPU cores (2.7 GHz), 1 Phi coprocessor, and 1 GPU.

I compiled GROMACS with GPU support and without Phi support, using the
Intel compiler and MKL.

I didn't install 5.0.1 because I worry that this bug might disrupt
equilibration when I switch from one ensemble to another
(http://redmine.gromacs.org/issues/1603).

Below are excerpts from the log:

Gromacs version:    VERSION 4.6.7
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled
GPU support:        enabled
invsqrt routine:    gmx_software_invsqrt(x)
CPU acceleration:   AVX_256
FFT library:        MKL
Large file support: enabled
RDTSCP usage:       enabled
Built on:           Wed Sep 24 08:33:22 CDT 2014
Built by:           jlu128 at login2.stampede.tacc.utexas.edu [CMAKE]
Build OS/arch:      Linux 2.6.32-431.17.1.el6.x86_64 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Build CPU family:   6   Model: 45   Stepping: 7
Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
sse4.2 ssse3 tdt x2apic
C compiler:
/opt/apps/intel/13/composer_xe_2013.3.163/bin/intel64/icc Intel icc (ICC)
13.1.1 20130313
C compiler flags:   -mavx    -mkl=sequential -std=gnu99 -Wall   -ip
-funroll-all-loops  -O3 -DNDEBUG
C++ compiler:
/opt/apps/intel/13/composer_xe_2013.3.163/bin/intel64/icc Intel icc (ICC)
13.1.1 20130313
C++ compiler flags: -mavx   -Wall   -ip -funroll-all-loops  -O3 -DNDEBUG
Linked with Intel MKL version 11.0.3.
CUDA compiler:      /opt/apps/cuda/6.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
CUDA compiler
flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;;
-mavx;-Wall;-ip;-funroll-all-loops;-O3;-DNDEBUG
CUDA driver:        6.0
CUDA runtime:       6.0

...
Using 1 MPI thread
Using 16 OpenMP threads

Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand:  Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Family:  6  Model: 45  Stepping:  7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256

1 GPU detected:
  #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0

Will do PME sum in reciprocal space.

...

 M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check         1517304.154000    13655737.386     0.1
 NxN Ewald Elec. + VdW [F]        370461474.587968 24450457322.806    92.7
 NxN Ewald Elec. + VdW [V&F]        3742076.012672   400402133.356     1.5
 1,4 nonbonded interactions          101910.006794     9171900.611     0.0
 Calc Weights                       1343655.089577    48371583.225     0.2
 Spread Q Bspline                  28664641.910976    57329283.822     0.2
 Gather F Bspline                  28664641.910976   171987851.466     0.7
 3D-FFT                           141557361.449024  1132458891.592     4.3
 Solve PME                            61439.887616     3932152.807     0.0
 Shift-X                              11197.154859       67182.929     0.0
 Angles                               71010.004734    11929680.795     0.0
 Propers                             108285.007219    24797266.653     0.1
 Impropers                             8145.000543     1694160.113     0.0
 Virial                               44856.029904      807408.538     0.0
 Stop-CM                               4478.909718       44789.097     0.0
 Calc-Ekin                            89577.059718     2418580.612     0.0
 Lincs                                39405.002627     2364300.158     0.0
 Lincs-Mat                           852120.056808     3408480.227     0.0
 Constraint-V                        487680.032512     3901440.260     0.0
 Constraint-Vir                       44827.529885     1075860.717     0.0
 Settle                              136290.009086    44021672.935     0.2
-----------------------------------------------------------------------------
 Total                                             26384297680.107   100.0
-----------------------------------------------------------------------------
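
To put a number on "CPU waits for GPU", the cycle accounting rows that
follow can be read programmatically. This is only a minimal sketch: the
rows are copied by hand from this message, and the regular expression and
the helper name parse_cycle_rows are my own assumptions, not part of any
GROMACS tool. It only ranks the rows quoted here by their share of the run.

```python
import re

# Rows copied from the cycle accounting table quoted in this message.
LOG_ROWS = """\
 Neighbor search        1   16     375001     578.663    24997.882     1.7
 Launch GPU ops.        1   16   15000001     814.410    35181.984     2.3
 Force                  1   16   15000001    2954.603   127637.010     8.5
 PME mesh               1   16   15000001   11736.454   507007.492    33.7
 Wait GPU local         1   16   15000001   11159.455   482081.496    32.0
 NB X/F buffer ops.     1   16   29625001    1061.959    45875.952     3.0
 Write traj.            1   16         39       5.207      224.956     0.0
"""

# Assumed row layout: name, nodes, threads, count, wall time (s), G-cycles, percent.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<nodes>\d+)\s+(?P<threads>\d+)\s+(?P<count>\d+)"
    r"\s+(?P<wall>[\d.]+)\s+(?P<gcycles>[\d.]+)\s+(?P<pct>[\d.]+)\s*$"
)


def parse_cycle_rows(text):
    """Return {row name: {"wall_s": float, "percent": float}} for matching lines."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group("name")] = {
                "wall_s": float(match.group("wall")),
                "percent": float(match.group("pct")),
            }
    return rows


rows = parse_cycle_rows(LOG_ROWS)

# List the quoted rows from largest to smallest share of the run.
for name, row in sorted(rows.items(), key=lambda kv: kv[1]["percent"], reverse=True):
    print(f"{name:<20s} {row['wall_s']:>12.3f} s {row['percent']:>6.1f} %")
```

In the rows quoted here, "PME mesh" and "Wait GPU local" come out on top
with roughly a third of the run each, which is the imbalance I am asking
about.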

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
-----------------------------------------------------------------------------
 Neighbor search        1   16     375001     578.663    24997.882     1.7
 Launch GPU ops.        1   16   15000001     814.410    35181.984     2.3
 Force                  1   16   15000001    2954.603   127637.010     8.5
 PME mesh               1   16   15000001   11736.454   507007.492    33.7
 Wait GPU local         1   16   15000001   11159.455   482081.496    32.0
 NB X/F buffer ops.     1   16   29625001    1061.959    45875.952     3.0
 Write traj.            1   16         39       5.207      224.956     0.0