[gmx-users] performance 1 gpu
Johnny Lu
johnny.lu128 at gmail.com
Thu Sep 25 12:50:52 CEST 2014
Hi,
I wonder whether GROMACS 4.6.7 can run faster on xsede.org, because the log
shows the CPU waiting for the GPU.
The node has 16 CPU cores (2.7 GHz), 1 Xeon Phi coprocessor, and 1 GPU.
I compiled GROMACS with GPU support and without Phi support, using the Intel
compiler and MKL.
I did not install 5.0.1 because I worry that this bug might disrupt
equilibration when I switch from one ensemble to another
(http://redmine.gromacs.org/issues/1603).
The relevant parts of the log are below:
Gromacs version: VERSION 4.6.7
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled
GPU support: enabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: AVX_256
FFT library: MKL
Large file support: enabled
RDTSCP usage: enabled
Built on: Wed Sep 24 08:33:22 CDT 2014
Built by: jlu128 at login2.stampede.tacc.utexas.edu [CMAKE]
Build OS/arch: Linux 2.6.32-431.17.1.el6.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Build CPU family: 6 Model: 45 Stepping: 7
Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
sse4.2 ssse3 tdt x2apic
C compiler:
/opt/apps/intel/13/composer_xe_2013.3.163/bin/intel64/icc Intel icc (ICC)
13.1.1 20130313
C compiler flags: -mavx -mkl=sequential -std=gnu99 -Wall -ip
-funroll-all-loops -O3 -DNDEBUG
C++ compiler:
/opt/apps/intel/13/composer_xe_2013.3.163/bin/intel64/icc Intel icc (ICC)
13.1.1 20130313
C++ compiler flags: -mavx -Wall -ip -funroll-all-loops -O3 -DNDEBUG
Linked with Intel MKL version 11.0.3.
CUDA compiler: /opt/apps/cuda/6.0/bin/nvcc nvcc: NVIDIA (R) Cuda
compiler driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on
Thu_Mar_13_11:58:58_PDT_2014;Cuda compilation tools, release 6.0, V6.0.1
CUDA compiler
flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_35,code=compute_35;-use_fast_math;;
-mavx;-Wall;-ip;-funroll-all-loops;-O3;-DNDEBUG
CUDA driver: 6.0
CUDA runtime: 6.0
...
Using 1 MPI thread
Using 16 OpenMP threads
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Family: 6 Model: 45 Stepping: 7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256
1 GPU detected:
#0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0
Will do PME sum in reciprocal space.
...
M E G A - F L O P S   A C C O U N T I N G
NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field    VdW=Van der Waals    QSTab=quadratic-spline table
W3=SPC/TIP3p    W4=TIP4p (single or pairs)
V&F=Potential and force    V=Potential only    F=Force only
Computing:                                 M-Number           M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check           1517304.154000      13655737.386     0.1
NxN Ewald Elec. + VdW [F]          370461474.587968   24450457322.806    92.7
NxN Ewald Elec. + VdW [V&F]          3742076.012672     400402133.356     1.5
1,4 nonbonded interactions            101910.006794       9171900.611     0.0
Calc Weights                         1343655.089577      48371583.225     0.2
Spread Q Bspline                    28664641.910976      57329283.822     0.2
Gather F Bspline                    28664641.910976     171987851.466     0.7
3D-FFT                             141557361.449024    1132458891.592     4.3
Solve PME                              61439.887616       3932152.807     0.0
Shift-X                                11197.154859         67182.929     0.0
Angles                                 71010.004734      11929680.795     0.0
Propers                               108285.007219      24797266.653     0.1
Impropers                               8145.000543       1694160.113     0.0
Virial                                 44856.029904        807408.538     0.0
Stop-CM                                 4478.909718         44789.097     0.0
Calc-Ekin                              89577.059718       2418580.612     0.0
Lincs                                  39405.002627       2364300.158     0.0
Lincs-Mat                             852120.056808       3408480.227     0.0
Constraint-V                          487680.032512       3901440.260     0.0
Constraint-Vir                         44827.529885       1075860.717     0.0
Settle                                136290.009086      44021672.935     0.2
-----------------------------------------------------------------------------
Total                                                 26384297680.107   100.0
-----------------------------------------------------------------------------
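
To make the flop split explicit, here is a minimal Python sketch that recomputes the "% Flops" column for the three largest entries from the M-Flops values in the table. The variable names are mine, and the reading that the NxN kernels are the part offloaded to the GPU in this build is an assumption, not something the log states.

```python
# M-Flops values copied from the accounting table in this log.
mflops = {
    "NxN Ewald Elec. + VdW [F]": 24450457322.806,
    "NxN Ewald Elec. + VdW [V&F]": 400402133.356,
    "3D-FFT": 1132458891.592,
}
total_mflops = 26384297680.107  # "Total" row of the same table


def percent(value, total):
    """Return value as a percentage of total."""
    return 100.0 * value / total


# Share of each entry, rounded to one decimal like the log's "% Flops" column.
shares = {name: round(percent(v, total_mflops), 1) for name, v in mflops.items()}

# Combined share of both NxN kernels (assumed here to be the GPU-side work).
nxn_share = percent(
    mflops["NxN Ewald Elec. + VdW [F]"] + mflops["NxN Ewald Elec. + VdW [V&F]"],
    total_mflops,
)

for name, share in shares.items():
    print(f"{name:<30} {share:5.1f} %")
print(f"{'NxN kernels combined':<30} {nxn_share:5.1f} %")
```

The recomputed shares should match the log's column; the combined NxN figure is only a summary of the two rows, not a separate line in the log.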
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
Computing:             Nodes  Th.      Count  Wall t (s)     G-Cycles       %
-----------------------------------------------------------------------------
Neighbor search            1   16     375001     578.663    24997.882     1.7
Launch GPU ops.            1   16   15000001     814.410    35181.984     2.3
Force                      1   16   15000001    2954.603   127637.010     8.5
PME mesh                   1   16   15000001   11736.454   507007.492    33.7
Wait GPU local             1   16   15000001   11159.455   482081.496    32.0
NB X/F buffer ops.         1   16   29625001    1061.959    45875.952     3.0
Write traj.                1   16         39       5.207      224.956     0.0
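
To put a number on the "CPU waits for GPU" observation, here is a minimal Python sketch that parses the rows shown above and compares "Wait GPU local" with "PME mesh". It uses only the rows in this excerpt (the log is cut off before its total line), the helper name parse_row is my own, and reading "Wait GPU local" as CPU idle time spent waiting for the GPU is my interpretation of the row.

```python
# Rows copied from the cycle and time accounting table in this log excerpt.
ROWS = """\
Neighbor search 1 16 375001 578.663 24997.882 1.7
Launch GPU ops. 1 16 15000001 814.410 35181.984 2.3
Force 1 16 15000001 2954.603 127637.010 8.5
PME mesh 1 16 15000001 11736.454 507007.492 33.7
Wait GPU local 1 16 15000001 11159.455 482081.496 32.0
NB X/F buffer ops. 1 16 29625001 1061.959 45875.952 3.0
Write traj. 1 16 39 5.207 224.956 0.0
"""


def parse_row(line):
    """Split one row into (name, call count, wall seconds, percent).

    The name may contain spaces, so the six numeric fields are split
    off from the right-hand side of the line.
    """
    name, _nodes, _threads, count, wall, _gcycles, pct = line.rsplit(None, 6)
    return name, int(count), float(wall), float(pct)


rows = {}
for line in ROWS.splitlines():
    name, count, wall, pct = parse_row(line)
    rows[name] = {"count": count, "wall_s": wall, "pct": pct}

wait = rows["Wait GPU local"]
pme = rows["PME mesh"]

# Wall time spent waiting for the GPU relative to the time spent in PME mesh.
wait_to_pme = wait["wall_s"] / pme["wall_s"]

# Average wait per call, in milliseconds.
wait_per_step_ms = 1000.0 * wait["wall_s"] / wait["count"]

print(f"Wait GPU local / PME mesh wall time: {wait_to_pme:.3f}")
print(f"Average wait per call: {wait_per_step_ms:.3f} ms")
```

By this reading, the wait is almost as long as the whole PME mesh calculation, which is why I suspect the run could be faster.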