[gmx-users] GTX 960 vs Tesla K40
Alex
nedomacho at gmail.com
Mon Jun 18 23:35:04 CEST 2018
Persistence is enabled so I don't have to overclock again. To be honest, I
am still not entirely comfortable with the notion of ranks, even after reading
the acceleration documentation several times. Parts of the log file are below,
and I will obviously appreciate suggestions/clarifications:
Command line:
gmx mdrun -nt 4 -ntmpi 2 -npme 1 -pme gpu -nb gpu -s run_unstretch.tpr -o
traj_unstretch.trr -g md.log -c unstretched.gro
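
A rough reading of that rank/thread layout, added here as an annotation rather
than mdrun output, with flag meanings as documented for GROMACS 2018 mdrun:

    # -nt 4     : 4 threads in total
    # -ntmpi 2  : split into 2 thread-MPI ranks
    # -npme 1   : dedicate 1 of those ranks to PME
    # => 1 PP rank + 1 PME rank, 2 OpenMP threads each, which is what the
    #    cycle accounting further down reports.
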
GROMACS version: 2018
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: SSE4.1
FFT library: fftw-3.3.5-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-02-13 19:43:29
Built by: smolyan at MINTbox [CMAKE]
Build OS/arch: Linux 4.4.0-112-generic x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU W3530 @ 2.80GHz
Build CPU family: 6 Model: 26 Stepping: 5
Build CPU features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/cc GNU 5.4.0
C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 5.4.0
C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on
Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
CUDA compiler
flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;;
;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 9.10
CUDA runtime: 9.10
Running on 1 node with total 4 cores, 4 logical cores, 1 compatible GPU
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU W3530 @ 2.80GHz
Family: 6 Model: 26 Stepping: 5
Features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc
pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0]
Socket 1: [ 1]
Socket 2: [ 2]
Socket 3: [ 3]
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla K40c, compute cap.: 3.5, ECC: no, stat: compatible
................
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 547029.956656 4923269.610 0.0
NxN Ewald Elec. + LJ [F] 485658021.416832 32053429413.511 98.0
NxN Ewald Elec. + LJ [V&F] 4905656.839680 524905281.846 1.6
1,4 nonbonded interactions 140625.005625 12656250.506 0.0
Reset In Box 4599.000000 13797.000 0.0
CG-CoM 4599.018396 13797.055 0.0
Bonds 48000.001920 2832000.113 0.0
Angles 94650.003786 15901200.636 0.0
RB-Dihedrals 186600.007464 46090201.844 0.1
Pos. Restr. 2600.000104 130000.005 0.0
Virial 4610.268441 82984.832 0.0
Stop-CM 91.998396 919.984 0.0
Calc-Ekin 45990.036792 1241730.993 0.0
Constraint-V 318975.012759 2551800.102 0.0
Constraint-Vir 3189.762759 76554.306 0.0
Settle 106325.004253 34342976.374 0.1
Virtual Site 3 107388.258506 3973365.565 0.0
-----------------------------------------------------------------------------
Total 32703165544.282 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 0.0
av. #atoms communicated per step for vsites: 3 x 0.0
av. #atoms communicated per step for LINCS: 2 x 0.0
Average PME mesh/force load: 1.193
Part of the total run time spent waiting due to PP/PME imbalance: 5.1 %
NOTE: 5.1 % performance was lost because the PME ranks
had more work to do than the PP ranks.
You might want to increase the number of PME ranks
or increase the cut-off and the grid spacing.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank doing PP, using 2 OpenMP threads, and
on 1 MPI rank doing PME, using 2 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 1 2 250000 975.157 5461.106 1.0
DD comm. load 1 2 25002 0.009 0.053 0.0
Vsite constr. 1 2 25000001 2997.638 16787.470 3.1
Send X to PME 1 2 25000001 806.884 4518.740 0.8
Neighbor search 1 2 250001 1351.275 7567.455 1.4
Launch GPU ops. 1 2 50000002 7767.373 43499.093 8.0
Comm. coord. 1 2 24750000 4.359 24.410 0.0
Force 1 2 25000001 8994.482 50371.185 9.3
Wait + Comm. F 1 2 25000001 3.992 22.355 0.0
PME mesh * 1 2 25000001 30757.016 172246.434 31.7
PME wait for PP * 17821.979 99807.221 18.3
Wait + Recv. PME F 1 2 25000001 3355.753 18792.998 3.5
Wait PME GPU gather 1 2 25000001 25539.917 143029.467 26.3
Wait GPU NB nonloc. 1 2 25000001 61.503 344.432 0.1
Wait GPU NB local 1 2 25000001 15384.720 86158.005 15.8
NB X/F buffer ops. 1 2 99500002 1817.951 10180.950 1.9
Vsite spread 1 2 25250002 3417.205 19137.139 3.5
Write traj. 1 2 2554 18.100 101.362 0.0
Update 1 2 25000001 1832.047 10259.890 1.9
Constraints 1 2 25000001 3232.961 18105.330 3.3
Comm. energies 1 2 1250001 5.858 32.805 0.0
-----------------------------------------------------------------------------
Total 48578.997 544107.322 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 194315.986 48578.997 400.0
13h29:38
(ns/day) (hour/ns)
Performance: 88.927 0.270
Finished mdrun on rank 0 Mon Jun 18 07:42:59 2018
On Mon, Jun 18, 2018 at 3:23 PM, Szilárd Páll <pall.szilard at gmail.com>
wrote:
> On Mon, Jun 18, 2018 at 2:22 AM, Alex <nedomacho at gmail.com> wrote:
>
> > Thanks for the heads up. With the K40c instead of GTX 960 here's what I
> > did and here are the results:
> >
> > 1. Enabled persistence mode and overclocked the card via nvidia-smi:
> > http://acceleware.com/blog/gpu-boost-nvidias-tesla-k40-gpus
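
[For reference, the relevant commands are roughly the ones below; this is a
sketch rather than output from this box, and the clock pair shown is the
commonly cited K40 boost setting, so check what your card actually supports.]

    # query the supported memory,graphics clock pairs
    nvidia-smi -q -d SUPPORTED_CLOCKS
    # enable persistence mode (kept until reboot or until disabled)
    sudo nvidia-smi -pm 1
    # set application clocks to memory,graphics in MHz (example K40 boost pair)
    sudo nvidia-smi -ac 3004,875
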
>
>
> Note that persistence mode is only for convenience.
>
>
> > 2. Offloaded PME's FFT to the GPU (which wasn't the case with the GTX 960);
> > this brought the "pme mesh/force" ratio to something like 1.07.
> >
>
> I still think you are running multiple ranks, which is unlikely to be ideal,
> but without seeing a log file it's hard to tell.
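
[For comparison, the single-rank, OpenMP-only layout being suggested here would
look roughly like the line below; the flags are standard GROMACS 2018 mdrun
options and the file names are copied from the command at the top of this mail.]

    gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pme gpu -pmefft gpu \
        -s run_unstretch.tpr -o traj_unstretch.trr -g md.log -c unstretched.gro
    # 1 thread-MPI rank with 4 OpenMP threads; nonbonded and the full PME task
    # are offloaded to the single GPU, so there is no separate PME rank and no
    # PP/PME load balancing to worry about.
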
>
> > The result is a solid increase in performance on a small-ish system (20K
> > atoms): 90 ns/day instead of 65-70. I don't use this box for anything
> > except prototyping, but still the swap + tweaks were pretty useful.
>
>
> >
> > Alex
> >
> >
> >
> > On 6/15/2018 1:20 PM, Szilárd Páll wrote:
> >
> >> Hi,
> >>
> >> Regarding the K40 vs GTX 960 question, the K40 will likely be a bit
> >> faster (though it'll consume more power, if that matters). The
> >> difference will be at most 20% in total performance, I think -- and
> >> with small systems likely negligible (as a smaller card with higher
> >> clocks is more efficient at small tasks than a large card with lower
> >> clocks).
> >>
> >> Regarding the load balance note, you are correct, the "pme mesh/force"
> >> means the ratio of time spent in computing PME forces on a separate
> >> task/rank and the rest of the forces (including nonbonded, bonded,
> >> etc.). With GPU offload this is a bit more tricky as the observed time
> >> is the time spent waiting for the GPU results, but the take-away is
> >> the same: when a run shows "pme mesh/force" far from 1, there is
> >> imbalance affecting performance.
> >>
> >> However, note that with a single GPU I've yet to see a case where you
> >> get better performance by running multiple ranks rather than simply
> >> running OpenMP-only. Also note that what counts as a "weak GPU" varies
> >> case-by-case, so I recommend taking the 1-2 minutes to do a short run
> >> and check whether, for a given hardware + simulation setup, it is better
> >> to offload all of PME or to keep the FFTs on the CPU.
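
[A sketch of such a quick test, using the tpr from the top of this thread; the
benchmark-related flags (-nsteps, -resethway, -noconfout) are standard mdrun
options, and the log file names here are just illustrative.]

    # variant 1: full PME offload, FFTs on the GPU
    gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pme gpu -pmefft gpu \
        -s run_unstretch.tpr -nsteps 20000 -resethway -noconfout -g pme_fft_gpu.log
    # variant 2: PME offloaded, but the FFT part kept on the CPU (mixed mode)
    gmx mdrun -ntmpi 1 -ntomp 4 -nb gpu -pme gpu -pmefft cpu \
        -s run_unstretch.tpr -nsteps 20000 -resethway -noconfout -g pme_fft_cpu.log
    # then compare the ns/day reported at the end of the two log files
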
> >>
> >> We'll do our best to automate more of these choices, but for now if
> >> you care about performance it's useful to test before doing long runs.
> >>
> >> Cheers,
> >> --
> >> Szilárd
> >>
> >>
> >> On Thu, Jun 14, 2018 at 2:09 AM, Alex <nedomacho at gmail.com> wrote:
> >>
> >>> Question: in the DD output (md.log) that looks like "DD step xxxxxx pme
> >>> mesh/force 1.229," what is the ratio? Does it mean the pme calculations
> >>> take longer by the shown factor than the nonbonded interactions?
> >>> With GTX 960, the ratio is consistently ~0.85, with Tesla K40 it's ~1.25.
> >>> My mdrun line contains -pmefft cpu (per Szilard's advice for weak GPUs,
> >>> I believe). Would it then make sense to offload the fft to the K40?
> >>>
> >>> Thank you,
> >>>
> >>> Alex
> >>>
> >>> On Wed, Jun 13, 2018 at 4:53 PM, Alex <nedomacho at gmail.com> wrote:
> >>>
> >>> So, swap, then? Thank you!
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jun 13, 2018 at 4:49 PM, paul buscemi <pbuscemi at q.com> wrote:
> >>>>
> >>>> flops trumps clock speed…..
> >>>>>
> >>>>> On Jun 13, 2018, at 3:45 PM, Alex <nedomacho at gmail.com> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I have an old "prototyping" box with a 4-core Xeon and an old GTX 960.
> >>>>>> We have a Tesla K40 lying around and there's only one PCIe slot
> >>>>>> available in this machine. Would it make sense to swap the cards, or is
> >>>>>> it already bottlenecked by the CPU? I compared the specs and the 960 has
> >>>>>> a higher clock speed, while the K40's FP performance is better. Should I
> >>>>>> swap the GPUs?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alex