[gmx-users] using dual CPU's
paul buscemi
pbuscemi at q.com
Mon Dec 10 23:22:34 CET 2018
Mark,
I may have misread the presentation on optimization, but I did experiment with variations of -ntomp and -ntmpi, so the run using fewer than six threads was a 2 x 3 combination. Tonight I will put both cards in the i7-7700 and report back.
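For reference, a rough sketch of the thread/GPU splits I have been cycling through on the 12-thread i7-970 (the exact splits are just what I happened to try, not a recommendation; the single-card line restricts to GPU 0 as an example):

gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 6 -gpu_id 01 -pin on    # 2 ranks x 6 threads, both cards
gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 3 -gpu_id 01 -pin on    # 2 ranks x 3 threads, the "fewer than six" case
gmx mdrun -deffnm SR.sys.nvt -ntmpi 1 -ntomp 12 -gpu_id 0 -pin on    # 1 rank x 12 threads, single card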
======================== this is the last part of the log from the 2-GPU setup ========================
using: gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 6 -gpu_id 1 -pin on, run on the i7-970 CPU
NOTE: DLB can now turn on, when beneficial
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 2401 steps using 25 frames
Energies (kJ/mol)
Angle G96Angle Proper Dih. Improper Dih. LJ-14
9.21440e+05 1.96052e+04 6.53857e+04 2.23128e+02 8.65164e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
-2.84582e+07 -1.44895e+05 -2.04658e+03 1.34455e+07 5.03949e+04
Position Rest. Potential Kinetic En. Total Energy Temperature
3.44645e+01 -1.40160e+07 1.91196e+05 -1.38249e+07 3.04725e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-2.88685e+00 3.64550e+02 0.00000e+00
Total Virial (kJ/mol)
-8.80572e+04 -5.06693e+03 6.90580e+02
-5.06777e+03 -6.31180e+04 -5.32400e+03
6.90136e+02 -5.32396e+03 -5.27950e+04
Pressure (bar)
4.14166e+02 1.39915e+01 -1.79346e+00
1.39938e+01 3.54006e+02 1.44453e+01
-1.79223e+00 1.44452e+01 3.25476e+02
T-PDMS T-VMOS
2.98272e+02 6.83205e+02
P P - P M E L O A D B A L A N C I N G
NOTE: The PP/PME load balancing was limited by the maximum allowed grid scaling,
you might not have reached a good load balance.
PP/PME load balancing changed the cut-off and PME settings:
particle-particle PME
rcoulomb rlist grid spacing 1/beta
initial 1.000 nm 1.000 nm 160 160 128 0.156 nm 0.320 nm
final 1.628 nm 1.628 nm 96 96 80 0.260 nm 0.521 nm
cost-ratio 4.31 0.23
(note that these numbers concern only part of the total PP and PME load)
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 225.527520 2029.748 0.0
NxN Ewald Elec. + LJ [F] 255071.893824 16834744.992 91.2
NxN Ewald Elec. + LJ [V&F] 2710.128064 289983.703 1.6
1,4 nonbonded interactions 432.540150 38928.613 0.2
Calc Weights 543.250260 19557.009 0.1
Spread Q Bspline 11589.338880 23178.678 0.1
Gather F Bspline 11589.338880 69536.033 0.4
3D-FFT 129115.579906 1032924.639 5.6
Solve PME 31.785216 2034.254 0.0
Reset In Box 1.885500 5.656 0.0
CG-CoM 1.960920 5.883 0.0
Angles 342.430620 57528.344 0.3
Propers 72.102030 16511.365 0.1
Impropers 0.432180 89.893 0.0
Pos. Restr. 3.457440 172.872 0.0
Virial 1.887750 33.979 0.0
Update 181.083420 5613.586 0.0
Stop-CM 1.960920 19.609 0.0
Calc-Ekin 3.771000 101.817 0.0
Lincs 375.988360 22559.302 0.1
Lincs-Mat 8530.590144 34122.361 0.2
Constraint-V 751.820250 6014.562 0.0
Constraint-Vir 1.956622 46.959 0.0
-----------------------------------------------------------------------------
Total 18455743.858 100.0
-----------------------------------------------------------------------------
D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
av. #atoms communicated per step for force: 2 x 6018.1
av. #atoms communicated per step for LINCS: 2 x 3015.7
Dynamic load balancing report:
DLB was off during the run due to low measured imbalance.
Average load imbalance: 0.9%.
The balanceable part of the MD step is 47%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 0.4%.
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 2 MPI ranks, each using 6 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Domain decomp. 2 6 25 0.627 24.367 0.8
DD comm. load 2 6 2 0.000 0.004 0.0
Neighbor search 2 6 25 0.160 6.206 0.2
Launch GPU ops. 2 6 4802 0.516 20.048 0.7
Comm. coord. 2 6 2376 0.272 10.563 0.4
Force 2 6 2401 3.714 144.331 4.9
Wait + Comm. F 2 6 2401 0.210 8.173 0.3
PME mesh 2 6 2401 49.851 1937.315 66.2
Wait GPU NB nonloc. 2 6 2401 0.056 2.157 0.1
Wait GPU NB local 2 6 2401 0.033 1.285 0.0
NB X/F buffer ops. 2 6 9554 0.641 24.920 0.9
Write traj. 2 6 2 0.040 1.559 0.1
Update 2 6 4802 1.690 65.662 2.2
Constraints 2 6 4802 10.001 388.661 13.3
Comm. energies 2 6 25 0.003 0.107 0.0
Rest 7.511 291.885 10.0
-----------------------------------------------------------------------------
Total 75.323 2927.243 100.0
-----------------------------------------------------------------------------
Breakdown of PME mesh computation
-----------------------------------------------------------------------------
PME redist. X/F 2 6 4802 2.694 104.683 3.6
PME spread 2 6 2401 10.619 412.680 14.1
PME gather 2 6 2401 9.157 355.857 12.2
PME 3D-FFT 2 6 4802 21.805 847.398 28.9
PME 3D-FFT Comm. 2 6 4802 4.471 173.761 5.9
PME solve Elec 2 6 2401 1.067 41.480 1.4
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 903.878 75.323 1200.0
(ns/day) (hour/ns)
Performance: 2.754 8.714
Finished mdrun on rank 0 Sun Dec 9 20:36:30 2018
===============================================================================
======================== this is the last part of the log from the 1-GPU setup ========================
using: gmx mdrun -deffnm SR.sys.nvt -ntmpi 1 -ntomp 12 -gpu_id 01 -pin on, run on the Intel i7-970
step 1200: timed with pme grid 112 108 96, coulomb cutoff 1.395: 3000.4 M-cycles
Step Time
1200 1.20000
Writing checkpoint, step 1200 at Sun Dec 9 20:27:47 2018
Energies (kJ/mol)
Angle G96Angle Proper Dih. Improper Dih. LJ-14
9.21561e+05 1.42782e+04 6.60879e+04 2.04484e+02 8.39065e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
-2.84398e+07 -1.44481e+05 -2.04658e+03 1.34476e+07 3.82740e+04
Position Rest. Potential Kinetic En. Total Energy Temperature
3.92568e+01 -1.40143e+07 1.86727e+05 -1.38276e+07 2.97602e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-2.88685e+00 1.92481e+01 0.00000e+00
<====== ############### ==>
<==== A V E R A G E S ====>
<== ############### ======>
Statistics over 1201 steps using 13 frames
Energies (kJ/mol)
Angle G96Angle Proper Dih. Improper Dih. LJ-14
9.24025e+05 2.25759e+04 6.46951e+04 2.25055e+02 8.86630e+04
Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
-2.84705e+07 -1.45696e+05 -2.04658e+03 1.34231e+07 7.81266e+04
Position Rest. Potential Kinetic En. Total Energy Temperature
2.47925e+01 -1.40168e+07 1.93813e+05 -1.38230e+07 3.08896e+02
Pres. DC (bar) Pressure (bar) Constr. rmsd
-2.88685e+00 6.63095e+02 0.00000e+00
Total Virial (kJ/mol)
-2.04748e+05 -1.20971e+04 1.35853e+02
-1.20969e+04 -1.60243e+05 -1.17082e+04
1.35807e+02 -1.17081e+04 -1.59982e+05
Pressure (bar)
7.39235e+02 3.34709e+01 3.22280e-02
3.34703e+01 6.25486e+02 3.18788e+01
3.23543e-02 3.18787e+01 6.24566e+02
T-PDMS T-VMOS
2.96678e+02 1.02554e+03
P P - P M E L O A D B A L A N C I N G
PP/PME load balancing changed the cut-off and PME settings:
particle-particle PME
rcoulomb rlist grid spacing 1/beta
initial 1.000 nm 1.000 nm 160 160 128 0.156 nm 0.320 nm
final 1.389 nm 1.389 nm 120 108 96 0.222 nm 0.445 nm
cost-ratio 2.68 0.38
(note that these numbers concern only part of the total PP and PME load)
M E G A - F L O P S A C C O U N T I N G
NB=Group-cutoff nonbonded kernels NxN=N-by-N cluster Verlet kernels
RF=Reaction-Field VdW=Van der Waals QSTab=quadratic-spline table
W3=SPC/TIP3p W4=TIP4p (single or pairs)
V&F=Potential and force V=Potential only F=Force only
Computing: M-Number M-Flops % Flops
-----------------------------------------------------------------------------
Pair Search distance check 113.300752 1019.707 0.0
NxN Ewald Elec. + LJ [F] 100343.174976 6622649.548 96.9
NxN Ewald Elec. + LJ [V&F] 1114.688448 119271.664 1.7
1,4 nonbonded interactions 216.360150 19472.413 0.3
Shift-X 0.980460 5.883 0.0
Angles 171.286620 28776.152 0.4
Propers 36.066030 8259.121 0.1
Impropers 0.216180 44.965 0.0
Pos. Restr. 1.729440 86.472 0.0
Virial 0.981045 17.659 0.0
Update 90.579420 2807.962 0.0
Stop-CM 1.055880 10.559 0.0
Calc-Ekin 1.960920 52.945 0.0
Lincs 181.093320 10865.599 0.2
Lincs-Mat 4114.301760 16457.207 0.2
Constraint-V 362.035980 2896.288 0.0
Constraint-Vir 0.979290 23.503 0.0
-----------------------------------------------------------------------------
Total 6832717.647 100.0
-----------------------------------------------------------------------------
R E A L C Y C L E A N D T I M E A C C O U N T I N G
On 1 MPI rank, each using 12 OpenMP threads
Computing: Num Num Call Wall time Giga-Cycles
Ranks Threads Count (s) total sum %
-----------------------------------------------------------------------------
Neighbor search 1 12 13 0.163 6.350 1.0
Launch GPU ops. 1 12 2402 0.326 12.683 1.9
Force 1 12 1201 1.813 70.465 10.6
Wait PME GPU gather 1 12 1201 0.936 36.381 5.5
Reduce GPU PME F 1 12 1201 0.300 11.659 1.8
Wait GPU NB local 1 12 1201 2.156 83.786 12.6
NB X/F buffer ops. 1 12 2389 0.462 17.965 2.7
Write traj. 1 12 2 0.076 2.952 0.4
Update 1 12 2402 0.822 31.959 4.8
Constraints 1 12 2402 3.896 151.425 22.8
Rest 6.140 238.626 35.9
-----------------------------------------------------------------------------
Total 17.092 664.251 100.0
-----------------------------------------------------------------------------
Core t (s) Wall t (s) (%)
Time: 205.106 17.092 1200.0
(ns/day) (hour/ns)
Performance: 6.071 3.953
Finished mdrun on rank 0 Sun Dec 9 20:27:48 2018
========================================================================
I'll put the two cards in an i7-7700 and report back later tonight.
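One thing I notice when comparing the two accounting tables above: in the 2-GPU run the PME mesh stays on the CPU and takes about 66% of the wall time, while the 1-GPU run offloads PME to the card (the "Wait PME GPU gather" and "Reduce GPU PME F" rows). So on the new box I also plan to try putting PME on the second card with a dedicated PME rank; a rough sketch of the command, assuming the 2018 PME offload accepts one separate PME rank mapped this way:

gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -npme 1 -ntomp 6 -nb gpu -pme gpu -gputasks 01 -pin on    # nonbondeds on GPU 0, the single PME rank on GPU 1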
Paul
On Dec 10 2018, at 3:53 pm, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> Hi,
>
> One of your reported runs only used six threads, by the way.
> Something sensible can be said when the performance report at the end of
> the log file can be seen.
>
> Mark
> On Tue., 11 Dec. 2018, 01:25 p buscemi, <pbuscemi at q.com> wrote:
> > Thank you, Mark, for the prompt response. I realize the limitations of the
> > system (it's over 8 years old), but I did not expect the speed to decrease by 50%
> > with 12 available threads! No combination of -ntomp and -ntmpi could raise
> > ns/day above 4 with two GPUs, vs 6 with one GPU.
> >
> > This is actually a learning/practice run for a new build - an AMD 4.2 GHz
> > 32-core Threadripper with 64 GB RAM. In this case I am trying to decide between an RTX
> > 2080 Ti and two GTX 1080 Tis. I'd prefer the two 1080s for their ~7000 CUDA cores vs
> > the ~4500 of the 2080. The model systems will have ~1 million particles
> > and need the speed. But this is a major expense, so I need to get it right.
> > I'll do as you suggest and report the results for both systems and I
> > really appreciate the assist.
> > Paul
> > UMN, BICB
> >
> > On Dec 9 2018, at 4:32 pm, paul buscemi <pbuscemi at q.com> wrote:
> > >
> > > Dear Users,
> > > I have had good luck using a single GPU with the basic setup. However, in
> > > going from one GTX 1060 to a system with two (50,000 atoms), the rate
> > > decreased from 10 ns/day to 5 or worse. The system models a ligand, solvent
> > > (water), and a lipid membrane.
> > > The CPU is a 6-core Intel i7-970 (12 threads), 750 W PSU, 16 GB RAM.
> > > With the basic mdrun command I get:
> > > Back Off! I just backed up sys.nvt.log to ./#.sys.nvt.log.10#
> > > Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
> > > Changing nstlist from 10 to 100, rlist from 1 to 1
> > >
> > > Using 2 MPI threads
> > > Using 6 OpenMP threads per tMPI thread
> > >
> > > On host I7 2 GPUs auto-selected for this run.
> > > Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
> > > PP:0,PP:1
> > >
> > > Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.10#
> > > Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.10#
> > > NOTE: DLB will not turn on during the first phase of PME tuning
> > > starting mdrun 'SR-TA'
> > > 100000 steps, 100.0 ps.
> > > and ending with ^C
> > >
> > > Received the INT signal, stopping within 200 steps
> > > Dynamic load balancing report:
> > > DLB was locked at the end of the run due to unfinished PP-PME balancing.
> > > Average load imbalance: 0.7%.
> > > The balanceable part of the MD step is 46%, load imbalance is computed
> >
> > from this.
> > > Part of the total run time spent waiting due to load imbalance: 0.3%.
> > >
> > >
> > > Core t (s) Wall t (s) (%)
> > > Time: 543.475 45.290 1200.0
> > > (ns/day) (hour/ns)
> > > Performance: 1.719 13.963 before DLB is turned on
> > >
> > > Very poor performance. I have been following - or trying to follow -
> > > "Performance Tuning and Optimization of GROMACS" by M. Abraham and R. Apostolov
> > > (2016), but have not yet cracked the code.
> > > ----------------
> > > gmx mdrun -deffnm SR.sys.nvt -ntmpi 2 -ntomp 3 -gpu_id 01 -pin on
> > >
> > >
> > > Back Off! I just backed up SR.sys.nvt.log to ./#SR.sys.nvt.log.13#
> > > Reading file SR.sys.nvt.tpr, VERSION 2018.3 (single precision)
> > > Changing nstlist from 10 to 100, rlist from 1 to 1
> > >
> > > Using 2 MPI threads
> > > Using 3 OpenMP threads per tMPI thread
> > >
> > > On host I7 2 GPUs auto-selected for this run.
> > > Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
> > > PP:0,PP:1
> > >
> > > Back Off! I just backed up SR.sys.nvt.trr to ./#SR.sys.nvt.trr.13#
> > > Back Off! I just backed up SR.sys.nvt.edr to ./#SR.sys.nvt.edr.13#
> > > NOTE: DLB will not turn on during the first phase of PME tuning
> > > starting mdrun 'SR-TA'
> > > 100000 steps, 100.0 ps.
> > >
> > > NOTE: DLB can now turn on, when beneficial
> > > ^C
> > >
> > > Received the INT signal, stopping within 200 steps
> > > Dynamic load balancing report:
> > > DLB was off during the run due to low measured imbalance.
> > > Average load imbalance: 0.7%.
> > > The balanceable part of the MD step is 46%, load imbalance is computed
> >
> > from this.
> > > Part of the total run time spent waiting due to load imbalance: 0.3%.
> > >
> > >
> > > Core t (s) Wall t (s) (%)
> > > Time: 953.837 158.973 600.0
> > > (ns/day) (hour/ns)
> > > Performance: 2.935 8.176
> > >
> > > ====================
> > > the beginning of the log file is
> > > GROMACS version: 2018.3
> > > Precision: single
> > > Memory model: 64 bit
> > > MPI library: thread_mpi
> > > OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> > > GPU support: CUDA
> > > SIMD instructions: SSE4.1
> > > FFT library: fftw-3.3.8-sse2
> > > RDTSCP usage: enabled
> > > TNG support: enabled
> > > Hwloc support: disabled
> > > Tracing support: disabled
> > > Built on: 2018-10-19 21:26:38
> > > Built by: pb at Q4 [CMAKE]
> > > Build OS/arch: Linux 4.15.0-20-generic x86_64
> > > Build CPU vendor: Intel
> > > Build CPU brand: Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
> > > Build CPU family: 6 Model: 44 Stepping: 2
> > > Build CPU features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> >
> > nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
> > sse4.2 ssse3
> > > C compiler: /usr/bin/gcc-6 GNU 6.4.0
> > > C compiler flags: -msse4.1 -O3 -DNDEBUG -funroll-all-loops
> >
> > -fexcess-precision=fast
> > > C++ compiler: /usr/bin/g++-6 GNU 6.4.0
> > > C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG -funroll-all-loops
> >
> > -fexcess-precision=fast
> > > CUDA compiler: /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> >
> > driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on
> > Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
> > > CUDA compiler
> >
> > flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;;
> > ;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> > > CUDA driver: 9.10
> > > CUDA runtime: 9.10
> > >
> > >
> > > Running on 1 node with total 12 cores, 12 logical cores, 2 compatible
> > GPUs
> > > Hardware detected:
> > > CPU info:
> > > Vendor: Intel
> > > Brand: Intel(R) Core(TM) i7 CPU 970 @ 3.20GHz
> > > Family: 6 Model: 44 Stepping: 2
> > > Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> >
> > nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
> > sse4.2 ssse3
> > > Hardware topology: Only logical processor count
> > > GPU info:
> > > Number of GPUs detected: 2
> > > #0: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat:
> >
> > compatible
> > > #1: NVIDIA GeForce GTX 1060 6GB, compute cap.: 6.1, ECC: no, stat:
> >
> > compatible
> > >
> > >
> > > There were no errors encountered during the runs. Suggestions would be
> > appreciated.
> > > Regards
> > > Paul
> > >
> >