mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)
Mark Abraham
mark.j.abraham at gmail.com
Sun Nov 10 11:00:48 CET 2013
On Sun, Nov 10, 2013 at 5:28 AM, Dwey Kauffman <mpi566 at gmail.com> wrote:
> Hi Szilard,
>
> Thank you very much for your suggestions.
>
> >Actually, I was jumping to conclusions too early, as you mentioned AMD
> >"cluster", I assumed you must have 12-16-core Opteron CPUs. If you
> >have an 8-core (desktop?) AMD CPU, then you may not need to run more
> >than one rank per GPU.
>
> Yes, we do have independent clusters of AMD, AMD Opteron, and Intel Core i7
> machines. All nodes of the three clusters have at least one GPU card
> installed. I have run the same test on all three clusters.
>
> Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
> node with an 8-core AMD CPU.
> Using one GPU, we get a performance of ~32 ns/day. Using two GPUs, we gain
> not much more (~38.5 ns/day), about 20% more performance. Even that is not
> reliable, because in some tests I saw only 2-5% more, which really
> surprised me.
Neither run had a PP-PME work distribution suitable for the hardware it was
running on (and fixing that for each run requires opposite changes). Adding
a GPU and hoping to see scaling requires that there be proportionately more
GPU work available to do, *and* enough absolute work to do. mdrun tries to
arrange this, and reports the outcome early in the log file, which is one of
the reasons Szilard asked to see whole log files - please use a file-sharing
service to share them.
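As a back-of-the-envelope bound from your single-GPU table below (assuming
the CPU work and the PP-PME split stay as they are), the best a second GPU
can do is remove the "Wait GPU local" time:

  best case  ~ 27037 s / (27037 s - 6888 s wait)  ~ 1.34x
  observed   ~ 27037 s / 22399 s                  ~ 1.21x

so with the current balance you have already collected most of what extra
GPU hardware can give you.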
> As you can see, this test was made on the same node, so networking is not
> a factor. Can the performance be improved by, say, 50% when two GPUs are
> used for a general task? If yes, how?
>
> >Indeed, as Richard pointed out, I was asking for *full* logs, these
> >summaries can't tell much, the table above the summary entitled "R E A
> >L C Y C L E A N D T I M E A C C O U N T I N G" as well as
> >other reported information across the log file is what I need to make
> >an assessment of your simulations' performance.
>
> Please see below.
>
> >>However, in your case I suspect that the
> >>bottleneck is multi-threaded scaling on the AMD CPUs and you should
> >>probably decrease the number of threads per MPI rank and share GPUs
> >>between 2-4 ranks.
>
> After testing all three clusters, I found it may NOT be an issue with the
> AMD CPUs.
> The Intel CPUs have the SAME scaling issue.
>
> However, I am curious how you justify the setup of 2-4 ranks sharing
> GPUs? Can you please explain it a bit more?
>
NUMA effects on multi-socket AMD processors are particularly severe; the
way GROMACS uses OpenMP is not well suited to them. Using a rank (or two)
per socket greatly reduces those effects, but introduces different
algorithmic overhead from the need to do domain decomposition (DD) and to
communicate explicitly between ranks. (You can see the latter in your .log
file snippets below.) Also, that means the parcel of PP work each rank has
available to give to the GPU is smaller, which is the opposite of what you
want for GPU performance and/or scaling. We are working on a general
solution for this and lots of related issues in the post-5.0 space, but
there is a very hard limit imposed by the need to amortize the cost of
CPU-GPU transfer by having lots of PP work available to do.
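For concreteness, a one-rank-per-GPU launch on your 8-core node with two
GPUs could look like the following sketch (topol.tpr is a placeholder, and
I am assuming the GPUs have IDs 0 and 1):

  mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -pin on -s topol.tpr

(or the equivalent mpirun -np 2 mdrun ... with a real-MPI build). Each of
the two ranks drives one GPU with four OpenMP threads, and -pin on keeps
the threads pinned to cores, which matters a lot on AMD.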
> >You could try running
> >mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
> >but I suspect this won't help because your scaling issue
>
> Your guess is correct, but why is that? It is worse: the more nodes are
> involved in a task, the worse the performance.
>
>
> >> in my
> >>experience even reaction field runs don't scale across nodes with 10G
> >>ethernet if you have more than 4-6 ranks per node trying to
> >>communicate (let alone with PME).
>
> What does "let alone with PME" mean? How would I do that? With mdrun?
> I do know "mdrun -npme" specifies the number of PME processes.
>
If using PME (rather than reaction field), the demands on the network are
more severe.
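(-npme only matters for multi-rank runs; a sketch with invented rank counts
would be

  mpirun -np 16 mdrun -npme 4 -s topol.tpr

which dedicates 4 of the 16 ranks to the PME mesh work. On a single node
with GPUs it is usually not what you want.)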
> Thank you.
>
> Dwey
>
>
>
> ### One GPU ####
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
>
> -----------------------------------------------------------------------------
> Neighbor search 1 8 100001 431.817 13863.390 1.6
> Launch GPU ops. 1 8 5000001 472.906 15182.556 1.7
> Force 1 8 5000001 1328.611 42654.785 4.9
> PME mesh 1 8 5000001 11561.327 371174.090 42.8
> Wait GPU local 1 8 5000001 6888.008 221138.111 25.5
> NB X/F buffer ops. 1 8 9900001 1216.499 39055.455 4.5
> Write traj. 1 8 1030 12.741 409.039 0.0
> Update 1 8 5000001 1696.358 54461.226 6.3
> Constraints 1 8 5000001 1969.726 63237.647 7.3
> Rest 1 1458.820 46835.133 5.4
>
> -----------------------------------------------------------------------------
> Total 1 27036.812 868011.431 100.0
>
> -----------------------------------------------------------------------------
>
> -----------------------------------------------------------------------------
> PME spread/gather 1 8 10000002 6975.086 223933.739 25.8
> PME 3D-FFT 1 8 10000002 3928.259 126115.976 14.5
> PME solve 1 8 5000001 636.488 20434.327 2.4
>
> -----------------------------------------------------------------------------
> GPU timings
>
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
>
> -----------------------------------------------------------------------------
> Pair list H2D 100001 43.435 0.434 0.2
> X / q H2D 5000001 567.168 0.113 2.8
> Nonbonded F kernel 4000000 14174.316 3.544 70.8
> Nonbonded F+ene k. 900000 4314.438 4.794 21.5
> Nonbonded F+ene+prune k. 100001 572.370 5.724 2.9
> F D2H 5000001 358.120 0.072 1.8
>
> -----------------------------------------------------------------------------
> Total 20029.846 4.006 100.0
>
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> performance loss, consider using a shorter cut-off and a finer PME
> grid.
>
This note needs to be addressed before maximum throughput is achieved, and
before the question of scaling is worth considering. Ideally, "Wait GPU
local" should be nearly zero, achieved as the note suggests. Note that the
CPU-side launch + force + PME mesh + wait times add up to roughly the GPU
total: the "wait" is time the CPU sits idle because the GPU still has work
left. But much of the information needed is higher up the log file, and the
whole question is constrained by things like rvdw.
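As an illustration only (your actual cut-offs and grid spacing are not in
these snippets, so the numbers below are invented), shifting work from the
GPU to the CPU would mean .mdp changes along the lines of

  rcoulomb       = 0.9   ; shorter cut-off -> less GPU non-bonded work
  rvdw           = 0.9   ; normally kept equal to rcoulomb with the Verlet scheme
  fourierspacing = 0.11  ; finer PME grid -> more CPU PME work to compensate

Whether shortening rvdw is acceptable depends on your force field, which is
exactly why rvdw constrains the whole question.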
> Core t (s) Wall t (s) (%)
> Time: 216205.510 27036.812 799.7
> 7h30:36
> (ns/day) (hour/ns)
> Performance: 31.956 0.751
>
>
> ### Two GPUs #####
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
>
> -----------------------------------------------------------------------------
> Domain decomp. 2 4 100000 339.490 10900.191 1.5
> DD comm. load 2 4 49989 0.262 8.410 0.0
> Neighbor search 2 4 100001 481.583 15462.464 2.2
> Launch GPU ops. 2 4 10000002 579.283 18599.358 2.6
> Comm. coord. 2 4 4900000 523.096 16795.351 2.3
> Force 2 4 5000001 1545.584 49624.951 6.9
> Wait + Comm. F 2 4 5000001 821.740 26384.083 3.7
> PME mesh 2 4 5000001 11097.880 356326.030 49.5
> Wait GPU nonlocal 2 4 5000001 1001.868 32167.550 4.5
> Wait GPU local 2 4 5000001 8.613 276.533 0.0
> NB X/F buffer ops. 2 4 19800002 1061.238 34073.781 4.7
> Write traj. 2 4 1025 5.681 182.419 0.0
> Update 2 4 5000001 1692.233 54333.503 7.6
> Constraints 2 4 5000001 2316.145 74365.788 10.3
> Comm. energies 2 4 1000001 15.802 507.373 0.1
> Rest 2 908.383 29165.963 4.1
>
> -----------------------------------------------------------------------------
> Total 2 22398.880 719173.747 100.0
>
> -----------------------------------------------------------------------------
>
> -----------------------------------------------------------------------------
> PME redist. X/F 2 4 10000002 1519.288 48780.654 6.8
> PME spread/gather 2 4 10000002 5398.693 173338.936 24.1
> PME 3D-FFT 2 4 10000002 2798.482 89852.482 12.5
> PME 3D-FFT Comm. 2 4 10000002 947.033 30406.937 4.2
> PME solve 2 4 5000001 420.667 13506.611 1.9
>
> -----------------------------------------------------------------------------
>
>
Unfortunately you didn't copy the GPU timing section here! Roughly, all the
performance gain you are seeing comes from eliminating most of the
single-GPU "Wait GPU local" term by throwing more hardware at it. To hope
to see some scaling, you'd need to drop the PME mesh time by about a factor
of two (coarser grid, and a compensating increase to rcoulomb), and hope
there is enough PP work that using two GPUs for a single simulation is even
worth considering. Achieving throughput-style scaling by running two
independent simulations on the same node may be all that is practical (but
I don't even know how many atoms you are simulating!).
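A sketch of that throughput approach (file names invented; adjust thread
counts and offsets to your node):

  mdrun -deffnm sim1 -ntomp 4 -gpu_id 0 -pin on -pinoffset 0 &
  mdrun -deffnm sim2 -ntomp 4 -gpu_id 1 -pin on -pinoffset 4 &

Each run gets its own GPU and half the cores, and the pin offsets keep the
two sets of threads on disjoint cores.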
Mark
> Core t (s) Wall t (s) (%)
> Time: 178961.450 22398.880 799.0
> 6h13:18
> (ns/day) (hour/ns)
> Performance: 38.573 0.622
>
>
>
>
>
>