mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)
Szilárd Páll
pall.szilard at gmail.com
Tue Nov 12 16:22:56 CET 2013
As Mark said, please share the *entire* log file. Among other
important things, the result of PP-PME tuning is not included above.
However, I suspect that in this case scaling is strongly affected
by the small size of the system you are simulating.
--
Szilárd
On Sun, Nov 10, 2013 at 5:28 AM, Dwey Kauffman <mpi566 at gmail.com> wrote:
> Hi Szilard,
>
> Thank you very much for your suggestions.
>
>>Actually, I was jumping to conclusions too early, as you mentioned AMD
>>"cluster", I assumed you must have 12-16-core Opteron CPUs. If you
> >>have an 8-core (desktop?) AMD CPU, then you may not need to run more
>>than one rank per GPU.
>
> Yes, we do have independent AMD, AMD Opteron, and Intel Core i7 clusters.
> Every node in all three clusters has at least one GPU card, and I have run
> the same test on each of them.
>
> Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
> 8-core AMD node. Using one GPU, we get ~32 ns/day; using two GPUs, we gain
> little more (~38.5 ns/day), roughly a 20% improvement. Even that is not
> consistent: in some tests I saw only 2-5% more, which really surprised me.
>
> As you can see, this test was run within a single node, so networking is
> not a factor. Can the performance be improved by, say, 50% when two GPUs
> are used for a typical task? If so, how?
>
>>Indeed, as Richard pointed out, I was asking for *full* logs, these
>>summaries can't tell much, the table above the summary entitled "R E A
>>L C Y C L E A N D T I M E A C C O U N T I N G" as well as
>>other reported information across the log file is what I need to make
>>an assessment of your simulations' performance.
>
> Please see below.
>
>>>However, in your case I suspect that the
>>>bottleneck is multi-threaded scaling on the AMD CPUs and you should
>>>probably decrease the number of threads per MPI rank and share GPUs
>>>between 2-4 ranks.
>
> After testing all three clusters, I found it may NOT be an issue with the
> AMD CPUs: the Intel CPUs show the SAME scaling behavior.
>
> However, I am curious how you justify sharing each GPU among 2-4 ranks.
> Could you explain that a bit more?
>
>
>>You could try running
>>mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
>>but I suspect this won't help because your scaling issue
>
> Your guess is correct, but why is that? It is worse, and the more nodes
> involved in a task, the worse the performance gets.
>
>
>>> in my
>>>experience even reaction field runs don't scale across nodes with 10G
>>>ethernet if you have more than 4-6 ranks per node trying to
>>>communicate (let alone with PME).
>
> What does "let alone with PME" mean? How is that done? Via mdrun?
> I do know that "mdrun -npme" specifies the number of PME processes.
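For what it is worth, a sketch of dedicating separate PME ranks with -npme; the rank counts here are illustrative, not tuned, and "topol.tpr" is a placeholder:

```shell
# Sketch: dedicate 2 of 8 MPI ranks to PME, so the long-range electrostatics
# communication is confined to fewer processes (this matters most over slow
# interconnects such as 10G ethernet). The 6:2 split is illustrative.
mpirun -np 8 mdrun -npme 2 -s topol.tpr

# g_tune_pme (shipped with GROMACS 4.6) can search for a good -npme value:
g_tune_pme -np 8 -s topol.tpr
```

Szilárd's point is that PME adds all-to-all communication on top of the point-to-point traffic of reaction-field runs, so it scales across ethernet-connected nodes even more poorly.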
>
> Thank you.
>
> Dwey
>
>
>
> ### One GPU ####
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
> -----------------------------------------------------------------------------
> Neighbor search 1 8 100001 431.817 13863.390 1.6
> Launch GPU ops. 1 8 5000001 472.906 15182.556 1.7
> Force 1 8 5000001 1328.611 42654.785 4.9
> PME mesh 1 8 5000001 11561.327 371174.090 42.8
> Wait GPU local 1 8 5000001 6888.008 221138.111 25.5
> NB X/F buffer ops. 1 8 9900001 1216.499 39055.455 4.5
> Write traj. 1 8 1030 12.741 409.039 0.0
> Update 1 8 5000001 1696.358 54461.226 6.3
> Constraints 1 8 5000001 1969.726 63237.647 7.3
> Rest 1 1458.820 46835.133 5.4
> -----------------------------------------------------------------------------
> Total 1 27036.812 868011.431 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> PME spread/gather 1 8 10000002 6975.086 223933.739 25.8
> PME 3D-FFT 1 8 10000002 3928.259 126115.976 14.5
> PME solve 1 8 5000001 636.488 20434.327 2.4
> -----------------------------------------------------------------------------
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 100001 43.435 0.434 0.2
> X / q H2D 5000001 567.168 0.113 2.8
> Nonbonded F kernel 4000000 14174.316 3.544 70.8
> Nonbonded F+ene k. 900000 4314.438 4.794 21.5
> Nonbonded F+ene+prune k. 100001 572.370 5.724 2.9
> F D2H 5000001 358.120 0.072 1.8
> -----------------------------------------------------------------------------
> Total 20029.846 4.006 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> performance loss, consider using a shorter cut-off and a finer PME
> grid.
>
> Core t (s) Wall t (s) (%)
> Time: 216205.510 27036.812 799.7
> 7h30:36
> (ns/day) (hour/ns)
> Performance: 31.956 0.751
>
>
> ### Two GPUs #####
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
> -----------------------------------------------------------------------------
> Domain decomp. 2 4 100000 339.490 10900.191 1.5
> DD comm. load 2 4 49989 0.262 8.410 0.0
> Neighbor search 2 4 100001 481.583 15462.464 2.2
> Launch GPU ops. 2 4 10000002 579.283 18599.358 2.6
> Comm. coord. 2 4 4900000 523.096 16795.351 2.3
> Force 2 4 5000001 1545.584 49624.951 6.9
> Wait + Comm. F 2 4 5000001 821.740 26384.083 3.7
> PME mesh 2 4 5000001 11097.880 356326.030 49.5
> Wait GPU nonlocal 2 4 5000001 1001.868 32167.550 4.5
> Wait GPU local 2 4 5000001 8.613 276.533 0.0
> NB X/F buffer ops. 2 4 19800002 1061.238 34073.781 4.7
> Write traj. 2 4 1025 5.681 182.419 0.0
> Update 2 4 5000001 1692.233 54333.503 7.6
> Constraints 2 4 5000001 2316.145 74365.788 10.3
> Comm. energies 2 4 1000001 15.802 507.373 0.1
> Rest 2 908.383 29165.963 4.1
> -----------------------------------------------------------------------------
> Total 2 22398.880 719173.747 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> PME redist. X/F 2 4 10000002 1519.288 48780.654 6.8
> PME spread/gather 2 4 10000002 5398.693 173338.936 24.1
> PME 3D-FFT 2 4 10000002 2798.482 89852.482 12.5
> PME 3D-FFT Comm. 2 4 10000002 947.033 30406.937 4.2
> PME solve 2 4 5000001 420.667 13506.611 1.9
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 178961.450 22398.880 799.0
> 6h13:18
> (ns/day) (hour/ns)
> Performance: 38.573 0.622
>
>
>
>
>
>