mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

Mark Abraham mark.j.abraham at gmail.com
Sun Nov 10 11:00:48 CET 2013


On Sun, Nov 10, 2013 at 5:28 AM, Dwey Kauffman <mpi566 at gmail.com> wrote:

> Hi Szilard,
>
>  Thank you very much for your suggestions.
>
> >Actually, I was jumping to conclusions too early: as you mentioned an AMD
> >"cluster", I assumed you must have 12-16-core Opteron CPUs. If you
> >have an 8-core (desktop?) AMD CPU, then you may not need to run more
> >than one rank per GPU.
>
> Yes, we have independent AMD, AMD Opteron, and Intel Core i7 clusters.
> Every node in all three clusters has at least one GPU card. I have run the
> same test on all three clusters.
>
> Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
> node with an 8-core AMD CPU.
> Using one GPU, we get a performance of ~32 ns/day. Using two GPUs, we gain
> not much more (~38.5 ns/day), about 20% more performance. Even that is not
> consistent: in some tests I saw only 2-5% more, which really surprised me.


Neither run had a PP-PME work distribution suitable for the hardware it was
running on (and fixing that for each run requires opposite changes). Adding
a GPU and hoping to see scaling requires that there be proportionately more
GPU work available to do, *and* enough absolute work to do. mdrun tries to
balance this distribution, and reports what it did early in the log file,
which is one of the reasons Szilard asked to see whole log files - please
use a file-sharing service to share them.
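
To make the "opposite changes" concrete: in the one-GPU run the GPU is the
bottleneck, so work should move to the CPU mesh; in the two-GPU run the CPU
PME mesh dominates, so work should move to the GPUs. A sketch with invented
numbers (your actual .mdp settings are not shown here), scaling rcoulomb and
fourierspacing by the same factor so that PME accuracy stays roughly
constant:

  one GPU  (GPU overloaded):      rcoulomb 1.0 -> 0.9,  fourierspacing 0.12 -> 0.108
  two GPUs (CPU mesh overloaded): rcoulomb 1.0 -> 1.2,  fourierspacing 0.12 -> 0.144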

> As you can see, this test was run on the same node, so networking is not
> involved. Can the performance be improved by, say, 50% when two GPUs are
> used for a typical task? If yes, how?
>
> >Indeed, as Richard pointed out, I was asking for *full* logs; these
> >summaries can't tell much. The table above the summary entitled "R E A
> >L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as the
> >other information reported across the log file, is what I need to make
> >an assessment of your simulations' performance.
>
> Please see below.
>
> >>However, in your case I suspect that the
> >>bottleneck is multi-threaded scaling on the AMD CPUs and you should
> >>probably decrease the number of threads per MPI rank and share GPUs
> >>between 2-4 ranks.
>
> After testing all three clusters, I found it may NOT be an issue with the
> AMD CPUs: the Intel CPUs show the SAME scaling issue.
>
> However, I am curious how you justify the setup of 2-4 ranks sharing
> GPUs. Can you please explain it a bit more?
>

NUMA effects on multi-socket AMD processors are particularly severe; the
way GROMACS uses OpenMP is not well suited to them. Using a rank (or two)
per socket will greatly reduce those effects, but it introduces a different
kind of algorithmic overhead from the need to do domain decomposition (DD)
and to communicate explicitly between ranks. (You can see the latter in
your .log file snippets below.) It also means the parcel of PP work a rank
has available to give to the GPU is smaller, which is the opposite of what
you'd like for GPU performance and/or scaling. We are working on a general
solution for this and lots of related issues in the post-5.0 space, but
there is a very hard limitation imposed by the need to amortize the cost of
CPU-GPU transfer by having lots of PP work available to do.
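
For an 8-core node with two GPUs, the rank/thread layouts being discussed
look like this with the built-in thread-MPI (the mpirun form Szilard quotes
below is the equivalent for an external-MPI build); these are illustrations,
not a recommendation for your system:

  mdrun -ntmpi 2 -ntomp 4 -gpu_id 01    # one PP rank per GPU, 4 OpenMP threads each
  mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011  # two PP ranks share each GPU, 2 threads each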

> >You could try running
> >mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
> >but I suspect this won't help because your scaling issue
>
> Your guess is correct, but why is that? It is worse: the more nodes are
> involved in a task, the worse the performance.
>
>
> >> in my
> >>experience even reaction field runs don't scale across nodes with 10G
> >>ethernet if you have more than 4-6 ranks per node trying to
> >>communicate (let alone with PME).
>
> What does "let alone with PME" mean? How would I do that - with mdrun?
> I do know that "mdrun -npme" can specify the number of PME processes.
>

If using PME (rather than RF), network demands are more severe.
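
On the -npme question: with an MPI build running across nodes you can
dedicate some ranks to the PME mesh only, which confines the 3D-FFT
all-to-all traffic to those ranks. A hypothetical example (rank counts
invented; the binary name depends on your build):

  mpirun -np 16 mdrun -npme 4    # 12 PP ranks, 4 PME-only ranks

g_tune_pme can search for a good -npme split automatically.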


> Thank you.
>
> Dwey
>
>
>
> ### One GPU ####
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
>
> -----------------------------------------------------------------------------
>  Neighbor search        1    8     100001     431.817    13863.390     1.6
>  Launch GPU ops.        1    8    5000001     472.906    15182.556     1.7
>  Force                  1    8    5000001    1328.611    42654.785     4.9
>  PME mesh               1    8    5000001   11561.327   371174.090    42.8
>  Wait GPU local         1    8    5000001    6888.008   221138.111    25.5
>  NB X/F buffer ops.     1    8    9900001    1216.499    39055.455     4.5
>  Write traj.            1    8       1030      12.741      409.039     0.0
>  Update                 1    8    5000001    1696.358    54461.226     6.3
>  Constraints            1    8    5000001    1969.726    63237.647     7.3
>  Rest                   1                    1458.820    46835.133     5.4
>
> -----------------------------------------------------------------------------
>  Total                  1                   27036.812   868011.431   100.0
>
> -----------------------------------------------------------------------------
>
> -----------------------------------------------------------------------------
>  PME spread/gather      1    8   10000002    6975.086   223933.739    25.8
>  PME 3D-FFT             1    8   10000002    3928.259   126115.976    14.5
>  PME solve              1    8    5000001     636.488    20434.327     2.4
>
> -----------------------------------------------------------------------------
>  GPU timings
>
> -----------------------------------------------------------------------------
>  Computing:                         Count  Wall t (s)      ms/step       %
>
> -----------------------------------------------------------------------------
>  Pair list H2D                     100001      43.435        0.434     0.2
>  X / q H2D                        5000001     567.168        0.113     2.8
>  Nonbonded F kernel               4000000   14174.316        3.544    70.8
>  Nonbonded F+ene k.                900000    4314.438        4.794    21.5
>  Nonbonded F+ene+prune k.          100001     572.370        5.724     2.9
>  F D2H                            5000001     358.120        0.072     1.8
>
> -----------------------------------------------------------------------------
>  Total                                      20029.846        4.006   100.0
>
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>       performance loss, consider using a shorter cut-off and a finer PME
> grid.
>

This note needs to be addressed before you reach maximum single-GPU
throughput, and before the question of scaling is even worth considering.
Ideally, "Wait GPU local" should be nearly zero, which is achieved by
shifting work as the note suggests. Note that the CPU-side launch + force +
PME mesh + wait times add up to roughly the GPU total - i.e. the CPU spends
about a quarter of each step waiting for the GPU. But much of the
information needed is higher up in the log file, and the whole question is
constrained by things like rvdw.
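
For scale: the GPU nonbonded time is 4.006 ms/step against 2.578 ms of CPU
force work, and the pair work scales roughly with the cut-off volume, so
balancing would mean scaling rcoulomb by roughly (2.578/4.006)^(1/3) ~ 0.86
(somewhat less aggressively in practice, because the finer grid also adds
CPU PME work). A sketch with assumed starting values (your real ones are in
the part of the log that wasn't posted):

  rcoulomb        = 0.86   ; assumed 1.0 in the original .mdp
  fourierspacing  = 0.103  ; assumed 0.12, scaled by the same factor
  ; rcoulomb cannot drop below rvdw, which is why rvdw constrains this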


>                Core t (s)   Wall t (s)        (%)
>        Time:   216205.510    27036.812      799.7
>                          7h30:36
>                  (ns/day)    (hour/ns)
> Performance:       31.956        0.751
>
>
> ### Two GPUs #####
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
>
> -----------------------------------------------------------------------------
>  Domain decomp.         2    4     100000     339.490    10900.191     1.5
>  DD comm. load          2    4      49989       0.262        8.410     0.0
>  Neighbor search        2    4     100001     481.583    15462.464     2.2
>  Launch GPU ops.        2    4   10000002     579.283    18599.358     2.6
>  Comm. coord.           2    4    4900000     523.096    16795.351     2.3
>  Force                  2    4    5000001    1545.584    49624.951     6.9
>  Wait + Comm. F         2    4    5000001     821.740    26384.083     3.7
>  PME mesh               2    4    5000001   11097.880   356326.030    49.5
>  Wait GPU nonlocal      2    4    5000001    1001.868    32167.550     4.5
>  Wait GPU local         2    4    5000001       8.613      276.533     0.0
>  NB X/F buffer ops.     2    4   19800002    1061.238    34073.781     4.7
>  Write traj.            2    4       1025       5.681      182.419     0.0
>  Update                 2    4    5000001    1692.233    54333.503     7.6
>  Constraints            2    4    5000001    2316.145    74365.788    10.3
>  Comm. energies         2    4    1000001      15.802      507.373     0.1
>  Rest                   2                     908.383    29165.963     4.1
>
> -----------------------------------------------------------------------------
>  Total                  2                   22398.880   719173.747   100.0
>
> -----------------------------------------------------------------------------
>
> -----------------------------------------------------------------------------
>  PME redist. X/F        2    4   10000002    1519.288    48780.654     6.8
>  PME spread/gather      2    4   10000002    5398.693   173338.936    24.1
>  PME 3D-FFT             2    4   10000002    2798.482    89852.482    12.5
>  PME 3D-FFT Comm.       2    4   10000002     947.033    30406.937     4.2
>  PME solve              2    4    5000001     420.667    13506.611     1.9
>
> -----------------------------------------------------------------------------
>
>
Unfortunately you didn't copy the GPU timing table here! Roughly, all the
performance gain you are seeing comes from eliminating most of the
single-GPU "Wait GPU local" term by throwing more hardware at it. To hope
to see real scaling, you'd need to drop the PME mesh time by about a factor
of two (coarser grid, with a compensating increase to rcoulomb), and hope
there is enough PP work that using two GPUs for a single simulation is even
worth considering. Achieving throughput-style scaling by running two
independent simulations on the same node may be all that is practical (but
I don't even know how many atoms you are simulating!).
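
If you try the two-independent-runs route on that node, a minimal sketch
(thread counts, pinning offsets and file names are placeholders for an
8-core, two-GPU box):

  mdrun -ntmpi 1 -ntomp 4 -gpu_id 0 -pin on -pinoffset 0 -deffnm sim0 &
  mdrun -ntmpi 1 -ntomp 4 -gpu_id 1 -pin on -pinoffset 4 -deffnm sim1 &
  wait
  # each run gets one GPU and four cores; -pinoffset keeps the two runs on
  # disjoint cores so they do not compete for the same ones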

Mark


>                Core t (s)   Wall t (s)        (%)
>        Time:   178961.450    22398.880      799.0
>                          6h13:18
>                  (ns/day)    (hour/ns)
> Performance:       38.573        0.622
>
>
>
>
>
>


