mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)
Szilárd Páll
pall.szilard at gmail.com
Tue Nov 12 16:22:56 CET 2013
As Mark said, please share the *entire* log file. Among other
important things, the result of PP-PME tuning is not included above.
However, I suspect that in this case scaling is strongly affected
by the small size of the system you are simulating.
--
Szilárd
On Sun, Nov 10, 2013 at 5:28 AM, Dwey Kauffman <mpi566 at gmail.com> wrote:
> Hi Szilard,
>
> Thank you very much for your suggestions.
>
>>Actually, I was jumping to conclusions too early, as you mentioned AMD
>>"cluster", I assumed you must have 12-16-core Opteron CPUs. If you
> >>have an 8-core (desktop?) AMD CPU, then you may not need to run more
>>than one rank per GPU.
>
> Yes, we do have independent AMD, AMD Opteron, and Intel Core i7 clusters.
> Every node in all three clusters has at least one GPU card, and I have run
> the same test on each of them.
>
> Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same
> 8-core AMD node. Using one GPU, we get ~32 ns/day; using two GPUs, we gain
> little more (~38.5 ns/day), roughly a 20% improvement. Even that is not
> consistent: in some tests I saw only 2-5% more, which really surprised me.
>
> As you can see, this test was run within a single node, so networking is
> not a factor. Can the performance be improved by, say, 50% when two GPUs
> are used for a typical task? If so, how?
>
>>Indeed, as Richard pointed out, I was asking for *full* logs, these
>>summaries can't tell much, the table above the summary entitled "R E A
>>L C Y C L E A N D T I M E A C C O U N T I N G" as well as
>>other reported information across the log file is what I need to make
>>an assessment of your simulations' performance.
>
> Please see below.
>
>>>However, in your case I suspect that the
>>>bottleneck is multi-threaded scaling on the AMD CPUs and you should
>>>probably decrease the number of threads per MPI rank and share GPUs
>>>between 2-4 ranks.
>
> After testing all three clusters, I found it may NOT be an issue with the
> AMD CPUs: the Intel CPUs show the SAME scaling behavior.
>
> However, I am curious how you justify sharing each GPU among 2-4 ranks.
> Could you explain that a bit more?
>
>
>>You could try running
>>mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
>>but I suspect this won't help because your scaling issue
>
> Your guess is correct, but why is that? It is worse, and the more nodes
> involved in a task, the worse the performance gets.
>
>
>>> in my
>>>experience even reaction field runs don't scale across nodes with 10G
>>>ethernet if you have more than 4-6 ranks per node trying to
>>>communicate (let alone with PME).
>
> What does "let alone with PME" mean? How is that done? Via mdrun?
> I do know that "mdrun -npme" specifies the number of PME processes.
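For what it is worth, a sketch of dedicating separate PME ranks with -npme; the rank counts here are illustrative, not tuned, and "topol.tpr" is a placeholder:

```shell
# Sketch: dedicate 2 of 8 MPI ranks to PME, so the long-range electrostatics
# communication is confined to fewer processes (this matters most over slow
# interconnects such as 10G ethernet). The 6:2 split is illustrative.
mpirun -np 8 mdrun -npme 2 -s topol.tpr

# g_tune_pme (shipped with GROMACS 4.6) can search for a good -npme value:
g_tune_pme -np 8 -s topol.tpr
```

Szilárd's point is that PME adds all-to-all communication on top of the point-to-point traffic of reaction-field runs, so it scales across ethernet-connected nodes even more poorly.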
>
> Thank you.
>
> Dwey
>
>
>
> ### One GPU ####
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
> -----------------------------------------------------------------------------
> Neighbor search 1 8 100001 431.817 13863.390 1.6
> Launch GPU ops. 1 8 5000001 472.906 15182.556 1.7
> Force 1 8 5000001 1328.611 42654.785 4.9
> PME mesh 1 8 5000001 11561.327 371174.090 42.8
> Wait GPU local 1 8 5000001 6888.008 221138.111 25.5
> NB X/F buffer ops. 1 8 9900001 1216.499 39055.455 4.5
> Write traj. 1 8 1030 12.741 409.039 0.0
> Update 1 8 5000001 1696.358 54461.226 6.3
> Constraints 1 8 5000001 1969.726 63237.647 7.3
> Rest 1 1458.820 46835.133 5.4
> -----------------------------------------------------------------------------
> Total 1 27036.812 868011.431 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> PME spread/gather 1 8 10000002 6975.086 223933.739 25.8
> PME 3D-FFT 1 8 10000002 3928.259 126115.976 14.5
> PME solve 1 8 5000001 636.488 20434.327 2.4
> -----------------------------------------------------------------------------
> GPU timings
> -----------------------------------------------------------------------------
> Computing: Count Wall t (s) ms/step %
> -----------------------------------------------------------------------------
> Pair list H2D 100001 43.435 0.434 0.2
> X / q H2D 5000001 567.168 0.113 2.8
> Nonbonded F kernel 4000000 14174.316 3.544 70.8
> Nonbonded F+ene k. 900000 4314.438 4.794 21.5
> Nonbonded F+ene+prune k. 100001 572.370 5.724 2.9
> F D2H 5000001 358.120 0.072 1.8
> -----------------------------------------------------------------------------
> Total 20029.846 4.006 100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> performance loss, consider using a shorter cut-off and a finer PME
> grid.
>
> Core t (s) Wall t (s) (%)
> Time: 216205.510 27036.812 799.7
> 7h30:36
> (ns/day) (hour/ns)
> Performance: 31.956 0.751
>
>
> ### Two GPUs #####
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Th. Count Wall t (s) G-Cycles %
> -----------------------------------------------------------------------------
> Domain decomp. 2 4 100000 339.490 10900.191 1.5
> DD comm. load 2 4 49989 0.262 8.410 0.0
> Neighbor search 2 4 100001 481.583 15462.464 2.2
> Launch GPU ops. 2 4 10000002 579.283 18599.358 2.6
> Comm. coord. 2 4 4900000 523.096 16795.351 2.3
> Force 2 4 5000001 1545.584 49624.951 6.9
> Wait + Comm. F 2 4 5000001 821.740 26384.083 3.7
> PME mesh 2 4 5000001 11097.880 356326.030 49.5
> Wait GPU nonlocal 2 4 5000001 1001.868 32167.550 4.5
> Wait GPU local 2 4 5000001 8.613 276.533 0.0
> NB X/F buffer ops. 2 4 19800002 1061.238 34073.781 4.7
> Write traj. 2 4 1025 5.681 182.419 0.0
> Update 2 4 5000001 1692.233 54333.503 7.6
> Constraints 2 4 5000001 2316.145 74365.788 10.3
> Comm. energies 2 4 1000001 15.802 507.373 0.1
> Rest 2 908.383 29165.963 4.1
> -----------------------------------------------------------------------------
> Total 2 22398.880 719173.747 100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> PME redist. X/F 2 4 10000002 1519.288 48780.654 6.8
> PME spread/gather 2 4 10000002 5398.693 173338.936 24.1
> PME 3D-FFT 2 4 10000002 2798.482 89852.482 12.5
> PME 3D-FFT Comm. 2 4 10000002 947.033 30406.937 4.2
> PME solve 2 4 5000001 420.667 13506.611 1.9
> -----------------------------------------------------------------------------
>
> Core t (s) Wall t (s) (%)
> Time: 178961.450 22398.880 799.0
> 6h13:18
> (ns/day) (hour/ns)
> Performance: 38.573 0.622
>
>
>
>
>
>