[gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

Szilárd Páll pall.szilard at gmail.com
Mon Aug 25 15:44:40 CEST 2014


On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>
>> Hi,
>>
>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUk&ithint=file%2clog
>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog
>>
>> These are two log files. The first run used 64 CPU cores (64 / 16 = 4
>> nodes) and 4 nodes × 2 = 8 GPUs; the second used 512 CPU cores and no GPUs.
>> In the 64-core log file, we find that in the  R E A L   C Y C L E   A N D
>> T I M E   A C C O U N T I N G  table, the total wall time is the sum of
>> every line, that is 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that
>> when the CPUs are doing PME, the GPUs are doing nothing. That's why we say
>> they are working sequentially.
>>
>
> Please note that "sequential" means "one phase after another." Your log
> files don't show the timing breakdown for the GPUs, which is distinct from
> showing that the GPUs ran and then the CPUs ran (which I don't think the
> code even permits!). References to "CUDA 8x8 kernels" do show the GPU was
> active. There was an issue with mdrun not always being able to gather and
> publish the GPU timing results; I don't recall the conditions (Szilard
> might remember), but it might be fixed in a later release.

It is a limitation (well, I'd say borderline bug) in CUDA that if you
have multiple work-queues (= streams), reliable timing using the CUDA
built-in mechanisms is impossible. There may be a way to work around
this, but that won't happen in the current versions. What's important
is to observe the wait time on the CPU side; and of course, if the OP
is profiling, this is not an issue.
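
For instance, a quick way to check whether the GPUs are actually idle,
without touching the CUDA timers, is to sample utilization from the
outside while mdrun runs (a sketch; binary names, file names, and the
run length are illustrative):

  # sample GPU utilization once per second during the run
  nvidia-smi -l 1
  # or, for a single-node thread-MPI run, collect a kernel timeline
  nvprof mdrun -s topol.tpr -nsteps 1000

If nvidia-smi shows sustained non-zero utilization while the CPUs are
in the PME mesh part, the two are overlapping rather than running
sequentially.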

> In any case, you
> should probably be doing performance optimization on a GROMACS version that
> isn't a year old.
>
> I gather that you didn't actually observe the GPUs idle - e.g. with a
> performance monitoring tool? Otherwise, and in the absence of a description
> of your simulation system, I'd say that log file looks somewhere between
> normal and optimal. For the record, for better performance, you should
> probably be following the advice of the install guide and not compiling
> FFTW with AVX support, and using one of the five gcc minor versions
> released since 4.4 ;-)

And besides avoiding ancient gcc versions, I suggest using CUDA 5.5
(which you can use because you have the version 5.5 driver, as I can
see in your log file).

Additionally, I suggest avoiding MKL and using FFTW instead. For the
grid sizes of interest, all benchmarks I have done in the past showed
considerably higher performance with FFTW. The same goes for icc, but
feel free to benchmark, and please report back if you find the opposite.
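
Concretely, the build recipe suggested above would look roughly like
this (a sketch only; paths, compiler versions, and the -j count are
illustrative):

  # FFTW in single precision, with SSE2 but without AVX,
  # as the GROMACS install guide recommends
  ./configure --enable-single --enable-sse2 --prefix=$HOME/sw/fftw-3.3
  make -j 8 && make install
  # point GROMACS at that FFTW, with a recent gcc and the GPU code enabled
  CC=gcc-4.8 CXX=g++-4.8 cmake .. -DGMX_GPU=ON \
      -DGMX_FFT_LIBRARY=fftw3 -DCMAKE_PREFIX_PATH=$HOME/sw/fftw-3.3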

>> As for the 512-core log file, the total wall time is approximately the sum
>> of PME mesh and PME wait for PP. We think this is because the PME-dedicated
>> nodes finished early, and the total wall time is the time spent on the PP
>> nodes, so the time spent on PME is hidden within it.
>
>
> Yes, using an offload model makes it awkward to report CPU timings, because
> there are two kinds of CPU ranks. The total of the "Wall t" column adds up
> to twice the total time taken (which is noted explicitly in more recent
> mdrun versions). By design, the PME ranks do finish early, as you know from
> Figure 3.16 of the manual. As you can see in the table, the PP ranks spend
> 26% of their time waiting for the results from the PME ranks, and this is
> the origin of the note (above the table) that you might want to balance
> things better.
>
> Mark
>
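On the balancing point quoted above: rather than guessing, you can let
g_tune_pme search for the best number of PME-only ranks for a given
total rank count (a sketch; the file name and rank count are
illustrative):

  g_tune_pme -np 512 -s topol.tpr -launch

It runs a series of short benchmarks with different PP/PME splits and
then launches the fastest setup it found.
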
> On 8/23/2014 9:30 PM, Mark Abraham wrote:
>>
>>> On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When we used 2 GPU nodes (each of which has 2 CPUs and 2 GPUs) to do an
>>>> mdrun (with no PME-dedicated node), we noticed that when the CPUs are
>>>> doing PME, the GPUs are idle,
>>>>
>>>
>>> That could happen if the GPU completes its work too fast, in which case
>>> the
>>> end of the log file will probably scream about imbalance.
>>>
>>>> that is, they are doing their work sequentially.
>>>
>>>
>>> Highly unlikely, not least because the code is written to overlap the
>>> short-range work on the GPU with everything else on the CPU. What's your
>>> evidence for *sequential* rather than *imbalanced*?
>>>
>>>
>>>> Is it supposed to be so?
>>>>
>>>
>>> No, but without seeing your .log files, mdrun command lines and knowing
>>> about your hardware, there's nothing we can say.
>>>
>>>
>>>> Is it for the same reason that GPUs on PME-dedicated nodes won't be used
>>>> during a run, as you said before?
>>>>
>>>
>>> Why would you suppose that? I said GPUs do work from the PP ranks on their
>>> node. That's true here.
>>>
>>>> So if we want to exploit our hardware, we should map the PP and PME ranks
>>>> manually, right? Say, use one node as a PME-dedicated node and leave the
>>>> GPUs on that node idle, and use two nodes to do the other work. What do
>>>> you think of this arrangement?
>>>>
>>> Probably a terrible idea. You should identify the cause of the imbalance,
>>> and fix that.
>>>
>>> Mark
>>>
>>>
>>>> Theo
>>>>
>>>> On 8/22/2014 7:20 PM, Mark Abraham wrote:
>>>>
>>>>  Hi,
>>>>>
>>>>> Because no work will be sent to them. The GPU implementation can
>>>>> accelerate
>>>>> domains from PP ranks on their node, but with an MPMD setup that uses
>>>>> dedicated PME nodes, there will be no PP ranks on nodes that have been
>>>>> set
>>>>> up with only PME ranks. The two offload models (PP work -> GPU; PME work
>>>>> ->
>>>>> CPU subset) do not work well together, as I said.
>>>>>
>>>>> One can devise various schemes in 4.6/5.0 that could use those GPUs, but
>>>>> they either require
>>>>> * each node does both PME and PP work (thus limiting scaling because of
>>>>> the
>>>>> all-to-all for PME, and perhaps making poor use of locality on
>>>>> multi-socket
>>>>> nodes), or
>>>>> * that all nodes have PP ranks, but only some have PME ranks, and the
>>>>> nodes
>>>>> map their GPUs to PP ranks in a way that is different depending on
>>>>> whether
>>>>> PME ranks are present (which could work well, but relies on the DD
>>>>> load-balancer recognizing and taking advantage of the faster progress of
>>>>> the PP ranks that have better GPU support, and requires you to get your
>>>>> hands very dirty laying out PP and PME ranks onto hardware that will later
>>>>> match
>>>>> the requirements of the DD load balancer, and probably that you balance
>>>>> PP-PME load manually)
>>>>>
>>>>> I do not recommend the last approach, because of its complexity.
>>>>>
>>>>> Clearly there are design decisions to improve. Work is underway.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mark
>>>>>
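As an aside, the first of the schemes quoted above is easy to try on a
single node, e.g. with 16 cores and 2 GPUs (a sketch; rank and thread
counts are illustrative and need benchmarking):

  # 4 thread-MPI ranks with 4 OpenMP threads each; 2 ranks do PME only,
  # and the -gpu_id string maps the two PP ranks to GPUs 0 and 1
  mdrun -ntmpi 4 -ntomp 4 -npme 2 -gpu_id 01 -s topol.tpr
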
>>>>>
>>>>> On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Could you tell me why, when we use GPU-CPU nodes as PME-dedicated
>>>>>> nodes, the GPUs on such nodes will be idle?
>>>>>>
>>>>>>
>>>>>> Theo
>>>>>>
>>>>>> On 8/11/2014 9:36 PM, Mark Abraham wrote:
>>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>>> What Carsten said, if running on nodes that have GPUs.
>>>>>>>
>>>>>>> If running on a mixed setup (some nodes with GPU, some not), then
>>>>>>> arranging
>>>>>>> your MPI environment to place PME ranks on CPU-only nodes is probably
>>>>>>> worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
>>>>>>> then
>>>>>>> all your PME ranks, mapped to CPU-only nodes, and then use mdrun
>>>>>>> -ddorder
>>>>>>> pp_pme.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
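With OpenMPI, the placement quoted above could look roughly like this
(a sketch; the hostfile contents, total rank count, and -npme value are
illustrative):

  # hosts.txt lists the GPU nodes first, then the CPU-only nodes;
  # with -ddorder pp_pme the PP ranks come first and thus land on the
  # GPU nodes, and the PME-only ranks land on the CPU-only nodes
  mpirun -np 20 -hostfile hosts.txt \
      mdrun_mpi -npme 4 -ddorder pp_pme -s topol.tpr
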
>>>>>>> On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> This is the information about our cluster. Could you give us some
>>>>>>>> advice with regard to it, so that we can make GROMACS run faster on
>>>>>>>> our system?
>>>>>>>>
>>>>>>>> Each CPU node has 2 CPUs and each GPU node has 2 CPUs and 2 Nvidia
>>>>>>>> K20M
>>>>>>>>
>>>>>>>>
>>>>>>>> Device name (device type): specifications [number]
>>>>>>>> * CPU Node (Intel H2216JFFKR): 2× Intel Xeon E5-2670 (8 cores,
>>>>>>>>   2.6 GHz, 20 MB cache, 8.0 GT/s), 64 GB (8×8 GB) ECC Registered
>>>>>>>>   DDR3-1600 Samsung memory [332]
>>>>>>>> * Fat Node (Intel H2216WPFKR): 2× Intel Xeon E5-2670 (as above),
>>>>>>>>   256 GB (16×16 GB) ECC Registered DDR3-1600 Samsung memory [20]
>>>>>>>> * GPU Node (Intel R2208GZ4GC): 2× Intel Xeon E5-2670 (as above),
>>>>>>>>   64 GB (8×8 GB) ECC Registered DDR3-1600 Samsung memory [50]
>>>>>>>> * MIC Node (Intel R2208GZ4GC): 2× Intel Xeon E5-2670 (as above),
>>>>>>>>   64 GB (8×8 GB) ECC Registered DDR3-1600 Samsung memory [5]
>>>>>>>> * Computing network switch (Mellanox InfiniBand FDR core switch):
>>>>>>>>   648× FDR core switch MSX6536-10R, Mellanox Unified Fabric
>>>>>>>>   Manager [1]
>>>>>>>> * Computing network switch (Mellanox SX1036 40 Gb switch): 36× 40 Gb
>>>>>>>>   Ethernet switch SX1036, 36× QSFP interface [1]
>>>>>>>> * Management network switch (Extreme Summit X440-48t-10G 2-layer
>>>>>>>>   switch): 48× 1 Gb switch Summit X440-48t-10G, authorized by
>>>>>>>>   ExtremeXOS [9]
>>>>>>>> * Management network switch (Extreme Summit X650-24X 3-layer switch):
>>>>>>>>   24× 10 Gb 3-layer Ethernet switch Summit X650-24X, authorized by
>>>>>>>>   ExtremeXOS [1]
>>>>>>>> * Parallel storage (DDN parallel storage system): DDN SFA12K storage
>>>>>>>>   system [1]
>>>>>>>> * GPU (GPU accelerator): NVIDIA Tesla Kepler K20M [70]
>>>>>>>> * MIC (MIC): Intel Xeon Phi 5110P Knights Corner [10]
>>>>>>>> * 40 Gb Ethernet card (MCX314A-BCBT): Mellanox ConnectX-3 chip 40 Gb
>>>>>>>>   Ethernet card, 2× 40 Gb Ethernet ports, enough QSFP cables [16]
>>>>>>>> * SSD (Intel SSD910): Intel SSD910 disk, 400 GB, PCIe [80]
>>>>>>>>
>>>>>>>> On 8/10/2014 5:50 AM, Mark Abraham wrote:
>>>>>>>>
>>>>>>>>> That's not what I said.... "You can set..."
>>>>>>>>>
>>>>>>>>> -npme behaves the same whether or not GPUs are in use. Using separate
>>>>>>>>> ranks for PME caters to minimizing the cost of the all-to-all
>>>>>>>>> communication of the 3D FFT. That's still relevant when using GPUs,
>>>>>>>>> but if separate PME ranks are used, any GPUs on nodes that only have
>>>>>>>>> PME ranks are left idle. The most effective approach depends
>>>>>>>>> critically on the hardware and simulation setup, and whether you pay
>>>>>>>>> money for your hardware.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> You mean that whether we use GPU acceleration or not, -npme is just
>>>>>>>>>> a reference? Why can't we set it to an exact value?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/9/2014 5:14 AM, Mark Abraham wrote:
>>>>>>>>>>
>>>>>>>>>>> You can set the number of PME-only ranks with -npme. Whether it's
>>>>>>>>>>> useful is another matter :-) The CPU-based PME offload and the
>>>>>>>>>>> GPU-based PP offload do not combine very well.
>>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Can we set the number manually with -npme when using GPU
>>>>>>>>>>>> acceleration?