[gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Theodore Si
sjyzhxw at gmail.com
Mon Aug 25 05:01:27 CEST 2014
Hi,
https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUk&ithint=file%2clog
https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog
These are two log files. The first run used 64 CPU cores (64 / 16 = 4 nodes)
and 4 nodes × 2 = 8 GPUs; the second used 512 CPU cores and no GPUs.
Looking at the 64-core log, we find that in the "R E A L   C Y C L E   A N D
T I M E   A C C O U N T I N G" table, the total wall time is the sum of every
line, i.e. 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that while the
CPUs are doing PME, the GPUs are doing nothing; that is why we say they are
working sequentially.
As for the 512-core log, the total wall time is approximately the sum of "PME
mesh" and "PME wait for PP". We think this is because the PME-dedicated nodes
finished early, so the total wall time is the time spent on the PP nodes, and
the time spent on PME is hidden inside it.
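To make the check concrete, here is a small Python sketch of what we did with
that table. The category names and numbers below are placeholders standing in
for the values in our log, not the real data:

```python
# Sketch of the consistency check on the "R E A L   C Y C L E   A N D
# T I M E   A C C O U N T I N G" table.  Category names and times are
# placeholders, not the actual values from our 64-core log.
wall_times = {
    "Domain decomp.": 2.201,
    "Comm. coord.":   0.082,
    "Force":         25.000,
    "PME mesh":       9.297,
    "Write traj.":    1.150,
}
total_reported = 37.730  # the "Total" line from the log (placeholder)

total_summed = sum(wall_times.values())

# If the categories sum to the reported total, the accounting shows no
# overlap between them -- which is what made us suspect the CPU PME work
# and the GPU work ran sequentially rather than concurrently.
print(abs(total_summed - total_reported) < 1e-6)  # prints: True
```

If the activities overlapped, the per-category times would add up to more than
the reported total, which is what we see in the 512-core log instead.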
On 8/23/2014 9:30 PM, Mark Abraham wrote:
> On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si <sjyzhxw at gmail.com> wrote:
>
>> Hi,
>>
>> When we used 2 GPU nodes (each with 2 CPUs and 2 GPUs) to do an mdrun (with
>> no PME-dedicated node), we noticed that while the CPUs are doing PME, the
>> GPUs are idle,
>
> That could happen if the GPU completes its work too fast, in which case the
> end of the log file will probably scream about imbalance.
>
>> that is, they are doing their work sequentially.
>
>
> Highly unlikely, not least because the code is written to overlap the
> short-range work on the GPU with everything else on the CPU. What's your
> evidence for *sequential* rather than *imbalanced*?
>
>
>> Is it supposed to be so?
>
> No, but without seeing your .log files, mdrun command lines and knowing
> about your hardware, there's nothing we can say.
>
>
>> Is it for the same reason that GPUs on PME-dedicated nodes won't be used
>> during a run, as you said before?
>
> Why would you suppose that? I said GPUs do work from the PP ranks on their
> node. That's true here.
>
>> So if we want to exploit our hardware, we should map PP-PME ranks manually,
>> right? Say, use one node as a PME-dedicated node, leave the GPUs on that
>> node idle, and use two nodes to do the other stuff. What do you think of
>> this arrangement?
>>
> Probably a terrible idea. You should identify the cause of the imbalance,
> and fix that.
>
> Mark
>
>
>> Theo
>>
>>
>> On 8/22/2014 7:20 PM, Mark Abraham wrote:
>>
>>> Hi,
>>>
>>> Because no work will be sent to them. The GPU implementation can
>>> accelerate
>>> domains from PP ranks on their node, but with an MPMD setup that uses
>>> dedicated PME nodes, there will be no PP ranks on nodes that have been set
>>> up with only PME ranks. The two offload models (PP work -> GPU; PME work
>>> ->
>>> CPU subset) do not work well together, as I said.
>>>
>>> One can devise various schemes in 4.6/5.0 that could use those GPUs, but
>>> they either require
>>> * that each node does both PME and PP work (thus limiting scaling because
>>> of the all-to-all for PME, and perhaps making poor use of locality on
>>> multi-socket nodes), or
>>> * that all nodes have PP ranks, but only some have PME ranks, and the
>>> nodes map their GPUs to PP ranks in a way that differs depending on
>>> whether PME ranks are present (which could work well, but relies on the
>>> DD load-balancer recognizing and taking advantage of the faster progress
>>> of the PP ranks that have better GPU support, requires that you get your
>>> hands very dirty laying out PP and PME ranks onto hardware in a way that
>>> will later match the requirements of the DD load balancer, and probably
>>> requires that you balance the PP-PME load manually)
>>>
>>> I do not recommend the last approach, because of its complexity.
>>>
>>> Clearly there are design decisions to improve. Work is underway.
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>>
>>> On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>
>>> Hi Mark,
>>>> Could you tell me why, when we use GPU-CPU nodes as PME-dedicated nodes,
>>>> the GPUs on those nodes will be idle?
>>>>
>>>>
>>>> Theo
>>>>
>>>> On 8/11/2014 9:36 PM, Mark Abraham wrote:
>>>>
>>>> Hi,
>>>>> What Carsten said, if running on nodes that have GPUs.
>>>>>
>>>>> If running on a mixed setup (some nodes with GPU, some not), then
>>>>> arranging
>>>>> your MPI environment to place PME ranks on CPU-only nodes is probably
>>>>> worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
>>>>> then
>>>>> all your PME ranks, mapped to CPU-only nodes, and then use mdrun
>>>>> -ddorder
>>>>> pp_pme.
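>>>>> As a rough sketch (host names, rank counts, and the input file name
>>>>> below are made up, and the exact mpirun syntax depends on your MPI
>>>>> library):

```shell
# Hypothetical layout for 4 nodes: 3 GPU nodes running PP ranks and one
# CPU-only node running PME ranks.  Adapt names and counts to your
# scheduler and MPI library.
#
# hosts.txt lists the GPU nodes first, then the CPU-only node, so that
# with "-ddorder pp_pme" (all PP ranks first, then all PME ranks) the PP
# ranks land on the GPU nodes and the PME ranks on the CPU-only node.
mpirun -np 8 --hostfile hosts.txt \
    mdrun_mpi -npme 2 -ddorder pp_pme -ntomp 8 -gpu_id 01 -deffnm topol
```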
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>>> This is the information on our cluster. Could you give us some advice
>>>>>> regarding our cluster so that we can make GMX run faster on our
>>>>>> system?
>>>>>>
>>>>>> Each CPU node has 2 CPUs, and each GPU node has 2 CPUs and 2 NVIDIA
>>>>>> K20M GPUs.
>>>>>>
>>>>>> Device                Model / Specifications                              Number
>>>>>> CPU Node              IntelH2216JFFKR; 2× Intel Xeon E5-2670 (8 cores,      332
>>>>>>                       2.6 GHz, 20 MB cache, 8.0 GT/s); 64 GB (8×8 GB)
>>>>>>                       ECC Registered DDR3-1600 Samsung memory
>>>>>> Fat Node              IntelH2216WPFKR; 2× Intel Xeon E5-2670 (8 cores,       20
>>>>>>                       2.6 GHz, 20 MB cache, 8.0 GT/s); 256 GB (16×16 GB)
>>>>>>                       ECC Registered DDR3-1600 Samsung memory
>>>>>> GPU Node              IntelR2208GZ4GC; 2× Intel Xeon E5-2670 (8 cores,       50
>>>>>>                       2.6 GHz, 20 MB cache, 8.0 GT/s); 64 GB (8×8 GB)
>>>>>>                       ECC Registered DDR3-1600 Samsung memory
>>>>>> MIC Node              IntelR2208GZ4GC; same CPU and memory as GPU node        5
>>>>>> Computing network     Mellanox InfiniBand FDR core switch MSX6536-10R         1
>>>>>>                       (648 ports), Mellanox Unified Fabric Manager
>>>>>> 40Gb Ethernet switch  Mellanox SX1036, 36× 40 Gb QSFP ports                   1
>>>>>> Management network    Extreme Summit X440-48t-10G layer-2 switch,             9
>>>>>>                       48× 1 Gb ports, licensed ExtremeXOS
>>>>>>                       Extreme Summit X650-24X layer-3 switch,                 1
>>>>>>                       24× 10 Gb ports, licensed ExtremeXOS
>>>>>> Parallel storage      DDN SFA12K storage system                               1
>>>>>> GPU                   NVIDIA Tesla K20M accelerator                          70
>>>>>> MIC                   Intel Xeon Phi 5110P (Knights Corner)                  10
>>>>>> 40Gb Ethernet card    Mellanox MCX314A-BCBT (ConnectX-3),                    16
>>>>>>                       2× 40 Gb ports, with QSFP cables
>>>>>> SSD                   Intel SSD 910, 400 GB, PCIe                            80
>>>>>>
>>>>>> On 8/10/2014 5:50 AM, Mark Abraham wrote:
>>>>>>
>>>>>>> That's not what I said.... "You can set..."
>>>>>>>
>>>>>>> -npme behaves the same whether or not GPUs are in use. Using separate
>>>>>>> ranks
>>>>>>> for PME caters to trying to minimize the cost of the all-to-all
>>>>>>> communication of the 3DFFT. That's still relevant when using GPUs, but
>>>>>>> if
>>>>>>> separate PME ranks are used, any GPUs on nodes that only have PME
>>>>>>> ranks
>>>>>>> are
>>>>>>> left idle. The most effective approach depends critically on the
>>>>>>> hardware
>>>>>>> and simulation setup, and whether you pay money for your hardware.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>> You mean that no matter whether we use GPU acceleration or not,
>>>>>>>> -npme is just a hint? Why can't we set it to an exact value?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/9/2014 5:14 AM, Mark Abraham wrote:
>>>>>>>>
>>>>>>>>> You can set the number of PME-only ranks with -npme. Whether it's
>>>>>>>>> useful is another matter :-) The CPU-based PME offload and the
>>>>>>>>> GPU-based PP offload do not combine very well.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Can we set the number manually with -npme when using GPU
>>>>>>>>>> acceleration?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Gromacs Users mailing list
>>>>>>>>>>
>>>>>>>>>> * Please search the archive at http://www.gromacs.org/
>>>>>>>>>> Support/Mailing_Lists/GMX-Users_List before posting!
>>>>>>>>>>
>>>>>>>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>>>>>>
>>>>>>>>>> * For (un)subscribe requests visit
>>>>>>>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
>>>>>>>>>> or
>>>>>>>>>> send a mail to gmx-users-request at gromacs.org.
>>>>>>>>>>
>>>>>>>>>>