[gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Theodore Si
sjyzhxw at gmail.com
Mon Aug 25 05:01:27 CEST 2014
Hi,
https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUk&ithint=file%2clog
https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog
These are two log files. The first run used 64 CPU cores (64 / 16 = 4 nodes)
and 4 nodes × 2 = 8 GPUs; the second used 512 CPU cores and no GPUs.
Looking at the 64-core log, we find that in the "R E A L   C Y C L E   A N D
T I M E   A C C O U N T I N G" table, the total wall time is the sum of every
line, i.e. 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that while the
CPUs are doing PME, the GPUs are doing nothing; that is why we say they are
working sequentially.
As for the 512-core log, the total wall time is approximately the sum of "PME
mesh" and "PME wait for PP". We think this is because the PME-dedicated nodes
finished early, so the total wall time is the time spent on the PP nodes, and
the time spent on PME is hidden inside it.
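To make the check concrete, here is a small Python sketch of what we did with
that table. The category names and numbers below are placeholders standing in
for the values in our log, not the real data:

```python
# Sketch of the consistency check on the "R E A L   C Y C L E   A N D
# T I M E   A C C O U N T I N G" table.  Category names and times are
# placeholders, not the actual values from our 64-core log.
wall_times = {
    "Domain decomp.": 2.201,
    "Comm. coord.":   0.082,
    "Force":         25.000,
    "PME mesh":       9.297,
    "Write traj.":    1.150,
}
total_reported = 37.730  # the "Total" line from the log (placeholder)

total_summed = sum(wall_times.values())

# If the categories sum to the reported total, the accounting shows no
# overlap between them -- which is what made us suspect the CPU PME work
# and the GPU work ran sequentially rather than concurrently.
print(abs(total_summed - total_reported) < 1e-6)  # prints: True
```

If the activities overlapped, the per-category times would add up to more than
the reported total, which is what we see in the 512-core log instead.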
On 8/23/2014 9:30 PM, Mark Abraham wrote:
> On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si <sjyzhxw at gmail.com> wrote:
>
>> Hi,
>>
>> When we used 2 GPU nodes (each with 2 CPUs and 2 GPUs) to do an mdrun (with
>> no PME-dedicated node), we noticed that while the CPUs are doing PME, the
>> GPUs are idle,
>
> That could happen if the GPU completes its work too fast, in which case the
> end of the log file will probably scream about imbalance.
>
>> that is, they are doing their work sequentially.
>
>
> Highly unlikely, not least because the code is written to overlap the
> short-range work on the GPU with everything else on the CPU. What's your
> evidence for *sequential* rather than *imbalanced*?
>
>
>> Is it supposed to be so?
>
> No, but without seeing your .log files, mdrun command lines and knowing
> about your hardware, there's nothing we can say.
>
>
>> Is it for the same reason that GPUs on PME-dedicated nodes won't be used
>> during a run, as you said before?
>
> Why would you suppose that? I said GPUs do work from the PP ranks on their
> node. That's true here.
>
>> So if we want to exploit our hardware, we should map PP-PME ranks manually,
>> right? Say, use one node as a PME-dedicated node, leave the GPUs on that
>> node idle, and use two nodes to do the other stuff. What do you think of
>> this arrangement?
>>
> Probably a terrible idea. You should identify the cause of the imbalance,
> and fix that.
>
> Mark
>
>
>> Theo
>>
>>
>> On 8/22/2014 7:20 PM, Mark Abraham wrote:
>>
>>> Hi,
>>>
>>> Because no work will be sent to them. The GPU implementation can
>>> accelerate
>>> domains from PP ranks on their node, but with an MPMD setup that uses
>>> dedicated PME nodes, there will be no PP ranks on nodes that have been set
>>> up with only PME ranks. The two offload models (PP work -> GPU; PME work
>>> ->
>>> CPU subset) do not work well together, as I said.
>>>
>>> One can devise various schemes in 4.6/5.0 that could use those GPUs, but
>>> they either require
>>> * that each node does both PME and PP work (thus limiting scaling because
>>> of the all-to-all for PME, and perhaps making poor use of locality on
>>> multi-socket nodes), or
>>> * that all nodes have PP ranks, but only some have PME ranks, and the
>>> nodes map their GPUs to PP ranks in a way that differs depending on
>>> whether PME ranks are present (which could work well, but relies on the
>>> DD load-balancer recognizing and taking advantage of the faster progress
>>> of the PP ranks that have better GPU support, requires that you get your
>>> hands very dirty laying out PP and PME ranks onto hardware in a way that
>>> will later match the requirements of the DD load balancer, and probably
>>> requires that you balance the PP-PME load manually)
>>>
>>> I do not recommend the last approach, because of its complexity.
>>>
>>> Clearly there are design decisions to improve. Work is underway.
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>>
>>> On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>
>>> Hi Mark,
>>>> Could you tell me why, when we use GPU-CPU nodes as PME-dedicated nodes,
>>>> the GPUs on those nodes will be idle?
>>>>
>>>>
>>>> Theo
>>>>
>>>> On 8/11/2014 9:36 PM, Mark Abraham wrote:
>>>>
>>>> Hi,
>>>>> What Carsten said, if running on nodes that have GPUs.
>>>>>
>>>>> If running on a mixed setup (some nodes with GPU, some not), then
>>>>> arranging
>>>>> your MPI environment to place PME ranks on CPU-only nodes is probably
>>>>> worthwhile. For example, all your PP ranks first, mapped to GPU nodes,
>>>>> then
>>>>> all your PME ranks, mapped to CPU-only nodes, and then use mdrun
>>>>> -ddorder
>>>>> pp_pme.
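>>>>> As a rough sketch (host names, rank counts, and the input file name
>>>>> below are made up, and the exact mpirun syntax depends on your MPI
>>>>> library):

```shell
# Hypothetical layout for 4 nodes: 3 GPU nodes running PP ranks and one
# CPU-only node running PME ranks.  Adapt names and counts to your
# scheduler and MPI library.
#
# hosts.txt lists the GPU nodes first, then the CPU-only node, so that
# with "-ddorder pp_pme" (all PP ranks first, then all PME ranks) the PP
# ranks land on the GPU nodes and the PME ranks on the CPU-only node.
mpirun -np 8 --hostfile hosts.txt \
    mdrun_mpi -npme 2 -ddorder pp_pme -ntomp 8 -gpu_id 01 -deffnm topol
```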
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>>> This is the information on our cluster. Could you give us some advice
>>>>>> regarding our cluster so that we can make GMX run faster on our
>>>>>> system?
>>>>>>
>>>>>> Each CPU node has 2 CPUs, and each GPU node has 2 CPUs and 2 NVIDIA
>>>>>> K20M GPUs.
>>>>>>
>>>>>> Device                Model / Specifications                              Number
>>>>>> CPU Node              IntelH2216JFFKR; 2× Intel Xeon E5-2670 (8 cores,      332
>>>>>>                       2.6 GHz, 20 MB cache, 8.0 GT/s); 64 GB (8×8 GB)
>>>>>>                       ECC Registered DDR3-1600 Samsung memory
>>>>>> Fat Node              IntelH2216WPFKR; 2× Intel Xeon E5-2670 (8 cores,       20
>>>>>>                       2.6 GHz, 20 MB cache, 8.0 GT/s); 256 GB (16×16 GB)
>>>>>>                       ECC Registered DDR3-1600 Samsung memory
>>>>>> GPU Node              IntelR2208GZ4GC; 2× Intel Xeon E5-2670 (8 cores,       50
>>>>>>                       2.6 GHz, 20 MB cache, 8.0 GT/s); 64 GB (8×8 GB)
>>>>>>                       ECC Registered DDR3-1600 Samsung memory
>>>>>> MIC Node              IntelR2208GZ4GC; same CPU and memory as GPU node        5
>>>>>> Computing network     Mellanox InfiniBand FDR core switch MSX6536-10R         1
>>>>>>                       (648 ports), Mellanox Unified Fabric Manager
>>>>>> 40Gb Ethernet switch  Mellanox SX1036, 36× 40 Gb QSFP ports                   1
>>>>>> Management network    Extreme Summit X440-48t-10G layer-2 switch,             9
>>>>>>                       48× 1 Gb ports, licensed ExtremeXOS
>>>>>>                       Extreme Summit X650-24X layer-3 switch,                 1
>>>>>>                       24× 10 Gb ports, licensed ExtremeXOS
>>>>>> Parallel storage      DDN SFA12K storage system                               1
>>>>>> GPU                   NVIDIA Tesla K20M accelerator                          70
>>>>>> MIC                   Intel Xeon Phi 5110P (Knights Corner)                  10
>>>>>> 40Gb Ethernet card    Mellanox MCX314A-BCBT (ConnectX-3),                    16
>>>>>>                       2× 40 Gb ports, with QSFP cables
>>>>>> SSD                   Intel SSD 910, 400 GB, PCIe                            80
>>>>>>
>>>>>> On 8/10/2014 5:50 AM, Mark Abraham wrote:
>>>>>>
>>>>>>> That's not what I said.... "You can set..."
>>>>>>>
>>>>>>> -npme behaves the same whether or not GPUs are in use. Using separate
>>>>>>> ranks
>>>>>>> for PME caters to trying to minimize the cost of the all-to-all
>>>>>>> communication of the 3DFFT. That's still relevant when using GPUs, but
>>>>>>> if
>>>>>>> separate PME ranks are used, any GPUs on nodes that only have PME
>>>>>>> ranks
>>>>>>> are
>>>>>>> left idle. The most effective approach depends critically on the
>>>>>>> hardware
>>>>>>> and simulation setup, and whether you pay money for your hardware.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>> You mean that no matter whether we use GPU acceleration or not,
>>>>>>>> -npme is just a hint? Why can't we set it to an exact value?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/9/2014 5:14 AM, Mark Abraham wrote:
>>>>>>>>
>>>>>>>>> You can set the number of PME-only ranks with -npme. Whether it's
>>>>>>>>> useful is another matter :-) The CPU-based PME offload and the
>>>>>>>>> GPU-based PP offload do not combine very well.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Can we set the number manually with -npme when using GPU
>>>>>>>>>> acceleration?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Gromacs Users mailing list
>>>>>>>>>>
>>>>>>>>>> * Please search the archive at http://www.gromacs.org/
>>>>>>>>>> Support/Mailing_Lists/GMX-Users_List before posting!
>>>>>>>>>>
>>>>>>>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>>>>>>
>>>>>>>>>> * For (un)subscribe requests visit
>>>>>>>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
>>>>>>>>>> or
>>>>>>>>>> send a mail to gmx-users-request at gromacs.org.
>>>>>>>>>>
>>>>>>>>>>