[gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

Szilárd Páll pall.szilard at gmail.com
Tue Aug 26 13:29:58 CEST 2014


On Tue, Aug 26, 2014 at 7:51 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
> Hi  Szilárd,
>
> But CUDA 5.5 won't work with icc 14, right?

Sure, but I don't see how gcc 4.4 + CUDA 5.0 is superior to [a recent
compiler that nvcc supports] + CUDA 5.5.

Additionally, as I said before, gcc 4.8 will almost certainly outperform icc.
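
For example, a configure line along these lines (just a sketch; the
compiler names and CUDA path are placeholders for wherever gcc 4.8 and
CUDA 5.5 live on your cluster) should do:

  cmake .. -DCMAKE_C_COMPILER=gcc-4.8 -DCMAKE_CXX_COMPILER=g++-4.8 \
           -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/opt/cuda-5.5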

Cheers,
--
Szilárd

> It only works with icc 12.1 unless a header of CUDA 5.5 is modified.
>
> Theo
>
>
> On 8/25/2014 9:44 PM, Szilárd Páll wrote:
>>
>> On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham <mark.j.abraham at gmail.com>
>> wrote:
>>>
>>> On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUk&ithint=file%2clog
>>>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog
>>>>
>>>> These are 2 log files. The first one uses 64 CPU cores (64 / 16 = 4
>>>> nodes) and 4 nodes * 2 = 8 GPUs, and the second uses 512 CPU cores and
>>>> no GPUs. When we look at the 64-core log file, we find that in the
>>>> "R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G" table,
>>>> the total wall time is the sum of every line, that is
>>>> 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that when the CPUs
>>>> are doing PME, the GPUs are doing nothing. That's why we say they are
>>>> working sequentially.
>>>>
>>> Please note that "sequential" means "one phase after another." Your log
>>> files don't show the timing breakdown for the GPUs, which is distinct
>>> from
>>> showing that the GPUs ran and then the CPUs ran (which I don't think the
>>> code even permits!). References to "CUDA 8x8 kernels" do show the GPU was
>>> active. There was an issue with mdrun not always being able to gather and
>>> publish the GPU timing results; I don't recall the conditions (Szilard
>>> might remember), but it might be fixed in a later release.
>>
>> It is a limitation (well, I'd say borderline bug) in CUDA that if you
>> have multiple work-queues (=streams), reliable timing using the CUDA
>> built-in mechanisms is impossible. There may be a way to work around
>> this, but that won't happen in the current versions. What's important
>> is to observe the wait time on the CPU side; and of course, if the OP
>> is profiling, this is not an issue.
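>>
>> For example (just a sketch; the binary and file names are
>> placeholders), one can get a per-kernel GPU trace for a short run with
>> nvprof from the CUDA toolkit:
>>
>>   nvprof --print-gpu-trace mdrun -s topol.tpr
>>
>> which shows directly whether the nonbonded kernels overlap with the
>> CPU work, independently of the log-file timers.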
>>
>>> In any case, you
>>> should probably be doing performance optimization on a GROMACS version
>>> that
>>> isn't a year old.
>>>
>>> I gather that you didn't actually observe the GPUs idle - e.g. with a
>>> performance monitoring tool? Otherwise, and in the absence of a
>>> description
>>> of your simulation system, I'd say that log file looks somewhere between
>>> normal and optimal. For the record, for better performance, you should
>>> probably be following the advice of the install guide and not compiling
>>> FFTW with AVX support, and using one of the five gcc minor versions
>>> released since 4.4 ;-)
>>
>> And besides avoiding ancient gcc versions, I suggest using CUDA 5.5
>> (which you can use because you have a version 5.5 driver, as I can see
>> in your log file).
>>
>> Additionally, I suggest avoiding MKL and using FFTW instead. For the
>> grid sizes of interest to us, all benchmarks I did in the past showed
>> considerably higher FFTW performance. The same goes for icc, but feel
>> free to benchmark and please report back if you find the opposite.
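>>
>> If you don't have a good FFTW build around, a sketch of a cmake
>> invocation that lets GROMACS download and build its own FFTW (SSE
>> enabled, AVX disabled, as the install guide recommends) would be
>> something like:
>>
>>   cmake .. -DGMX_FFT_LIBRARY=fftw3 -DGMX_BUILD_OWN_FFTW=ON
>>
>> Adjust the option spelling to the GROMACS version you end up building.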
>>
>>>> As for the 512-core log file, the total wall time is approximately the
>>>> sum of PME mesh and PME wait for PP. We think this is because the
>>>> PME-dedicated nodes finished early, and the total wall time is the time
>>>> spent on the PP nodes, so the time spent on PME is hidden within it.
>>>
>>>
>>> Yes, using an offload model makes it awkward to report CPU timings,
>>> because there are two kinds of CPU ranks. The total of the "Wall t"
>>> column adds up to twice the total time taken (which is noted explicitly
>>> in more recent mdrun versions). By design, the PME ranks do finish
>>> early, as you know from Figure 3.16 of the manual. As you can see in
>>> the table, the PP ranks spend 26% of their time waiting for the results
>>> from the PME ranks, and this is the origin of the note (above the
>>> table) that you might want to balance things better.
>>>
>>> Mark
>>>
>>> On 8/23/2014 9:30 PM, Mark Abraham wrote:
>>>>>
>>>>> On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> When we used 2 GPU nodes (each has 2 CPUs and 2 GPUs) to do an mdrun
>>>>>> (with no PME-dedicated node), we noticed that when the CPUs are doing
>>>>>> PME, the GPUs are idle,
>>>>>>
>>>>> That could happen if the GPU completes its work too fast, in which case
>>>>> the
>>>>> end of the log file will probably scream about imbalance.
>>>>>
>>>>>> that is, they are doing their work sequentially.
>>>>>
>>>>>
>>>>> Highly unlikely, not least because the code is written to overlap the
>>>>> short-range work on the GPU with everything else on the CPU. What's
>>>>> your
>>>>> evidence for *sequential* rather than *imbalanced*?
>>>>>
>>>>>
>>>>>>   Is it supposed to be so?
>>>>>
>>>>> No, but without seeing your .log files, mdrun command lines, and knowing
>>>>> about your hardware, there's nothing we can say.
>>>>>
>>>>>
>>>>>>   Is it the same reason that GPUs on PME-dedicated nodes won't be
>>>>>> used during a run, as you said before?
>>>>>
>>>>> Why would you suppose that? I said GPUs do work from the PP ranks on
>>>>> their node. That's true here.
>>>>>
>>>>>> So if we want to exploit our hardware, we should map PP-PME ranks
>>>>>> manually, right? Say, use one node as a PME-dedicated node and leave
>>>>>> the GPUs on that node idle, and use two nodes to do the other stuff.
>>>>>> What do you think about this arrangement?
>>>>>>
>>>>> Probably a terrible idea. You should identify the cause of the
>>>>> imbalance, and fix that.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>   Theo
>>>>>>
>>>>>>
>>>>>> On 8/22/2014 7:20 PM, Mark Abraham wrote:
>>>>>>
>>>>>>   Hi,
>>>>>>>
>>>>>>> Because no work will be sent to them. The GPU implementation can
>>>>>>> accelerate domains from PP ranks on their node, but with an MPMD
>>>>>>> setup that uses dedicated PME nodes, there will be no PP ranks on
>>>>>>> nodes that have been set up with only PME ranks. The two offload
>>>>>>> models (PP work -> GPU; PME work -> CPU subset) do not work well
>>>>>>> together, as I said.
>>>>>>>
>>>>>>> One can devise various schemes in 4.6/5.0 that could use those GPUs,
>>>>>>> but they require either
>>>>>>> * that each node does both PME and PP work (thus limiting scaling
>>>>>>> because of the all-to-all for PME, and perhaps making poor use of
>>>>>>> locality on multi-socket nodes), or
>>>>>>> * that all nodes have PP ranks, but only some have PME ranks, and the
>>>>>>> nodes map their GPUs to PP ranks in a way that differs depending on
>>>>>>> whether PME ranks are present (which could work well, but relies on
>>>>>>> the DD load-balancer recognizing and taking advantage of the faster
>>>>>>> progress of the PP ranks that have better GPU support, requires that
>>>>>>> you get your hands very dirty laying out PP and PME ranks onto
>>>>>>> hardware so that they later match the requirements of the DD load
>>>>>>> balancer, and probably requires that you balance the PP-PME load
>>>>>>> manually).
>>>>>>>
>>>>>>> I do not recommend the last approach, because of its complexity.
>>>>>>>
>>>>>>> Clearly there are design decisions to improve. Work is underway.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>    Hi Mark,
>>>>>>>>
>>>>>>>> Could you tell me why, when we use GPU-CPU nodes as PME-dedicated
>>>>>>>> nodes, the GPUs on such nodes will be idle?
>>>>>>>>
>>>>>>>>
>>>>>>>> Theo
>>>>>>>>
>>>>>>>> On 8/11/2014 9:36 PM, Mark Abraham wrote:
>>>>>>>>
>>>>>>>>    Hi,
>>>>>>>>
>>>>>>>>> What Carsten said, if running on nodes that have GPUs.
>>>>>>>>>
>>>>>>>>> If running on a mixed setup (some nodes with GPU, some not), then
>>>>>>>>> arranging your MPI environment to place PME ranks on CPU-only nodes
>>>>>>>>> is probably worthwhile. For example, all your PP ranks first, mapped
>>>>>>>>> to GPU nodes, then all your PME ranks, mapped to CPU-only nodes, and
>>>>>>>>> then use mdrun -ddorder pp_pme.
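>>>>>>>>>
>>>>>>>>> A minimal sketch of that (host names, slot counts, and rank counts
>>>>>>>>> are placeholders, not taken from your setup), assuming an Open
>>>>>>>>> MPI-style hostfile that lists the GPU nodes before the CPU-only
>>>>>>>>> node and fills hosts in order:
>>>>>>>>>
>>>>>>>>>   # hostfile: GPU nodes first (get the PP ranks), CPU-only node last
>>>>>>>>>   gpu-node01 slots=16
>>>>>>>>>   gpu-node02 slots=16
>>>>>>>>>   cpu-node01 slots=16
>>>>>>>>>
>>>>>>>>>   mpirun -np 48 --hostfile hostfile \
>>>>>>>>>       mdrun_mpi -s topol.tpr -npme 16 -ddorder pp_pme
>>>>>>>>>
>>>>>>>>> With -ddorder pp_pme the PP ranks come first in the rank numbering,
>>>>>>>>> so the 16 PME-only ranks land on the CPU-only node.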
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>>
>>>>>>>>>> This is the information about our cluster. Could you give us some
>>>>>>>>>> advice regarding it, so that we can make GMX run faster on our
>>>>>>>>>> system?
>>>>>>>>>>
>>>>>>>>>> Each CPU node has 2 CPUs, and each GPU node has 2 CPUs and 2 Nvidia
>>>>>>>>>> K20M GPUs.
>>>>>>>>>>
>>>>>>>>>> Device Name / Device Type / Specifications / Number:
>>>>>>>>>>
>>>>>>>>>> - CPU Node (IntelH2216JFFKR), 332: 2×Intel Xeon E5-2670 (8 cores,
>>>>>>>>>>   2.6GHz, 20MB cache, 8.0GT); 64GB (8×8GB) ECC Registered DDR3
>>>>>>>>>>   1600MHz Samsung memory
>>>>>>>>>> - Fat Node (IntelH2216WPFKR), 20: 2×Intel Xeon E5-2670 (8 cores,
>>>>>>>>>>   2.6GHz, 20MB cache, 8.0GT); 256GB (16×16GB) ECC Registered DDR3
>>>>>>>>>>   1600MHz Samsung memory
>>>>>>>>>> - GPU Node (IntelR2208GZ4GC), 50: 2×Intel Xeon E5-2670 (8 cores,
>>>>>>>>>>   2.6GHz, 20MB cache, 8.0GT); 64GB (8×8GB) ECC Registered DDR3
>>>>>>>>>>   1600MHz Samsung memory
>>>>>>>>>> - MIC Node (IntelR2208GZ4GC), 5: 2×Intel Xeon E5-2670 (8 cores,
>>>>>>>>>>   2.6GHz, 20MB cache, 8.0GT); 64GB (8×8GB) ECC Registered DDR3
>>>>>>>>>>   1600MHz Samsung memory
>>>>>>>>>> - Computing Network Switch, 1: Mellanox InfiniBand FDR core switch,
>>>>>>>>>>   648-port MSX6536-10R, Mellanox Unified Fabric Manager
>>>>>>>>>> - Computing Network Switch, 1: Mellanox SX1036 40Gb Ethernet switch,
>>>>>>>>>>   36× QSFP interfaces
>>>>>>>>>> - Management Network Switch, 9: Extreme Summit X440-48t-10G 2-layer
>>>>>>>>>>   switch, 48× 1Gb ports, ExtremeXOS
>>>>>>>>>> - Management Network Switch, 1: Extreme Summit X650-24X 3-layer
>>>>>>>>>>   switch, 24× 10Gb ports, ExtremeXOS
>>>>>>>>>> - Parallel Storage, 1: DDN SFA12K parallel storage system
>>>>>>>>>> - GPU Accelerator, 70: NVIDIA Tesla Kepler K20M
>>>>>>>>>> - MIC, 10: Intel Xeon Phi 5110P Knights Corner
>>>>>>>>>> - 40Gb Ethernet Card, 16: Mellanox MCX314A-BCBT (ConnectX-3), 2× 40Gb
>>>>>>>>>>   Ethernet ports, with QSFP cables
>>>>>>>>>> - SSD, 80: Intel SSD910, 400GB, PCIe
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/10/2014 5:50 AM, Mark Abraham wrote:
>>>>>>>>>>
>>>>>>>>>>> That's not what I said... "You can set..."
>>>>>>>>>>>
>>>>>>>>>>> -npme behaves the same whether or not GPUs are in use. Using
>>>>>>>>>>> separate ranks for PME caters to trying to minimize the cost of
>>>>>>>>>>> the all-to-all communication of the 3DFFT. That's still relevant
>>>>>>>>>>> when using GPUs, but if separate PME ranks are used, any GPUs on
>>>>>>>>>>> nodes that only have PME ranks are left idle. The most effective
>>>>>>>>>>> approach depends critically on the hardware and simulation setup,
>>>>>>>>>>> and whether you pay money for your hardware.
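>>>>>>>>>>>
>>>>>>>>>>> As a sketch (the rank counts are only illustrative), something like
>>>>>>>>>>>
>>>>>>>>>>>   mpirun -np 64 mdrun_mpi -npme 16
>>>>>>>>>>>
>>>>>>>>>>> gives 48 PP ranks and 16 PME-only ranks; -npme 0 forces every rank
>>>>>>>>>>> to do both PP and PME work, and -npme -1 (the default) lets mdrun
>>>>>>>>>>> guess a split.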
>>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>      Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> You mean that whether or not we use GPU acceleration, -npme is
>>>>>>>>>>>> just a reference? Why can't we set it to an exact value?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/9/2014 5:14 AM, Mark Abraham wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You can set the number of PME-only ranks with -npme. Whether
>>>>>>>>>>>>> it's useful is another matter :-) The CPU-based PME offload and
>>>>>>>>>>>>> the GPU-based PP offload do not combine very well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can we set the number manually with -npme when using GPU
>>>>>>>>>>>>>> acceleration?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a
> mail to gmx-users-request at gromacs.org.


More information about the gromacs.org_gmx-users mailing list