[gmx-users] GPU and aux power supply
Alex
nedomacho at gmail.com
Thu Jul 2 23:55:46 CEST 2015
Here's for the CPU-only run:
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time     Giga-Cycles
                     Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    4      25001      17.536       168.352   0.6
 Force                  1    4    1000001    1047.980     10061.133  37.5
 PME mesh               1    4    1000001    1661.611     15952.292  59.4
 NB X/F buffer ops.     1    4    1975001      30.176       289.700   1.1
 COM pull force         1    4    1000001       7.282        69.909   0.3
 Write traj.            1    4         64       0.402         3.860   0.0
 Update                 1    4    1000001      19.559       187.772   0.7
 Rest                                          13.141       126.156   0.5
-----------------------------------------------------------------------------
 Total                                       2797.685     26859.173 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1    4    2000002     318.488      3057.640  11.4
 PME 3D-FFT             1    4    2000002    1091.863     10482.433  39.0
 PME solve Elec         1    4    1000001     247.867      2379.646   8.9
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    11193.860     2797.685      400.1
                                   46:37
                 (ns/day)    (hour/ns)
Performance:       30.883        0.777
Finished mdrun on rank 0 Thu Jul 2 17:06:08 2015
On Thu, Jul 2, 2015 at 2:47 PM, Alex <nedomacho at gmail.com> wrote:
> Szilárd,
>
> I was wrong. When I run with the GPU and use -ntomp 4, I get 400% CPU
> utilization and that yields about 83 ns/day. When I do -ntomp 4 -nb cpu, I
> get 1600% CPU utilization and similar performance.
> However, when I run -nt 4 -nb cpu, I get 400% CPU utilization and it is
> slower. I am doing a short test and will send the stats later on.
>
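For concreteness, these are the three invocations being compared (a minimal sketch; the -deffnm name is hypothetical, the flags are standard GROMACS 5.x mdrun options):

    # Nonbondeds offloaded to the GPU, 4 OpenMP threads:
    gmx mdrun -deffnm topol -ntomp 4
    # Nonbondeds forced onto the CPU, 4 OpenMP threads per thread-MPI rank:
    gmx mdrun -deffnm topol -ntomp 4 -nb cpu
    # Nonbondeds on the CPU, 4 threads in total:
    gmx mdrun -deffnm topol -nt 4 -nb cpu

The difference in CPU utilization follows from the flags: -nt caps the total thread count, while -ntomp only sets the OpenMP threads per rank and lets mdrun start as many thread-MPI ranks as the hardware allows.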
> The stats from the GPU-accelerated run (-ntomp 4) are below. Pretty poor
> CPU-GPU sync here, actually. I will post the log for the CPU-only run once
> it finishes.
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank, each using 4 OpenMP threads
>
>  Computing:          Num   Num       Call    Wall time     Giga-Cycles
>                      Ranks Threads   Count      (s)       total sum     %
> -----------------------------------------------------------------------------
>  Neighbor search        1    4     2500001    1881.311     18061.525   1.8
>  Launch GPU ops.        1    4   100000001    4713.584     45252.759   4.5
>  Force                  1    4   100000001   66892.607    642202.401  63.5
>  PME mesh               1    4   100000001   25192.879    241864.204  23.9
>  Wait GPU local         1    4   100000001     869.481      8347.456   0.8
>  NB X/F buffer ops.     1    4   197500001    2014.227     19337.585   1.9
>  COM pull force         1    4   100000001     704.950      6767.871   0.7
>  Write traj.            1    4        6118      15.348       147.345   0.0
>  Update                 1    4   100000001    1747.965     16781.332   1.7
>  Rest                                        1364.705      13101.849   1.3
> -----------------------------------------------------------------------------
>  Total                                     105397.057    1011864.328 100.0
> -----------------------------------------------------------------------------
>  Breakdown of PME mesh computation
> -----------------------------------------------------------------------------
>  PME spread/gather      1    4   200000002   12874.626    123602.829  12.2
>  PME 3D-FFT             1    4   200000002    9285.345     89143.948   8.8
>  PME solve Elec         1    4   100000001    2746.973     26372.313   2.6
> -----------------------------------------------------------------------------
>
>  GPU timings
> -----------------------------------------------------------------------------
>  Computing:                     Count   Wall t (s)     ms/step       %
> -----------------------------------------------------------------------------
>  Pair list H2D              2500001      124.145        0.050      0.4
>  X / q H2D                100000001     2089.623        0.021      6.0
>  Nonbonded F kernel        97000000    30164.146        0.311     86.2
>  Nonbonded F+ene k.          500000      227.896        0.456      0.7
>  Nonbonded F+prune k.       2000000      708.250        0.354      2.0
>  Nonbonded F+ene+prune k.    500001      223.082        0.446      0.6
>  F D2H                    100000001     1465.277        0.015      4.2
> -----------------------------------------------------------------------------
>  Total                                 35002.419        0.350    100.0
> -----------------------------------------------------------------------------
>
> Force evaluation time GPU/CPU: 0.350 ms/0.921 ms = 0.380
> For optimal performance this ratio should be close to 1!
>
> NOTE: The GPU has >25% less load than the CPU. This imbalance causes
>       performance loss.
>
>                Core t (s)   Wall t (s)        (%)
>        Time:   421720.882   105397.057      400.1
>                             1d05h16:37
>                  (ns/day)    (hour/ns)
> Performance:       81.976        0.293
> Finished mdrun on rank 0 Thu Jul 2 02:29:57 2015
>
>
> On Thu, Jul 2, 2015 at 7:57 AM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
>> I'm curious under what conditions you get such an exceptional speedup.
>> Can you share your input files and/or log files?
>>
>> --
>> Szilárd
>>
>> On Thu, Jul 2, 2015 at 2:18 AM, Alex <nedomacho at gmail.com> wrote:
>>
>>> Yup, about a 7-8x difference between runs with and without GPU
>>> acceleration, not making this up: I had 11 ns/day and now get ~80-87
>>> ns/day (the numbers vary a bit). I've been getting a similar boost on our
>>> GPU-accelerated cluster node (dual i7 CPUs, 8 cores each) with two Tesla
>>> C2075 cards (I direct my simulations to one of them via -gpu_id).
>>> All runs use -ntomp 4, with or without the GPU. The physics is perfectly
>>> acceptable in all cases. So far I have only tested my new box on vacuum
>>> simulations; I am about to run the solvated version (~30K particles).
>>>
>>> Alex
>>>
>>>
>>> On Wed, Jul 1, 2015 at 6:09 PM, Szilárd Páll <pall.szilard at gmail.com>
>>> wrote:
>>>
>>>> Hmmm, 8x sounds rather high; are you sure you are comparing against
>>>> CPU-only runs that use the proper SIMD-optimized kernels?
>>>>
>>>> Because of the way offload-based acceleration works, the CPU and GPU
>>>> execute concurrently during only part of the run time; as a consequence,
>>>> the GPU sits idle for the rest of it (during integration + constraints).
>>>> You can make use of this idle time by running multiple independent
>>>> simulations concurrently. This can yield serious improvements in terms
>>>> of _aggregate_ simulation performance, especially with small inputs and
>>>> many cores (see slide 51: https://goo.gl/7DnSri).
>>>>
>>>> --
>>>> Szilárd
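A minimal sketch of what the concurrent-runs suggestion above could look like on a single 4-core/1-GPU box (the directory names and the topol.tpr inputs are hypothetical; the flags are standard GROMACS 5.x mdrun options):

    # Two independent simulations sharing GPU 0, pinned to disjoint cores.
    (cd run1 && gmx mdrun -deffnm topol -ntmpi 1 -ntomp 2 -gpu_id 0 -pin on -pinoffset 0) &
    (cd run2 && gmx mdrun -deffnm topol -ntmpi 1 -ntomp 2 -gpu_id 0 -pin on -pinoffset 2) &
    wait

Each run gets half the cores and the GPU is shared between them; whether two 2-thread runs beat one 4-thread run in aggregate ns/day depends on the system size, so it is worth benchmarking rather than assuming.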
>>>>
>>>> On Wed, Jul 1, 2015 at 4:16 AM, Alex <nedomacho at gmail.com> wrote:
>>>>
>>>>> I am happy to say that I am getting an 8-fold increase in simulation
>>>>> speed for $200.
>>>>>
>>>>> An additional question: normally, how many simulations (separate mdruns
>>>>> on separate CPU cores) can be run simultaneously on a single GPU? Say,
>>>>> for simulations of 20-40K particles.
>>>>>
>>>>> The coolers are not even spinning during a single test (mdrun -ntomp 4),
>>>>> and I get massive acceleration. They aren't broken; the card is just
>>>>> cool (small system, ~3K particles).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>> Ah, ok, so you can get one 6-pin directly from the PSU and another from
>>>>> a converted molex connector. That should be just fine, especially as the
>>>>> card should not pull more than ~155W (under heavy graphics load)
>>>>> according to the Tom's Hardware review*, and you are providing 225W max.
>>>>>
>>>>> *
>>>>> http://www.tomshardware.com/reviews/evga-super-super-clocked-gtx-960,4063-3.html
>>>>>
>>>>> --
>>>>> Szilárd
>>>>> On Tue, Jun 30, 2015 at 7:31 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>
>>>>> Well, I don't have one like this. What I have instead is this:
>>>>>
>>>>> 1. A single 6-pin directly from the PSU.
>>>>> 2. A single molex to 6-pin (my PSU does provide one molex).
>>>>> 3. Two 6-pins into a single 8-pin converter going to the card.
>>>>>
>>>>> In other words, I can populate both 6-pins on the 6-to-8-pin converter;
>>>>> I am just not sure about the pinouts in this case.
>>>>>
>>>>> Not good?
>>>>>
>>>>> Alex
>>>>>
>>>>> What I meant is this: http://goo.gl/8o1B5P
>>>>>
>>>>> That is 2x molex -> 8-pin PCI-E. A single molex may not be enough.
>>>>>
>>>>> --
>>>>> Szilárd
>>>>>
>>>>> On Tue, Jun 30, 2015 at 7:10 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>
>>>>> It is a 4-core CPU, single-GPU box, so I doubt I will be running more
>>>>> than one simulation at a time. We will very likely get a different PSU,
>>>>> unless... I do have a molex to 6-pin converter sitting on this very
>>>>> desk. Do you think it will satisfy the card? I just don't know how much
>>>>> a single molex line delivers. If you feel this should work, I'm off to
>>>>> install everything.
>>>>>
>>>>> Thanks a bunch,
>>>>> Alex
>>>>>
>>>>> SP> First of all, unless you run multiple independent simulations on
>>>>> SP> the same GPU, GROMACS runs alone will never get anywhere near the
>>>>> SP> peak power consumption of the GPU.
>>>>>
>>>>> SP> The good news is that NVIDIA has gained some sanity and stopped
>>>>> SP> blocking GeForce GPU info in nvidia-smi - although only for newer
>>>>> SP> cards, but it does work with the 960 if you use a 352.xx driver:
>>>>>
>>>>> SP> +------------------------------------------------------+
>>>>> SP> | NVIDIA-SMI 352.21     Driver Version: 352.21         |
>>>>> SP> |-------------------------------+----------------------+----------------------+
>>>>> SP> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>> SP> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>>> SP> |===============================+======================+======================|
>>>>> SP> |   0  GeForce GTX 960     Off  | 0000:01:00.0      On |                  N/A |
>>>>> SP> |  8%   45C    P5    15W / 130W |  1168MiB /  2044MiB  |     31%      Default |
>>>>> SP> +-------------------------------+----------------------+----------------------+
>>>>>
>>>>>
>>>>>
>>>>> SP> A single 6-pin can deliver 75W, an 8-pin 150W, so in your case the
>>>>> SP> hard limit on what your card can pull is 75W from the PCI-E slot +
>>>>> SP> 150W from the cable = 225W. With a single 6-pin cable you'll only
>>>>> SP> get ~150W max. That can be OK if your card does not pull more power
>>>>> SP> (e.g. the above non-overclocked card would be just fine), but as
>>>>> SP> your card is overclocked, I'm not sure it won't peak above 150W.
>>>>>
>>>>> SP> You can try to get a molex -> PCI-E power cable converter.
>>>>>
>>>>>
>>>>>
>>>>> SP> --
>>>>> SP> Szilárd
>>>>>
>>>>>
>>>>>
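As a side note, one way to check the card's actual draw under an mdrun load is to poll nvidia-smi while the simulation is running (a generic nvidia-smi query, not GROMACS-specific; on GeForce cards it needs a driver recent enough to expose power readings, such as the 352.xx series mentioned above):

    # Report power draw, power limit, GPU utilization and temperature once per second.
    nvidia-smi --query-gpu=power.draw,power.limit,utilization.gpu,temperature.gpu \
               --format=csv -l 1

If the reported draw stays comfortably below the ~150W available from the slot plus a single 6-pin, the overclocked card is unlikely to be starved during compute-only runs.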
>>>>> SP> On Mon, Jun 29, 2015 at 9:56 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>
>>>>> >> Hi all,
>>>>> >>
>>>>> >> I have a bit of a gromacs-unrelated question here, but I think this
>>>>> >> is a better place to ask it than, say, a gaming forum. The Nvidia GTX
>>>>> >> 960 card we got here came with an 8-pin AUX connector on the card
>>>>> >> side, which interfaces _two_ 6-pin connectors to the PSU. It is a
>>>>> >> factory-superclocked card. My 525W PSU can only populate _one_ of
>>>>> >> those 6-pin connectors. The EVGA website states that I need at least
>>>>> >> a 400W PSU, while I have 525W.
>>>>> >>
>>>>> >> At the same time, I have a dedicated high-power PCI-e slot, which on
>>>>> >> the motherboard says "75W PCI-e". Do I need a different PSU to
>>>>> >> populate the AUX power connector completely? Are these runs
>>>>> >> equivalent to drawing max power during gaming?
>>>>> >>
>>>>> >> Thanks!
>>>>> >>
>>>>> >> Alex
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Alex <nedomacho at gmail.com>