[gmx-users] Question about GPU acceleration in GROMACS 5

Tomy van Batis tomyvanbatis at gmail.com
Mon Dec 15 09:57:05 CET 2014


Hi Carsten and Mark

Thank you both for your detailed replies.

Tommy

On Fri, Dec 12, 2014 at 5:13 PM, Mark Abraham <mark.j.abraham at gmail.com>
wrote:
>
> On Fri, Dec 12, 2014 at 3:47 PM, Tomy van Batis <tomyvanbatis at gmail.com>
> wrote:
> >
> > Hi Mark
> >
> > Thanks for your detailed response.
> >
> > I still don't see why the GPU loading is only around 50%, nor why this
> > number increases with the number of CPU cores.
> >
> > For example, when using 1 CPU core (-ntomp 1 in the mdrun), the GPU
> > loading is only about 25-30%, whereas with 4 CPU cores the GPU loading
> > is 55%.
> >
>
> Your system runs like this
>
> 1. Do forces: short-range non-bonded on the GPU (~50% of the time) and
> angles on the CPU (~5% of the time, then ~45% idle)
> 2. Do constraints, updates, neighbour search and housekeeping on the CPU
> (~50% of the time, including data transfer costs) with the GPU idle (~50%)
> 3. Repeat
>
> So adding more CPU cores makes step 2 take less time. You can see this by
> diffing the tables at the ends of the log files. A PME simulation looks
> rather different.
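>
> For example (a sketch, with run_1core.log and run_4core.log standing in
> for your own log files), the timing tables printed at the end of each log
> can be pulled out and compared with something like:
>
>     # show the cycle/time accounting table at the end of each log
>     tail -n 60 run_1core.log
>     tail -n 60 run_4core.log
>
>     # or compare the two directly (bash process substitution)
>     diff <(tail -n 60 run_1core.log) <(tail -n 60 run_4core.log)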
>
>
> > Considering that the work done on the GPU takes a lot longer than the work
> > on the CPU, I believe the GPU loading should not change when changing the
> > number of OpenMP threads. Is this correct, or am I missing something here?
> >
>
> True for 1, but not for 2.
>
>
> > Additionally, I don't really see why the GPU is not loaded to 100%.
> > Is this because of the system size?
> >
>
> As Carsten said, we optimize for throughput, not utilization. On a single
> node, you could do everything on the GPU (as e.g. AMBER 14 does), and then
> utilization would approach its peak (and throughput would go up in that
> case, if someone wrote a big pile of code to make it happen). But that
> implementation would struggle to scale to more nodes with current hardware
> technology, and is tough to make work well with multiple GPUs per node
> (some work in progress, but focused on one GPU).
>
> Mark
>
>
> > Tommy
> >
> >
> >
> > > Hi,
> > >
> > > Only the short-ranged non-bonded work is offloaded to the GPU, but
> > > that's almost all the force-based work you are doing. So it is entirely
> > > unsurprising that the work done on the GPU takes a lot longer than it
> > > does on the CPU. That warning is aimed at the more typical PME-based
> > > simulation where the long-ranged part is done on the CPU, and now there
> > > is load to balance. Running constraints+update happens only on the CPU,
> > > which is always a bottleneck, and worse in your case.
> > >
> > > Ideally, we'd share some load that your simulations are doing solely on
> > > the GPU with the CPU, and/or do the update on the GPU, but none of the
> > > infrastructure is there for that.
> > >
> > > Mark
> >
> >
> > On Fri, Dec 12, 2014 at 2:00 PM, Tomy van Batis <tomyvanbatis at gmail.com>
> > wrote:
> > >
> > > Dear all
> > >
> > > I am working with a system of about 200,000 particles. All the
> > > non-bonded interactions in the system are of Lennard-Jones type (no
> > > Coulomb). I constrain the bond lengths with LINCS. No torsion or
> > > bending interactions are taken into account.
> > >
> > >
> > > I am running the simulations on a 4-core Xeon® E5-1620 v2 @ 3.70 GHz
> > > together with an NVIDIA Tesla K20Xm. I observe a strange behavior when
> > > looking at the performance of the simulations:
> > >
> > >
> > > 1. Running on 4 cores + GPU
> > >
> > > GPU/CPU force evaluation time = 9.5 and GPU usage = 58% (I see that
> > > with the command nvidia-smi)
> > >
> > >
> > >
> > >
> > >
> > > 2. Running on 2 cores + GPU
> > >
> > > GPU/CPU force evaluation time = 9.9 and GPU usage = 45-50% (image not
> > > included due to size restrictions)
> > >
> > >
> > >
> > > The situation doesn't change if I include the option -nb gpu (or
> > > -nb gpu_cpu) in the mdrun.
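> > >
> > > For reference, a minimal invocation of the kind compared above (a
> > > sketch: topol.tpr and the thread count are placeholders, and -gpu_id 0
> > > assumes a single GPU in the node):
> > >
> > >     gmx mdrun -s topol.tpr -ntomp 4 -nb gpu -gpu_id 0
> > >
> > > with GPU utilization watched from a second terminal via nvidia-smi -l 1.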
> > >
> > >
> > > I can see in the mailing list that the GPU/CPU force evaluation time
> > > should be about 1, which means that I am far away from the optimal
> > > performance.
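> > >
> > > (That number is reported in md.log; assuming the log file is named
> > > md.log, something like the following pulls out the relevant line:)
> > >
> > >     grep "Force evaluation time" md.log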
> > >
> > >
> > > Does anybody have any suggestions about how to improve the
> > > computational speed?
> > >
> > >
> > > Thanks in advance,
> > >
> > > Tommy
> > >
> > >
> > >
> > >
> >

