[gmx-developers] Gromacs with GPU

Szilárd Páll pall.szilard at gmail.com
Fri Sep 22 14:50:04 CEST 2017


Indeed, the device selection does not take locality into account (nor
does it handle other things like picking the "fast" GPUs in mixed
environments).

In an ideal world, NUMA distances could be queried even without hwloc
(using NVML), but we have not prioritized doing this, nor am I certain
that locality tuning would help users more than it would confuse them.
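
For illustration, a minimal sketch of the kind of NVML-only locality
query meant here; it reports each device's ideal CPU mask rather than
NUMA distances proper, and it is not anything mdrun does today:

  /* Query GPU-to-CPU locality via NVML only (no hwloc).
   * Build with: gcc nvml_locality.c -lnvidia-ml
   */
  #include <stdio.h>
  #include <nvml.h>

  int main(void)
  {
      enum { CPUSET_WORDS = 16 };   /* room for 16*64 = 1024 logical CPUs */
      unsigned long cpuset[CPUSET_WORDS];
      unsigned int  count, i, w;

      if (nvmlInit() != NVML_SUCCESS ||
          nvmlDeviceGetCount(&count) != NVML_SUCCESS)
      {
          fprintf(stderr, "NVML not available\n");
          return 1;
      }
      for (i = 0; i < count; i++)
      {
          nvmlDevice_t dev;
          if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
              nvmlDeviceGetCpuAffinity(dev, CPUSET_WORDS, cpuset) == NVML_SUCCESS)
          {
              /* Devices reporting the same mask are local to the same socket. */
              printf("GPU %u ideal CPU mask words:", i);
              for (w = 0; w < CPUSET_WORDS; w++)
              {
                  printf(" %lx", cpuset[w]);
              }
              printf("\n");
          }
      }
      nvmlShutdown();
      return 0;
  }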

Most often, when mdrun is not meant to run on all cores/GPUs, and there
is therefore flexibility in picking hardware or in rank-to-core/GPU
placement within a node, it is actually a node-sharing scenario. Here,
the main issue is not that we can't pick the right resources (that
takes a fairly small amount of extra code), but that we can't magically
figure out what the user/job scheduler meant if they were not clear
enough about it. Users, of course, would ideally like to have their
cake and eat it too: have mdrun magically use the right cores/GPUs
without having to fiddle with affinity, placement, etc.
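
To be concrete about the mechanical part of "picking the right
resources": one could intersect the CPU set the scheduler actually gave
a rank with each GPU's local CPU mask and take the best overlap. A
rough sketch; the per-GPU core ranges below are made-up example values
for a node like the one reported (in practice they would come from NVML
or hwloc):

  /* Pick the GPU "closest" to the cores this rank is allowed to run on. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
      /* Example locality table: GPUs 0,1 local to cores 0-13,
       * GPUs 2,3 local to cores 14-27 (dual-socket, two K80 boards). */
      const int num_gpus         = 4;
      const int gpu_first_core[] = { 0, 0, 14, 14 };
      const int gpu_last_core[]  = { 13, 13, 27, 27 };

      cpu_set_t allowed;
      if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
      {
          perror("sched_getaffinity");
          return 1;
      }

      int best_gpu = -1, best_overlap = -1;
      for (int g = 0; g < num_gpus; g++)
      {
          int overlap = 0;
          for (int c = gpu_first_core[g]; c <= gpu_last_core[g]; c++)
          {
              if (CPU_ISSET(c, &allowed))
              {
                  overlap++;
              }
          }
          if (overlap > best_overlap)
          {
              best_overlap = overlap;
              best_gpu     = g;
          }
      }
      printf("rank-local GPU pick: %d (overlap %d cores)\n",
             best_gpu, best_overlap);
      return 0;
  }

Note that this only resolves locality; spreading several ranks over the
GPUs of one socket still requires knowing what the user/scheduler
intended, which is exactly the ambiguity described above.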

To conclude, there is room for placement optimization, but I'm not
certain how often one would under-utilize a node (with exclusive
scheduling). Perhaps for power optimization this would make sense (as
GROMACS can often be CPU-bound)?

Cheers,
--
Szilárd


On Fri, Sep 22, 2017 at 1:10 PM, Åke Sandgren <ake.sandgren at hpc2n.umu.se> wrote:
> Hi!
>
> I am seeing a possible performance enhancement when running GROMACS
> on nodes with multiple GPU cards.
> (And yes, I know this is perhaps a moot point since current GPU cards
> don't have dual engines per card.)
>
> System:
> dual-socket 14-core Broadwell CPUs
> 2 K80 cards, one on each socket
>
> Gromacs built with hwloc support.
>
> When running a dual-node (56-core)
>
> gmx_mpi mdrun -npme 4 -s ion_channel_bench00.tpr -resetstep 20000 -o
> bench.trr -x bench.xtc -cpo bench.cpt -c bench.gro -e bench.edr -g
> bench.log -ntomp 7 -pin on -dlb yes
>
> job (Slurm + cgroups), GROMACS doesn't fully take the hwloc info into
> account. The job correctly gets allocated on cores, but looking at
> nvidia-smi and hwloc-ps I can see that the PP processes are using a
> suboptimal selection of GPU engines.
>
> The PP processes are placed one on each CPU socket (according to which
> process IDs are using the GPUs and the position of those PIDs according
> to hwloc-ps), but they both use GPU engines from the same (first) K80 card.
>
> It would be better to have looked at the hwloc info and selected CUDA
> devices 0,2 (or 1,3) instead of 0,1.
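>
> To illustrate the kind of hwloc-based mapping I mean, a sketch built on
> hwloc's CUDA runtime helper (not what mdrun currently does); devices
> that report the same local cpuset sit on the same socket/K80 board:
>
>   /* Print the local CPU set of each CUDA runtime device via hwloc.
>    * Build with: gcc cuda_locality.c -lhwloc -lcudart
>    */
>   #include <stdio.h>
>   #include <cuda_runtime_api.h>
>   #include <hwloc.h>
>   #include <hwloc/cudart.h>
>
>   int main(void)
>   {
>       int              ndev = 0;
>       hwloc_topology_t topo;
>
>       hwloc_topology_init(&topo);
>       hwloc_topology_load(topo);
>       cudaGetDeviceCount(&ndev);
>
>       for (int i = 0; i < ndev; i++)
>       {
>           hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
>           if (hwloc_cudart_get_device_cpuset(topo, i, cpuset) == 0)
>           {
>               char buf[256];
>               hwloc_bitmap_snprintf(buf, sizeof(buf), cpuset);
>               printf("CUDA device %d local cpuset: %s\n", i, buf);
>           }
>           hwloc_bitmap_free(cpuset);
>       }
>       hwloc_topology_destroy(topo);
>       return 0;
>   }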
>
>
> Any comments on that?
>
> Attached nvidia-smi + hwloc-ps output
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>

