[gmx-users] Improving GPU performance on Bridges HPC cluster

Wed Sep 7 02:44:45 CEST 2016

Hi,

Have you checked onlinelibrary.wiley.com/doi/10.1002/jcc.24030/full,
especially Fig 8?

A few things I noticed while scrolling through briefly:
- You're not pinning threads in mdrun (the excessive CPU-GPU
balancing with quarter vs full node runs is suspicious).
- Typically multiple ranks per GPU (with DD) is beneficial. The factor
of 7 in the core count makes that harder to accomplish. You'll have to
either under-utilize the 4th GPU of the node with 7 ranks x 4 threads
or 14 ranks x 2 threads, or leave cores empty with 8 ranks x 3
threads (but I doubt this will be worth it).
- You're using vsites, yet running with a 2 fs time-step?
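
To make the rank/thread arithmetic in the second point concrete, here is
a minimal Python sketch. It assumes one node with 28 cores and 4 GPUs, as
described in the quoted message. The helper name `describe` is made up
for illustration, and spreading ranks as evenly as possible over the GPUs
is only one possible mapping. The `gpu_id` string follows the
one-digit-per-PP-rank convention of mdrun's -gpu_id option in the
5.x/2016 releases; check `gmx mdrun -h` for your version.

```python
def describe(ranks, threads, cores=28, gpus=4):
    """Summarize one 'ranks x threads' layout on a single node.

    Hypothetical helper: assumes every rank is a PP rank and that ranks
    are spread over the GPUs as evenly as possible, lowest GPU id first.
    """
    used = ranks * threads
    if used > cores:
        raise ValueError("layout oversubscribes the node")
    # Even split: the first `extra` GPUs get one rank more than the rest.
    base, extra = divmod(ranks, gpus)
    per_gpu = [base + (1 if g < extra else 0) for g in range(gpus)]
    # One digit per PP rank, e.g. "0011" means two ranks each on GPUs 0 and 1.
    gpu_id = "".join(str(g) * n for g, n in enumerate(per_gpu))
    return {
        "idle_cores": cores - used,
        "ranks_per_gpu": per_gpu,
        "gpu_id": gpu_id,
    }


# The three layouts mentioned above.
for ranks, threads in [(7, 4), (14, 2), (8, 3)]:
    print(f"{ranks} ranks x {threads} threads:", describe(ranks, threads))
```

With a thread-MPI build, the 8 x 3 layout would correspond to something
like `gmx mdrun -ntmpi 8 -ntomp 3 -pin on -gpu_id 00112233`. With an MPI
build the rank count comes from the launcher (mpirun/srun) instead of
-ntmpi.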

Note that scaling will take a noticeable hit from one to multiple
ranks (due to DD), so also compare scaling from e.g. 1 CPU + 2 GPUs to
2 CPUs + 4 GPUs.
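
One way to put a number on that comparison is parallel efficiency
relative to a multi-rank baseline. The sketch below is only
illustrative: the function name is hypothetical, and the performance
values are whatever ns/day your own md.log files report for the two
runs.

```python
def parallel_efficiency(perf_base, perf_scaled, resource_ratio):
    """Fraction of the ideal speedup achieved.

    perf_base      -- ns/day of the baseline run (e.g. 1 CPU + 2 GPUs)
    perf_scaled    -- ns/day of the larger run (e.g. 2 CPUs + 4 GPUs)
    resource_ratio -- how many times more hardware the larger run uses
    """
    if perf_base <= 0 or resource_ratio <= 0:
        raise ValueError("baseline performance and ratio must be positive")
    return (perf_scaled / perf_base) / resource_ratio


# Placeholder numbers for illustration only, not measurements.
eff = parallel_efficiency(perf_base=10.0, perf_scaled=17.0, resource_ratio=2)
print(f"efficiency: {eff:.2f}")
```

Using a 2-GPU baseline means both runs already pay the DD overhead, so
the ratio reflects scaling rather than the one-off cost of switching
decomposition on.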

Cheers,
--
Szilárd

On Tue, Sep 6, 2016 at 8:39 PM, Benjamin Joseph Coscia
<Benjamin.Coscia at colorado.edu> wrote:
> Hello Gromacs users,
>
> Our group has begun running simulations on the XSEDE resource, Bridges, and
> we are trying to maximize our performance on the GPU nodes. The nodes are
> configured so that there are two Tesla K80 accelerators each consisting of
> 2 GK210 GPUs. Additionally, there are two CPUs on the node, each with 14
> cores.
>
> Generally when I run on any node, I've found the best performance occurs
> when I assign 7 cores per MPI process. On the GPU nodes, I have been giving
> each MPI process one GPU to work with. A representative slurm submission
> script (Run.sh) which used one full GPU node (4 GPUs, 28 CPU cores) is
> contained in the folder shared through dropbox at the end of this email.
>
> I've turned dynamic load balancing on, although I think it turns on by
> default so I didn't see a performance difference there.
>
> The systems for which I have scaling data are both membrane systems. One is
> an ordered membrane and the unit cell is heterogeneous. There is vacuum on
> the top and bottom of the system. It has ~65000 atoms. The second system is
> an amorphous membrane which is homogeneous and contains about 139000 atoms.
>
> The dropbox link contains the results of the scaling studies
> (Bridges_GPU_scaling.ods) I've done on a single node with varied numbers of
> GPUs (7 CPU cores allocated per GPU) and varied PP:PME loads. Generally, I
> did not see any significant performance increase (usually a decrease) from
> varying the PP:PME ranks. Also, the scaling from 1 to 4 GPUs does not seem
> to be too great.
>
> I've also included select .log files and .out files from slurm along with
> input files which can be used to reproduce both systems.
>
> Maybe GROMACS algorithms are doing a good job figuring out the optimal run
> conditions given that I am unable to beat performance using the default
> settings, but I would think there is a way to get better performance since
> I know a lot about the systems.
>
> Please let me know if you have any suggestions on how to further increase
> performance. We'd like to implement recommendations into our own systems
> and also pass the information on to people who work on Bridges so that they
> can put forth some best practices.
>
> Please use this link:
> https://www.dropbox.com/s/kmy7d15dijvtr2j/GPU_jobs.tgz?dl=0 to access all
> files necessary to reproduce my simulations.
>
> Best,
>
> Ben Coscia