[gmx-users] [Performance] poor performance with NV V100

Szilárd Páll pall.szilard at gmail.com
Wed Oct 16 14:07:09 CEST 2019


Hi,

Please keep the conversation on the mailing list.

GROMACS uses both CPUs and GPUs for computation. Your runs limit the core count
per rank in a way that leaves the rest of the cores idle. This is not a suitable
approach for realistic benchmarking, because clock boosting will skew your
scaling results.
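
As a rough illustration (only a sketch: the counts below assume a 32-core node,
e.g. two 16-core Xeon Gold 6142 sockets, with two GPUs, and need adjusting to
your actual hardware), something like

  gmx mdrun -ntmpi 8 -ntomp 4 -nb gpu -pin on -s topol.tpr

uses all 32 cores (8 thread-MPI ranks x 4 OpenMP threads each) instead of
leaving most of them idle as a single-rank run with -ntomp 4 does.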

Secondly, you should also consider using PME offload; see the docs and previous
discussions on the list for how to do so.
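
For example (again just a sketch, with thread counts that are assumptions to
tune for your machine), on a single GPU PME can typically be offloaded together
with the nonbonded work:

  gmx mdrun -ntmpi 1 -ntomp 32 -nb gpu -pme gpu -pin on -s topol.tpr

and when running several PP ranks, a single separate PME rank can be requested:

  gmx mdrun -ntmpi 8 -ntomp 4 -nb gpu -pme gpu -npme 1 -pin on -s topol.tpr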

Last, if you are evaluating hardware for particular use cases, do make sure you
set up your benchmarks so that they reflect those use cases (e.g. scaling vs.
throughput), and please check out the best practices for running GROMACS on GPU
servers.
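
For instance, if aggregate throughput is what matters, a common pattern (shown
here only as a sketch; it requires an MPI-enabled build, and the directory
names are hypothetical) is to run several independent simulations side by side
with -multidir:

  mpirun -np 4 gmx_mpi mdrun -multidir run1 run2 run3 run4 -nb gpu -pme gpu -pin on

This usually keeps a multi-GPU server busier than forcing one small system to
scale across all devices, whereas a scaling benchmark would instead time a
single system on 1, 2, ... GPUs.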

You might also be interested in a recent study we did:
https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.26011

Cheers,

--
Szilárd


On Tue, Oct 8, 2019 at 3:00 PM Jimmy Chen <catjmc at gmail.com> wrote:

> Hi Szilard,
>
> Thanks for your help.
> Is md.log enough for you to determine where the bottleneck is?
> If you need any other logs, please let me know.
>
> I just checked the release notes for 2019.4, and I didn't see any major change
> that affects intra-node performance.
>
> http://manual.gromacs.org/documentation/2020-beta1/release-notes/2019/2019.4.html
>
> Anyway, I will give 2019.4 a try later.
>
> I'm looking forward to checking out the new features that will be in the
> beta2/3 releases of 2020.
>
> Best regards,
> Jimmy
>
>
> Szilárd Páll <pall.szilard at gmail.com> wrote on Tue, Oct 8, 2019 at 8:34 PM:
>
>> Hi,
>>
>> Can you please share your log files? We may be able to help with spotting
>> performance issues or bottlenecks. However, note that NVIDIA are the best
>> source of help when it comes to reproducing their own benchmark numbers.
>>
>> Scaling across multiple GPUs requires some tuning of command line options;
>> please see the related discussion on the list (briefly: use multiple ranks
>> per GPU, and one separate PME rank with GPU offload).
>>
>> Also note that intra-node strong scaling has not been an optimization target
>> of recent releases (there are no P2P optimizations either); however, new
>> features going into the 2020 release will improve things significantly. Keep
>> an eye out for the beta2/3 releases if you are interested in checking out the
>> new features.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Mon, Oct 7, 2019 at 7:48 AM Jimmy Chen <catjmc at gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I'm evaluating an NVIDIA V100 to decide whether it is suitable for purchase,
>> > but I can't reproduce the reference performance numbers published online:
>> > https://developer.nvidia.com/hpc-application-performance
>> > https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf
>> >
>> >
>> > This is the case whether I use the Docker tag 18.02 from
>> > https://ngc.nvidia.com/catalog/containers/hpc:gromacs/tags
>> >
>> > or build the GROMACS source code from
>> > ftp://ftp.gromacs.org/pub/gromacs/gromacs-2019.3.tar.gz
>> >
>> > The test data sets are ADH dodec and water 1.5M:
>> > gmx grompp -f pme_verlet.mdp
>> > gmx mdrun -ntmpi 1 -nb gpu -pin on -v -noconfout -nsteps 5000 -s topol.tpr -ntomp 4
>> > and
>> > gmx mdrun -ntmpi 2 -nb gpu -pin on -v -noconfout -nsteps 5000 -s topol.tpr -ntomp 4
>> >
>> > My CPU is Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
>> > and GPU is NV V100 16GB PCIE.
>> >
>> > For ADH dodec, the reference performance of 2x V100 16GB PCIe at
>> > https://developer.nvidia.com/hpc-application-performance is 176 ns/day,
>> > but I only get 28 ns/day, while with 1x V100 I actually get 67 ns/day.
>> > I don't know why the result with 2x V100 is worse.
>> >
>> > For water 1.5M, the reference performance at
>> > https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf
>> > is 9.83 ns/day with 1x V100 and 10.41 ns/day with 2x V100,
>> > but what I get is 6.5 ns/day with 1x V100 and 2 ns/day with 2x V100.
>> >
>> > Could anyone suggest how I can work out what causes this performance in my
>> > environment? Is the command I use to run the tests wrong? Is there a
>> > recommended command for the tests, or a recommended source-code version to
>> > use at the moment?
>> >
>> > By the way, after checking the code it seems the MPI communication does not
>> > go through PCIe P2P or RDMA; is that correct? Are there any plans to
>> > implement this?
>> >
>> > Best regards,
>> > Jimmy
>
>

