[gmx-users] Running Gromacs in parallel

Szilárd Páll pall.szilard at gmail.com
Thu Sep 22 16:26:38 CEST 2016


On Wed, Sep 21, 2016 at 9:55 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
> Thanks Sz.
>
> Do you think going up from version 5.0.4 to 5.1.4 would really make
> such a big difference?

Note that I was recommending using a modern compiler + the latest
release (which is called 2016, not 5.1.4!). It's hard to guess the
improvements, but from 5.0 -> 2016 you should see double-digit
percentage improvements, and going from gcc 4.4 to 5.x or 6.0 will
also bring a significant improvement.

> Here is a log file from a single md run (that has finished unlike the
> metadynamics) with the number of OpenMP threads matching how many threads
> there are on each node. This has been restarted a number of times with
> different launch configurations being mostly the number of nodes and the
> node type (either 8 CPUs or 24 CPUs).
> https://www.dropbox.com/s/uxzsj3pm31n66nz/md.log?dl=0

You seem to be using a single MPI rank per node in these runs. That
will almost never be optimal, especially when DD is not limited.

> From timesteps when checkpoints were written I can see that these
> configurations make quite a difference and per CPU, having 8 OpenMP
> threads per MPI process becomes a much worse idea stepping from 4 nodes to
> 6 nodes, i.e. having more CPUs makes mixed parallelism less favourable as
> suggested in figure 8. Yes, the best may not lie at 1 OpenMP thread per
> MPI rank and may vary depending on the number of CPUs as well.

Sure, but 8 threads spanning two sockets will definitely be
suboptimal. Start by trying fewer, and consider using separate PME
ranks, especially if you have Ethernet.
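As a rough illustration of how one might shortlist -npme candidates (gmx
tune_pme does this properly by benchmarking), here is a toy heuristic that
allots roughly a quarter to a third of the ranks to PME. The function name
and the 20-35% window are my own assumptions for the sketch, not anything
GROMACS itself uses:

```python
# Toy heuristic only: the 20-35% PME fraction window is an assumption,
# not a GROMACS rule; gmx tune_pme measures the real optimum.
def candidate_pme_splits(total_ranks):
    """Return (n_pp, n_pme) splits giving PME roughly 1/4 to 1/3 of ranks."""
    splits = []
    for n_pme in range(1, total_ranks):
        if 0.20 <= n_pme / total_ranks <= 0.35:
            splits.append((total_ranks - n_pme, n_pme))
    return splits

print(candidate_pme_splits(12))  # -> [(9, 3), (8, 4)]
```

Each shortlisted split would then be tried as e.g. mpirun -np 12 mdrun
-npme 3, and the log timings compared.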

> Also, I can
> see that for the same number of CPUs, the 24-thread nodes are better than
> the 8-thread nodes but I can't get so many of them as they are also more
> popular for RELION users.

FYI those are 2x6-core CPUs with Hyper-Threading, so 2x12 hardware
threads. They are also two generations newer, so it's not surprising
that they are much faster. Still, 24 threads/node is too many; use
fewer.

> What can I infer from the information at the
> end?

Before starting to interpret that, it's worth fixing the above issues ;)
Otherwise, what's clear is that PME is taking a considerable amount of
time, especially given the long cut-off.

Cheers,
--
Szilárd


>
> Best wishes
> James
>
>> Hi,
>>
>> On Wed, Sep 21, 2016 at 5:44 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>>> Hi Szilárd,
>>>
>>> Yes I had looked at it but not with our cluster in mind. I now have a
>>> couple of GPU systems (both have an 8-core i7-4790K CPU with one Titan X
>>> GPU on one system and two Titan X GPUs on the other), and have been
>>> thinking about getting the most out of them. I listened to
>>> Carsten's
>>> BioExcel webinar this morning and it got me thinking about the cluster
>>> as
>>> well. I've just had a quick look now and it suggests Nrank = Nc and Nth
>>> =
>>> 1 for high core count, which I think worked slightly less well for me
>>> but
>>> I can't find the details so I may be remembering wrong.
>>
>> That's not unexpected; the reported values are specific to the
>> hardware and benchmark systems and only give a rough idea of where
>> the ranks/threads balance should be.
>>>
>>> I don't have log files from a systematic benchmark of our cluster as it
>>> isn't really available enough for doing that.
>>
>> That's not really necessary; even logs from a single production run
>> can hint at possible improvements.
>>
>>> I haven't tried gmx tune_pme
>>> on there either. I do have node-specific installations of gromacs-5.0.4
>>> but I think they were done with gcc-4.4.7 so there's room for
>>> improvement
>>> there.
>>
>> If that's the case, I'd simply recommend using a modern compiler
>> and, if you can, a recent GROMACS version; you'll gain more
>> performance than from most launch config tuning.
>>
>>> The cluster nodes I have been using have the following cpu specs
>>> and 10Gb networking. It could be that using 2 OpenMP threads per MPI
>>> rank
>>> works nicely because it matches the CPU configuration and makes better
>>> use
>>> of hyperthreading.
>>
>> Or because of the network. Or for some other reason. Again, comparing
>> the runs' log files could tell more :)
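To make the rank/thread arithmetic concrete, a small sketch that
enumerates (ranks per node, OpenMP threads per rank) pairs exactly filling
a node's hardware threads. The helper name and the candidate thread counts
are my own choices for illustration, not any GROMACS API:

```python
# Illustrative sketch: enumerate (MPI ranks per node, OpenMP threads per
# rank) pairs that exactly fill a node's hardware threads.
def launch_configs(sockets, cores_per_socket, threads_per_core):
    hw_threads = sockets * cores_per_socket * threads_per_core
    return [(hw_threads // ntomp, ntomp)
            for ntomp in (1, 2, 4, 6, 8, 12)
            if hw_threads % ntomp == 0]

# The lscpu output below: 2 sockets x 2 cores/socket x 2 threads/core
print(launch_configs(2, 2, 2))  # -> [(8, 1), (4, 2), (2, 4), (1, 8)]
```

On such a node, (4, 2) corresponds to four MPI ranks with -ntomp 2, which
is the configuration James found to work well with Hyper-Threading.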
>>
>>> Architecture:          x86_64
>>> CPU op-mode(s):        32-bit, 64-bit
>>> Byte Order:            Little Endian
>>> CPU(s):                8
>>> On-line CPU(s) list:   0-7
>>> Thread(s) per core:    2
>>> Core(s) per socket:    2
>>> Socket(s):             2
>>> NUMA node(s):          2
>>> Vendor ID:             GenuineIntel
>>> CPU family:            6
>>> Model:                 26
>>> Model name:            Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
>>> Stepping:              5
>>> CPU MHz:               2393.791
>>> BogoMIPS:              4787.24
>>> Virtualization:        VT-x
>>> L1d cache:             32K
>>> L1i cache:             32K
>>> L2 cache:              256K
>>> L3 cache:              8192K
>>> NUMA node0 CPU(s):     0,2,4,6
>>> NUMA node1 CPU(s):     1,3,5,7
>>>
>>> I appreciate that a lot is system-dependent and that I can't really help
>>> you help me very much. It also should be noted that my multi runs are
>>> multiple walker metadynamics run and are slowing down because there are
>>> large bias potentials in memory that need to be communicated around too.
>>> As I said I haven't had a chance to make separate benchmark runs but
>>> have
>>> just made observations based upon existing runs.
>>
>> Understandable, I was just giving tips and hints.
>>
>> Cheers,
>> --
>> Sz.
>>
>>
>>> Best wishes
>>> James
>>>
>>>> Performance tuning is highly dependent on the simulation system and
>>>> the hardware you're running on. Questions like the ones you pose are
>>>> impossible to answer meaningfully without *full* log files (and
>>>> hardware specs including network).
>>>>
>>>> Have you checked the performance checklist I linked above?
>>>> --
>>>> Szilárd
>>>>
>>>>
>>>> On Wed, Sep 21, 2016 at 11:36 AM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>>>>> I wonder whether the fact that -np 108 and -ntomp 2 works best comes
>>>>> from using -multi 6 with 8-CPU nodes. That level of parallelism may
>>>>> then be
>>>>> necessary to trigger automatic segregation of PP and PME ranks. I'm
>>>>> not
>>>>> sure if I tried -np 54 and -ntomp 4, which would probably also do it.
>>>>> I
>>>>> compared mostly on 196 CPUs then found going up to 216 was better than
>>>>> 196
>>>>> with -ntomp 2 and pure MPI (-ntomp 1) was considerably worse for both.
>>>>> Would people recommend to go back to 196 which allows 4 whole nodes
>>>>> per
>>>>> replica and playing with -npme and -ntomp_pme?
>>>>>
>>>>>> Hi Thanh Le,
>>>>>>
>>>>>> Assuming all the nodes are the same (9 nodes with 12 CPUs) then you
>>>>>> could
>>>>>> try the following
>>>>>>
>>>>>> mpirun -np 9 --map-by node mdrun -ntomp 12 ...
>>>>>> mpirun -np 18 mdrun -ntomp 6 ...
>>>>>> mpirun -np 54 mdrun -ntomp 2 ...
>>>>>>
>>>>>> Which of these works best will depend on your setup.
>>>>>>
>>>>>> Using the whole cluster for one job may not be the most efficient
>>>>>> way.
>>>>>> I
>>>>>> found on our cluster that once I reach 216 CPUs (equivalent settings
>>>>>> from
>>>>>> the queuing system to -np 108 and -ntomp 2), I can't do better by
>>>>>> adding
>>>>>> more nodes (where presumably communication becomes an issue). In
>>>>>> addition
>>>>>> to running -multi or -multidir jobs, which takes the load off
>>>>>> communication a bit, it may also be worth having separate jobs and
>>>>>> using
>>>>>> -pin on and -pinoffset.
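A sketch of the -pinoffset arithmetic for packing several equal-sized jobs
onto one node (the helper name is mine; illustrative only):

```python
# Illustrative: compute -pinoffset values so that n_jobs equal-sized
# mdrun jobs on one node pin to disjoint hardware-thread ranges.
def pin_offsets(hw_threads_per_node, n_jobs):
    per_job = hw_threads_per_node // n_jobs
    return [job * per_job for job in range(n_jobs)]

# Two jobs on a 24-hw-thread node: launch with -pinoffset 0 and -pinoffset 12
print(pin_offsets(24, 2))  # -> [0, 12]
```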
>>>>>>
>>>>>> Best wishes
>>>>>> James
>>>>>>
>>>>>>> Hi everyone,
>>>>>>> I have a question concerning running gromacs in parallel. I have
>>>>>>> read
>>>>>>> over
>>>>>>> the
>>>>>>> http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
>>>>>>> but I still don't quite understand how to run it efficiently.
>>>>>>> My gromacs version is 4.5.4
>>>>>>> The cluster I am using has CPUs total: 108 and 4 hosts up.
>>>>>>> The node I am using:
>>>>>>> Architecture:          x86_64
>>>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>>>> Byte Order:            Little Endian
>>>>>>> CPU(s):                12
>>>>>>> On-line CPU(s) list:   0-11
>>>>>>> Thread(s) per core:    2
>>>>>>> Core(s) per socket:    6
>>>>>>> Socket(s):             1
>>>>>>> NUMA node(s):          1
>>>>>>> Vendor ID:             AuthenticAMD
>>>>>>> CPU family:            21
>>>>>>> Model:                 2
>>>>>>> Stepping:              0
>>>>>>> CPU MHz:               1400.000
>>>>>>> BogoMIPS:              5200.57
>>>>>>> Virtualization:        AMD-V
>>>>>>> L1d cache:             16K
>>>>>>> L1i cache:             64K
>>>>>>> L2 cache:              2048K
>>>>>>> L3 cache:              6144K
>>>>>>> NUMA node0 CPU(s):     0-11
>>>>>>> MPI is already installed. I also have permission to use the cluster
>>>>>>> as
>>>>>>> much as I can.
>>>>>>> My question is: how should I write my mdrun command run to utilize
>>>>>>> all
>>>>>>> the
>>>>>>> possible cores and nodes?
>>>>>>> Thanks,
>>>>>>> Thanh Le
>>>>>>> --
>>>>>>> Gromacs Users mailing list
>>>>>>>
>>>>>>> * Please search the archive at
>>>>>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
>>>>>>> posting!
>>>>>>>
>>>>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>>>>
>>>>>>> * For (un)subscribe requests visit
>>>>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
>>>>>>> or
>>>>>>> send
>>>>>>> a mail to gmx-users-request at gromacs.org.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>


More information about the gromacs.org_gmx-users mailing list