[gmx-users] Running Gromacs in parallel

Szilárd Páll pall.szilard at gmail.com
Fri Sep 23 14:20:22 CEST 2016


On Thu, Sep 22, 2016 at 6:39 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
> Thanks again
>
>> On Wed, Sep 21, 2016 at 9:55 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>>> Thanks Sz.
>>>
>>> Do you think going up from version 5.0.4 to 5.1.4 would really make
>>> such a big difference?
>>
>> Note that I was recommending using a modern compiler + the latest
>> release (which is called 2016, not 5.1.4!). It's hard to guess the
>> improvements, but from 5.0 -> 2016 you should see double-digit
>> percentage improvements, and going from gcc 4.4 to 5.x or 6.0 will
>> also bring a significant improvement.
>
> I was still thinking of 2016 as too new to be used for simulations I might
> want to publish. I will try it when I can then.

Somebody has to try it, otherwise it stays as new as it was on the day
of release, doesn't it?

>>
>>> Here is a log file from a single MD run (which has finished, unlike
>>> the metadynamics) with the number of OpenMP threads matching the
>>> number of threads on each node. It has been restarted a number of
>>> times with different launch configurations, varying mostly the number
>>> of nodes and the node type (either 8 CPUs or 24 CPUs).
>>> https://www.dropbox.com/s/uxzsj3pm31n66nz/md.log?dl=0
>>
>> You seem to be using a single MPI rank per node in these runs. That
>> will almost never be optimal, especially not when DD is not limited.
>
> Yes, I only realised that recently, and I thought it might be useful to
> see this log since it is a complete run and has the performance summary
> at the bottom. Here is a multiple-walker metadynamics log, which
> includes some other combinations I tried.
>
> https://www.dropbox.com/s/td7ps45dzz1otwz/from_cluster_metad0.log?dl=0

No performance data is printed there so it's hard to say anything. I
suggest you do separate short benchmark runs if you want to learn
about or tune performance.
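
For example, a short benchmark could reuse one of your existing .tpr
files and reset the timers halfway so that startup costs don't skew the
reported ns/day; a minimal sketch (the binary name, file names, and
rank/thread counts are placeholders to adapt to your nodes, assuming a
GROMACS 5.x MPI build):

    mpirun -np 16 gmx_mpi mdrun -s topol.tpr -ntomp 2 -nsteps 5000 \
           -resethway -noconfout -deffnm bench_16x2

Repeating that for a handful of rank/thread combinations and comparing
the performance tables at the end of the log files is usually enough to
pick a launch configuration.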

>>
>>> From the timesteps at which checkpoints were written I can see that
>>> these configurations make quite a difference. Per CPU, having 8 OpenMP
>>> threads per MPI process becomes a much worse idea when stepping from 4
>>> nodes to 6 nodes, i.e. having more CPUs makes mixed parallelism less
>>> favourable, as suggested in figure 8. Yes, the best may not lie at 1
>>> OpenMP thread per MPI rank and may vary depending on the number of
>>> CPUs as well.
>>
>> Sure, but 8 threads spanning two sockets will definitely be
>> suboptimal. Start by trying fewer, and consider using separate PME
>> ranks, especially if you have Ethernet.
>
> ok
>
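
To make the separate-PME-rank suggestion above concrete, a rough sketch
for four of the 8-CPU nodes might look like the following (the 24/8
PP/PME split is only a starting guess, and the binary and file names are
placeholders):

    mpirun -np 32 gmx_mpi mdrun -s topol.tpr -npme 8 -ntomp 1

i.e. one rank per hardware thread, with a quarter of the ranks dedicated
to PME, rather than one 8-thread rank per node; mdrun's log will note if
the PP/PME load balance is poor.
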
>>
>>> Also, I can see that for the same number of CPUs, the 24-thread nodes
>>> are better than the 8-thread nodes, but I can't get so many of them as
>>> they are also more popular with RELION users.
>>
>> FYI those are 2x6-core CPUs with Hyperthreading, so 2x12 hardware
>> threads. They are also two generations newer, so it's not surprising
>> that they are much faster. Still, 24 threads/node is too much; use fewer.
>>
>>> What can I infer from the information at the
>>> end?
>>
>> Before starting to interpret that, it's worth fixing the above issues ;)
>> Otherwise, what's clear is that PME is taking a considerable amount of
>> time, especially given the long cut-off.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>>>
>>> Best wishes
>>> James
>>>
>>>> Hi,
>>>>
>>>> On Wed, Sep 21, 2016 at 5:44 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>>>>> Hi Szilárd,
>>>>>
>>>>> Yes, I had looked at it, but not with our cluster in mind. I now have
>>>>> a couple of GPU systems (both have an 8-core i7-4790K CPU; one system
>>>>> has one Titan X GPU and the other has two), and I have been thinking
>>>>> about getting the most out of them. I listened to Carsten's BioExcel
>>>>> webinar this morning and it got me thinking about the cluster as
>>>>> well. I've just had a quick look now and it suggests Nrank = Nc and
>>>>> Nth = 1 for high core counts, which I think worked slightly less well
>>>>> for me, but I can't find the details so I may be remembering wrong.
>>>>
>>>> That's not unexpected: the reported values are specific to the
>>>> hardware and benchmark systems and only give a rough idea of where
>>>> the rank/thread balance should be.
>>>>>
>>>>> I don't have log files from a systematic benchmark of our cluster as
>>>>> it
>>>>> isn't really available enough for doing that.
>>>>
>>>> That's not really necessary; even logs from a single production run
>>>> can hint at possible improvements.
>>>>
>>>>> I haven't tried gmx tune_pme
>>>>> on there either. I do have node-specific installations of
>>>>> gromacs-5.0.4
>>>>> but I think they were done with gcc-4.4.7 so there's room for
>>>>> improvement
>>>>> there.
>>>>
>>>> If that's the case, I'd simply recommend using a modern compiler and,
>>>> if you can, a recent GROMACS version; you'll gain more performance
>>>> than from most launch-config tuning.
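
For reference, a minimal sketch of such a rebuild (compiler names,
source and install paths, and the MPI option are placeholders to adapt
to the cluster):

    CC=gcc-5 CXX=g++-5 cmake /path/to/gromacs-source -DGMX_MPI=ON \
        -DCMAKE_INSTALL_PREFIX=$HOME/opt/gromacs
    make -j 8 && make install
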
>>>>
>>>>> The cluster nodes I have been using have the following CPU specs and
>>>>> 10Gb networking. It could be that using 2 OpenMP threads per MPI rank
>>>>> works nicely because it matches the CPU configuration and makes
>>>>> better use of hyperthreading.
>>>>
>>>> Or because of the network. Or for some other reason. Again, comparing
>>>> the runs' log files could tell more :)
>>>>
>>>>> Architecture:          x86_64
>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>> Byte Order:            Little Endian
>>>>> CPU(s):                8
>>>>> On-line CPU(s) list:   0-7
>>>>> Thread(s) per core:    2
>>>>> Core(s) per socket:    2
>>>>> Socket(s):             2
>>>>> NUMA node(s):          2
>>>>> Vendor ID:             GenuineIntel
>>>>> CPU family:            6
>>>>> Model:                 26
>>>>> Model name:            Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
>>>>> Stepping:              5
>>>>> CPU MHz:               2393.791
>>>>> BogoMIPS:              4787.24
>>>>> Virtualization:        VT-x
>>>>> L1d cache:             32K
>>>>> L1i cache:             32K
>>>>> L2 cache:              256K
>>>>> L3 cache:              8192K
>>>>> NUMA node0 CPU(s):     0,2,4,6
>>>>> NUMA node1 CPU(s):     1,3,5,7
>>>>>
>>>>> I appreciate that a lot is system-dependent and that I can't really
>>>>> help you help me very much. It should also be noted that my multi
>>>>> runs are multiple-walker metadynamics runs and are slowing down
>>>>> because there are large bias potentials in memory that need to be
>>>>> communicated around too. As I said, I haven't had a chance to make
>>>>> separate benchmark runs but have just made observations based upon
>>>>> existing runs.
>>>>
>>>> Understandable, I was just giving tips and hints.
>>>>
>>>> Cheers,
>>>> --
>>>> Sz.
>>>>
>>>>
>>>>> Best wishes
>>>>> James
>>>>>
>>>>>> Performance tuning is highly dependent on the simulation system and
>>>>>> the hardware you're running on. Questions like the ones you pose are
>>>>>> impossible to answer meaningfully without *full* log files (and
>>>>>> hardware specs including network).
>>>>>>
>>>>>> Have you checked the performance checklist I linked above?
>>>>>> --
>>>>>> Szilárd
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 21, 2016 at 11:36 AM,  <jkrieger at mrc-lmb.cam.ac.uk>
>>>>>> wrote:
>>>>>>> I wonder whether what I see, that -np 108 and -ntomp 2 is best,
>>>>>>> comes from using -multi 6 with 8-CPU nodes. That level of
>>>>>>> parallelism may then be necessary to trigger automatic segregation
>>>>>>> of PP and PME ranks. I'm not sure if I tried -np 54 and -ntomp 4,
>>>>>>> which would probably also do it. I compared mostly on 196 CPUs,
>>>>>>> then found that going up to 216 was better than 196 with -ntomp 2,
>>>>>>> and that pure MPI (-ntomp 1) was considerably worse for both. Would
>>>>>>> people recommend going back to 196, which allows 4 whole nodes per
>>>>>>> replica, and playing with -npme and -ntomp_pme?
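
One way to probe that without committing to a full -multi run would be
to scan -npme for a single replica on, say, 4 of the 8-CPU nodes and
keep whichever split performs best; a hedged sketch (the step count,
file names, and candidate -npme values are made up):

    for npme in 4 6 8; do
      mpirun -np 32 gmx_mpi mdrun -s replica0.tpr -npme $npme -ntomp 1 \
             -nsteps 5000 -resethway -noconfout -deffnm npme_scan_$npme
    done

gmx tune_pme automates a similar scan if the cluster's MPI launcher
cooperates with it.
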
>>>>>>>
>>>>>>>> Hi Thanh Le,
>>>>>>>>
>>>>>>>> Assuming all the nodes are the same (9 nodes with 12 CPUs each),
>>>>>>>> you could try the following:
>>>>>>>>
>>>>>>>> mpirun -np 9 --map-by node mdrun -ntomp 12 ...  # 1 rank per node, 12 threads each
>>>>>>>> mpirun -np 18 mdrun -ntomp 6 ...                # 2 ranks per node, 6 threads each
>>>>>>>> mpirun -np 54 mdrun -ntomp 2 ...                # 6 ranks per node, 2 threads each
>>>>>>>>
>>>>>>>> Which of these works best will depend on your setup.
>>>>>>>>
>>>>>>>> Using the whole cluster for one job may not be the most efficient
>>>>>>>> way. I found on our cluster that once I reach 216 CPUs (equivalent,
>>>>>>>> via the queuing system, to -np 108 and -ntomp 2), I can't do better
>>>>>>>> by adding more nodes (presumably communication becomes an issue).
>>>>>>>> In addition to running -multi or -multidir jobs, which takes the
>>>>>>>> load off communication a bit, it may also be worth having separate
>>>>>>>> jobs and using -pin on and -pinoffset.
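
As a rough illustration of that last point, two independent runs could
share one of the 24-hardware-thread nodes along these lines (the
rank/thread counts and file names are made up, and this uses the
thread-MPI binary rather than mpirun):

    gmx mdrun -ntmpi 6 -ntomp 2 -pin on -pinstride 1 -pinoffset 0  -s jobA.tpr -deffnm jobA &
    gmx mdrun -ntmpi 6 -ntomp 2 -pin on -pinstride 1 -pinoffset 12 -s jobB.tpr -deffnm jobB &
    wait

so that each job is pinned to its own half of the hardware threads and
neither migrates onto the other's cores.
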
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> James
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>> I have a question concerning running GROMACS in parallel. I have
>>>>>>>>> read over
>>>>>>>>> http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
>>>>>>>>> but I still don't quite understand how to run it efficiently.
>>>>>>>>> My GROMACS version is 4.5.4.
>>>>>>>>> The cluster I am using has 108 CPUs total and 4 hosts up.
>>>>>>>>> The node I am using:
>>>>>>>>> Architecture:          x86_64
>>>>>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>>>>>> Byte Order:            Little Endian
>>>>>>>>> CPU(s):                12
>>>>>>>>> On-line CPU(s) list:   0-11
>>>>>>>>> Thread(s) per core:    2
>>>>>>>>> Core(s) per socket:    6
>>>>>>>>> Socket(s):             1
>>>>>>>>> NUMA node(s):          1
>>>>>>>>> Vendor ID:             AuthenticAMD
>>>>>>>>> CPU family:            21
>>>>>>>>> Model:                 2
>>>>>>>>> Stepping:              0
>>>>>>>>> CPU MHz:               1400.000
>>>>>>>>> BogoMIPS:              5200.57
>>>>>>>>> Virtualization:        AMD-V
>>>>>>>>> L1d cache:             16K
>>>>>>>>> L1i cache:             64K
>>>>>>>>> L2 cache:              2048K
>>>>>>>>> L3 cache:              6144K
>>>>>>>>> NUMA node0 CPU(s):     0-11
>>>>>>>>> MPI is already installed. I also have permission to use the
>>>>>>>>> cluster as much as I can.
>>>>>>>>> My question is: how should I write my mdrun command to utilize
>>>>>>>>> all the possible cores and nodes?
>>>>>>>>> Thanks,
>>>>>>>>> Thanh Le