[gmx-users] Running Gromacs in parallel

jkrieger at mrc-lmb.cam.ac.uk jkrieger at mrc-lmb.cam.ac.uk
Wed Sep 21 21:55:27 CEST 2016


Thanks Sz.

Do you think going from version 5.0.4 up to 5.1.4 would really make
such a big difference?

Here is a log file from a single MD run (one that has finished, unlike the
metadynamics) with the number of OpenMP threads matching how many threads
there are on each node. This has been restarted a number of times with
different launch configurations, varying mostly the number of nodes and
the node type (either 8 CPUs or 24 CPUs).
https://www.dropbox.com/s/uxzsj3pm31n66nz/md.log?dl=0
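
To be explicit, those launches were essentially one MPI rank per node with
OpenMP filling the node, i.e. something like the following (assuming an
MPI-enabled 5.x build invoked as gmx_mpi and OpenMPI's mpirun; the node
count and file names are placeholders, and restarts would add -cpi md.cpt):

mpirun -np 4 --map-by node gmx_mpi mdrun -ntomp 8 -deffnm md    # 8-thread nodes
mpirun -np 4 --map-by node gmx_mpi mdrun -ntomp 24 -deffnm md   # 24-thread nodes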

From the timesteps at which checkpoints were written I can see that these
configurations make quite a difference, and that, on a per-CPU basis, having
8 OpenMP threads per MPI process becomes a much worse idea when stepping
from 4 nodes to 6 nodes, i.e. having more CPUs makes mixed parallelism less
favourable, as suggested in figure 8. Yes, the best may not lie at 1 OpenMP
thread per MPI rank and may vary depending on the number of CPUs as well. I
can also see that, for the same number of CPUs, the 24-thread nodes are
better than the 8-thread nodes, but I can't get so many of them as they are
also more popular with RELION users. What can I infer from the information
at the end?
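
For concreteness, the rank/thread splits I am comparing on, say, six of the
8-thread nodes (48 CPUs) would look roughly like this (same assumptions and
placeholders as above):

mpirun -np 6 --map-by node gmx_mpi mdrun -ntomp 8 -deffnm md   # 8 OpenMP threads per rank
mpirun -np 24 gmx_mpi mdrun -ntomp 2 -deffnm md                # 2 threads per rank
mpirun -np 48 gmx_mpi mdrun -ntomp 1 -deffnm md                # pure MPI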

Best wishes
James

> Hi,
>
> On Wed, Sep 21, 2016 at 5:44 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>> Hi Szilárd,
>>
>> Yes I had looked at it but not with our cluster in mind. I now have a
>> couple of GPU systems (both have an 8-core i7-4790K CPU, with one
>> Titan X GPU on one system and two Titan X GPUs on the other), and have
>> been thinking about getting the most out of them. I listened to
>> Carsten's BioExcel webinar this morning and it got me thinking about
>> the cluster as well. I've just had a quick look now and it suggests
>> Nrank = Nc and Nth = 1 for high core counts, which I think worked
>> slightly less well for me, but I can't find the details so I may be
>> remembering wrong.
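>>
>> (On the two-GPU box, I imagine the starting point would be something
>> along the lines of
>>
>> gmx mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm md
>>
>> i.e. one thread-MPI rank per Titan X with 4 OpenMP threads each; that
>> split and the file name are just a guess, not something I've
>> benchmarked.)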
>
> That's not unexpected; the reported values are specific to the
> hardware and benchmark systems and only give a rough idea of where the
> ranks/threads balance should be.
>>
>> I don't have log files from a systematic benchmark of our cluster as it
>> isn't really available enough for doing that.
>
> That's not really necessary; even logs from a single production run
> can hint at possible improvements.
>
>> I haven't tried gmx tune_pme
>> on there either. I do have node-specific installations of gromacs-5.0.4
>> but I think they were done with gcc-4.4.7 so there's room for
>> improvement
>> there.
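>>
>> (If I do get a slot to try gmx tune_pme, I gather from the 5.x docs
>> that the invocation would be roughly as below; the file name and rank
>> count are placeholders, and it launches its own benchmark runs via the
>> MPIRUN environment variable:
>>
>> export MPIRUN="mpirun"
>> gmx tune_pme -np 32 -s md.tpr -mdrun "gmx_mpi mdrun" -steps 1000
>>
>> which scans different numbers of separate PME ranks for a 32-rank run.)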
>
> If that's the case, I'd simply recommend using a modern compiler and,
> if you can, a recent GROMACS version; you'll gain more performance that
> way than from most launch config tuning.
>
>> The cluster nodes I have been using have the following cpu specs
>> and 10Gb networking. It could be that using 2 OpenMP threads per MPI
>> rank
>> works nicely because it matches the CPU configuration and makes better
>> use
>> of hyperthreading.
>
> Or because of the network. Or for some other reason. Again, comparing
> the runs' log files could tell more :)
>
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                8
>> On-line CPU(s) list:   0-7
>> Thread(s) per core:    2
>> Core(s) per socket:    2
>> Socket(s):             2
>> NUMA node(s):          2
>> Vendor ID:             GenuineIntel
>> CPU family:            6
>> Model:                 26
>> Model name:            Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
>> Stepping:              5
>> CPU MHz:               2393.791
>> BogoMIPS:              4787.24
>> Virtualization:        VT-x
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              8192K
>> NUMA node0 CPU(s):     0,2,4,6
>> NUMA node1 CPU(s):     1,3,5,7
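>>
>> (With that topology, the 2-threads-per-rank case I mean corresponds,
>> on a single 8-thread node, to roughly the following; the binary name
>> and the use of -pin on are my guess rather than something I've
>> verified:
>>
>> mpirun -np 4 gmx_mpi mdrun -ntomp 2 -pin on ...
>>
>> i.e. 4 ranks per node, each using the 2 hardware threads of one core.)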
>>
>> I appreciate that a lot is system-dependent and that I can't really
>> help you help me very much. It should also be noted that my multi runs
>> are multiple walker metadynamics runs and are slowing down because
>> there are large bias potentials in memory that need to be communicated
>> around too. As I said, I haven't had a chance to make separate
>> benchmark runs but have just made observations based on existing runs.
>
> Understandable, I was just giving tips and hints.
>
> Cheers,
> --
> Sz.
>
>
>> Best wishes
>> James
>>
>>> Performance tuning is highly dependent on the simulation system and
>>> the hardware you're running on. Questions like the ones you pose are
>>> impossible to answer meaningfully without *full* log files (and
>>> hardware specs including network).
>>>
>>> Have you checked the performance checklist I linked above?
>>> --
>>> Szilárd
>>>
>>>
>>> On Wed, Sep 21, 2016 at 11:36 AM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>>>> I wonder whether the fact that -np 108 and -ntomp 2 works best for me
>>>> comes from using -multi 6 with 8-CPU nodes. That level of parallelism
>>>> may then be necessary to trigger automatic segregation of PP and PME
>>>> ranks. I'm not sure if I tried -np 54 and -ntomp 4, which would
>>>> probably also do it. I compared mostly on 196 CPUs, then found that
>>>> going up to 216 was better than 196 with -ntomp 2, and that pure MPI
>>>> (-ntomp 1) was considerably worse for both. Would people recommend
>>>> going back to 196, which allows 4 whole nodes per replica, and playing
>>>> with -npme and -ntomp_pme?
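>>>>
>>>> For example (a sketch only; the rank count and file name are
>>>> hypothetical), something like
>>>>
>>>> mpirun -np 32 gmx_mpi mdrun -ntomp 2 -npme 8 -ntomp_pme 2 -deffnm meta
>>>>
>>>> would dedicate 8 of the 32 ranks to PME with 2 OpenMP threads each,
>>>> rather than letting mdrun choose the PP/PME split automatically.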
>>>>
>>>>> Hi Thanh Le,
>>>>>
>>>>> Assuming all the nodes are the same (9 nodes with 12 CPUs each), you
>>>>> could try the following:
>>>>>
>>>>> mpirun -np 9 --map-by node mdrun -ntomp 12 ...  (1 rank per node, 12 OpenMP threads each)
>>>>> mpirun -np 18 mdrun -ntomp 6 ...                (2 ranks per node, 6 threads each)
>>>>> mpirun -np 54 mdrun -ntomp 2 ...                (6 ranks per node, 2 threads each)
>>>>>
>>>>> Which of these works best will depend on your setup.
>>>>>
>>>>> Using the whole cluster for one job may not be the most efficient
>>>>> way. I found on our cluster that once I reach 216 CPUs (settings from
>>>>> the queuing system equivalent to -np 108 and -ntomp 2), I can't do
>>>>> better by adding more nodes (presumably because communication becomes
>>>>> an issue). In addition to running -multi or -multidir jobs, which
>>>>> takes some load off the communication, it may also be worth having
>>>>> separate jobs and using -pin on and -pinoffset.
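>>>>>
>>>>> For instance, two independent runs sharing one 12-CPU node could be
>>>>> pinned side by side roughly as follows (thread counts, offsets and
>>>>> file names are only illustrative, and these options need GROMACS 4.6
>>>>> or later):
>>>>>
>>>>> mdrun -ntmpi 1 -ntomp 6 -pin on -pinoffset 0 -deffnm jobA &
>>>>> mdrun -ntmpi 1 -ntomp 6 -pin on -pinoffset 6 -deffnm jobB &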
>>>>>
>>>>> Best wishes
>>>>> James
>>>>>
>>>>>> Hi everyone,
>>>>>> I have a question concerning running GROMACS in parallel. I have read
>>>>>> over
>>>>>> http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
>>>>>> but I still don't quite understand how to run it efficiently.
>>>>>> My GROMACS version is 4.5.4.
>>>>>> The cluster I am using has 108 CPUs in total and 4 hosts up.
>>>>>> The node I am using:
>>>>>> Architecture:          x86_64
>>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>>> Byte Order:            Little Endian
>>>>>> CPU(s):                12
>>>>>> On-line CPU(s) list:   0-11
>>>>>> Thread(s) per core:    2
>>>>>> Core(s) per socket:    6
>>>>>> Socket(s):             1
>>>>>> NUMA node(s):          1
>>>>>> Vendor ID:             AuthenticAMD
>>>>>> CPU family:            21
>>>>>> Model:                 2
>>>>>> Stepping:              0
>>>>>> CPU MHz:               1400.000
>>>>>> BogoMIPS:              5200.57
>>>>>> Virtualization:        AMD-V
>>>>>> L1d cache:             16K
>>>>>> L1i cache:             64K
>>>>>> L2 cache:              2048K
>>>>>> L3 cache:              6144K
>>>>>> NUMA node0 CPU(s):     0-11
>>>>>> MPI is already installed. I also have permission to use the cluster
>>>>>> as much as I can.
>>>>>> My question is: how should I write my mdrun command to utilize all
>>>>>> the possible cores and nodes?
>>>>>> Thanks,
>>>>>> Thanh Le