[gmx-users] Running Gromacs in parallel

Szilárd Páll pall.szilard at gmail.com
Wed Sep 21 19:05:50 CEST 2016


Hi,

On Wed, Sep 21, 2016 at 5:44 PM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
> Hi Szilárd,
>
> Yes, I had looked at it, but not with our cluster in mind. I now have a
> couple of GPU systems (both have an 8-core i7-4790K CPU; one has a single
> Titan X GPU and the other has two), and have been thinking about getting
> the most out of them. I listened to Carsten's BioExcel webinar this
> morning and it got me thinking about the cluster as well. I've just had a
> quick look now and it suggests Nrank = Nc and Nth = 1 for high core
> counts, which I think worked slightly less well for me, but I can't find
> the details so I may be remembering wrong.

That's not unexpected: the reported values are specific to the hardware
and benchmark systems used, and only give a rough idea of where the
ranks/threads balance should be.
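
For example, on one of your single-node GPU boxes a quick scan of the
rank/thread split could look roughly like this (a sketch only; topol.tpr,
the output names and the particular splits are just illustrations, and
the -gpu_id strings assume the single-GPU box):

gmx mdrun -ntmpi 8 -ntomp 1 -pin on -gpu_id 00000000 -s topol.tpr -deffnm scan_8x1  # 8 ranks x 1 thread
gmx mdrun -ntmpi 4 -ntomp 2 -pin on -gpu_id 0000 -s topol.tpr -deffnm scan_4x2      # 4 ranks x 2 threads
gmx mdrun -ntmpi 2 -ntomp 4 -pin on -gpu_id 00 -s topol.tpr -deffnm scan_2x4        # 2 ranks x 4 threads

Comparing the Performance lines at the end of the three logs then tells
you which split that particular system actually prefers; on the two-GPU
box you would list both GPU ids in the string (e.g. 0011 for four ranks).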
>
> I don't have log files from a systematic benchmark of our cluster as it
> isn't really available enough for doing that.

That's not really necessary; even the logs from a single production run
can hint at possible improvements.
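
Concretely, the tail of the log file is usually enough; something along
these lines (md.log is just the assumed file name):

grep "Performance:" md.log      # ns/day summary at the end of the run
grep -A 30 "C Y C L E" md.log   # the cycle and time accounting table

How the wall time splits between force, PME mesh and the communication
rows in that table is what hints at a better rank/thread or PP/PME
balance.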

> I haven't tried gmx tune_pme
> on there either. I do have node-specific installations of gromacs-5.0.4
> but I think they were done with gcc-4.4.7 so there's room for improvement
> there.

If that's the case, I'd simply recommend using a modern compiler and, if
you can, a recent GROMACS version; you'll gain more performance from that
than from most launch-configuration tuning.
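
As a rough sketch (the compiler and install path below are placeholders
for whatever is available on your nodes), a node-local rebuild would look
something like:

# gcc-5 and the install prefix are placeholders, not a recommendation
CC=gcc-5 CXX=g++-5 cmake .. -DGMX_MPI=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2016
make -j 8 && make install

The SIMD level is detected from the build host by default, so keeping the
builds node-specific, as you already do, is the part worth preserving.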

> The cluster nodes I have been using have the following cpu specs
> and 10Gb networking. It could be that using 2 OpenMP threads per MPI rank
> works nicely because it matches the CPU configuration and makes better use
> of hyperthreading.

Or because of the network. Or for some other reason. Again, comparing
the runs' log files could tell more :)
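
E.g., a head-to-head on a few of those nodes would already be telling
(sketch only; the mdrun_mpi binary name and the 4-node/32-core job size
are assumptions about your setup):

mpirun -np 32 mdrun_mpi -ntomp 1 -pin on -deffnm cmp_32x1   # 1 thread per rank
mpirun -np 16 mdrun_mpi -ntomp 2 -pin on -deffnm cmp_16x2   # 2 threads per rank

If the 1-thread run loses its time in the "Comm." rows of the cycle
accounting, the network is the likely culprit; if it loses it elsewhere,
it is something on the node (caches, hyperthreading, load imbalance).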

> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                8
> On-line CPU(s) list:   0-7
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 26
> Model name:            Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
> Stepping:              5
> CPU MHz:               2393.791
> BogoMIPS:              4787.24
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              8192K
> NUMA node0 CPU(s):     0,2,4,6
> NUMA node1 CPU(s):     1,3,5,7
>
> I appreciate that a lot is system-dependent and that I can't really help
> you help me very much. It should also be noted that my multi runs are
> multiple-walker metadynamics runs and are slowing down because there are
> large bias potentials in memory that need to be communicated around too.
> As I said, I haven't had a chance to make separate benchmark runs but
> have just made observations based on existing runs.

Understandable, I was just giving tips and hints.

Cheers,
--
Sz.


> Best wishes
> James
>
>> Performance tuning is highly dependent on the simulation system and
>> the hardware you're running on. Questions like the ones you pose are
>> impossible to answer meaningfully without *full* log files (and
>> hardware specs including network).
>>
>> Have you checked the performance checklist I linked above?
>> --
>> Szilárd
>>
>>
>> On Wed, Sep 21, 2016 at 11:36 AM,  <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>>> I wonder whether what I see (that -np 108 with -ntomp 2 is best) comes
>>> from using -multi 6 with 8-CPU nodes. That level of parallelism may
>>> then be necessary to trigger automatic separation of PP and PME ranks.
>>> I'm not sure if I tried -np 54 with -ntomp 4, which would probably also
>>> do it. I compared mostly on 196 CPUs, then found that going up to 216
>>> was better than 196 with -ntomp 2, and that pure MPI (-ntomp 1) was
>>> considerably worse for both. Would people recommend going back to 196,
>>> which allows 4 whole nodes per replica, and playing with -npme and
>>> -ntomp_pme?
>>>
>>>> Hi Thanh Le,
>>>>
>>>> Assuming all the nodes are the same (9 nodes with 12 CPUs each), you
>>>> could try the following:
>>>>
>>>> mpirun -np 9 --map-by node mdrun -ntomp 12 ...
>>>> mpirun -np 18 mdrun -ntomp 6 ...
>>>> mpirun -np 54 mdrun -ntomp 2 ...
>>>>
>>>> Which of these works best will depend on your setup.
>>>>
>>>> Using the whole cluster for one job may not be the most efficient way.
>>>> I found on our cluster that once I reach 216 CPUs (settings from the
>>>> queuing system equivalent to -np 108 and -ntomp 2), I can't do better
>>>> by adding more nodes (presumably because communication becomes an
>>>> issue). In addition to running -multi or -multidir jobs, which takes
>>>> some load off the communication, it may also be worth having separate
>>>> jobs and using -pin on and -pinoffset.
>>>>
>>>> Best wishes
>>>> James
>>>>
>>>>> Hi everyone,
>>>>> I have a question concerning running GROMACS in parallel. I have read
>>>>> over
>>>>> http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
>>>>> but I still don't quite understand how to run it efficiently.
>>>>> My GROMACS version is 4.5.4.
>>>>> The cluster I am using has 108 CPUs in total and 4 hosts up.
>>>>> The node I am using:
>>>>> Architecture:          x86_64
>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>> Byte Order:            Little Endian
>>>>> CPU(s):                12
>>>>> On-line CPU(s) list:   0-11
>>>>> Thread(s) per core:    2
>>>>> Core(s) per socket:    6
>>>>> Socket(s):             1
>>>>> NUMA node(s):          1
>>>>> Vendor ID:             AuthenticAMD
>>>>> CPU family:            21
>>>>> Model:                 2
>>>>> Stepping:              0
>>>>> CPU MHz:               1400.000
>>>>> BogoMIPS:              5200.57
>>>>> Virtualization:        AMD-V
>>>>> L1d cache:             16K
>>>>> L1i cache:             64K
>>>>> L2 cache:              2048K
>>>>> L3 cache:              6144K
>>>>> NUMA node0 CPU(s):     0-11
>>>>> MPI is already installed. I also have permission to use the cluster as
>>>>> much as I can.
>>>>> My question is: how should I write my mdrun command to utilize all
>>>>> the possible cores and nodes?
>>>>> Thanks,
>>>>> Thanh Le


More information about the gromacs.org_gmx-users mailing list