[gmx-users] Running Gromacs in parallel
jkrieger at mrc-lmb.cam.ac.uk
Wed Sep 21 17:44:27 CEST 2016
Yes, I had looked at it, but not with our cluster in mind. I now have a
couple of GPU systems (both have an 8-core i7-4790K CPU; one system has a
single Titan X GPU and the other has two), and have been thinking about
getting the most out of them. I listened to Carsten's BioExcel webinar
this morning and it got me thinking about the cluster as well. I've just
had a quick look now and it suggests Nrank = Nc and Nth = 1 for high core
counts, which I think worked slightly less well for me, but I can't find
the details so I may be remembering wrong.
I don't have log files from a systematic benchmark of our cluster as it
isn't really available enough for that. I haven't tried gmx tune_pme on
there either. I do have node-specific installations of gromacs-5.0.4, but
I think they were built with gcc-4.4.7, so there's room for improvement
there. The cluster nodes I have been using have 10Gb networking and the
following CPU specs. It could be that using 2 OpenMP threads per MPI rank
works nicely because it matches the CPU configuration and makes better use
of it:
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
CPU MHz: 2393.791
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0,2,4,6
NUMA node1 CPU(s): 1,3,5,7
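The "2 OpenMP threads per rank" hunch can be sanity-checked against this layout with a little arithmetic. The sketch below only derives a per-node rank/thread split from the numbers in the listing above (the values are assumptions copied from the lscpu output, not queried from hardware, and the echoed launch line is illustrative, not a tested command):

```shell
# Sketch: derive a per-node rank/thread split from the lscpu listing above.
# Values are assumptions taken from the listing, not measured.
logical_cpus=8        # On-line CPU(s) list: 0-7
threads_per_core=2    # Thread(s) per core: 2
ntomp=$threads_per_core                 # 2 OpenMP threads per MPI rank
ranks_per_node=$((logical_cpus / ntomp))
echo "per node: $ranks_per_node MPI ranks x $ntomp OpenMP threads"
# A matching launch line (echoed here, not executed) would be e.g.:
echo "mpirun -np $ranks_per_node mdrun -ntomp $ntomp -pin on"
```

With the interleaved NUMA numbering above (node0: 0,2,4,6; node1: 1,3,5,7), pinning matters, since consecutive logical CPU ids alternate between NUMA nodes.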
I appreciate that a lot is system-dependent and that I can't really help
you help me very much. It should also be noted that my -multi runs are
multiple-walker metadynamics runs, which slow down because there are
large bias potentials in memory that need to be communicated around too.
As I said, I haven't had a chance to make separate benchmark runs; I have
just made observations based on existing runs.
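When the cluster does free up, one cheap way to get comparable numbers without full production runs might be a loop of short jobs. This is only a sketch: the -nsteps, -resethway and -noconfout flags exist in mdrun 5.0, the rank/thread pairs are just combinations for a 108-CPU allocation, and the commands are echoed rather than submitted:

```shell
# Sketch: short benchmark runs over a few rank/thread splits on 108 CPUs.
# Each pair multiplies to 108; commands are echoed, not executed here.
for split in "108 1" "54 2" "27 4"; do
  set -- $split   # $1 = MPI ranks, $2 = OpenMP threads per rank
  echo "mpirun -np $1 mdrun -ntomp $2 -nsteps 5000 -resethway -noconfout -g bench_np${1}_ntomp${2}.log"
done
```

-resethway discards the timings of the first half of the run, so startup and load-balancing costs don't pollute the ns/day figure in each log.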
> Performance tuning is highly dependent on the simulation system and
> the hardware you're running on. Questions like the ones you pose are
> impossible to answer meaningfully without *full* log files (and
> hardware specs including network).
> Have you checked the performance checklist I linked above?
> On Wed, Sep 21, 2016 at 11:36 AM, <jkrieger at mrc-lmb.cam.ac.uk> wrote:
>> I wonder whether my observation that -np 108 and -ntomp 2 is best comes
>> from using -multi 6 with 8-CPU nodes. That level of parallelism may then
>> be necessary to trigger automatic separation of PP and PME ranks. I'm not
>> sure if I tried -np 54 and -ntomp 4, which would probably also do it. I
>> compared mostly on 196 CPUs, then found that going up to 216 was better
>> with -ntomp 2, and that pure MPI (-ntomp 1) was considerably worse for
>> both. Would people recommend going back to 196, which allows 4 whole
>> nodes per replica, and playing with -npme and -ntomp_pme?
>>> Hi Thanh Le,
>>> Assuming all the nodes are the same (9 nodes with 12 CPUs each), then
>>> you could try the following:
>>> mpirun -np 9 --map-by node mdrun -ntomp 12 ...
>>> mpirun -np 18 mdrun -ntomp 6 ...
>>> mpirun -np 54 mdrun -ntomp 2 ...
>>> Which of these works best will depend on your setup.
>>> Using the whole cluster for one job may not be the most efficient way. I
>>> found on our cluster that once I reach 216 CPUs (equivalent settings in
>>> the queuing system to -np 108 and -ntomp 2), I can't do better by adding
>>> more nodes (where presumably communication becomes an issue). In addition
>>> to running -multi or -multidir jobs, which takes the load off
>>> communication a bit, it may also be worth having separate jobs and using
>>> -pin on and -pinoffset.
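The separate-jobs idea above might look like the sketch below for one 8-CPU node; the offsets are assumptions chosen so the two jobs' thread ranges (0-3 and 4-7) do not overlap, and the commands are echoed, not run:

```shell
# Hypothetical: two independent jobs sharing one 8-CPU node without
# stepping on each other's cores via -pin on and -pinoffset.
echo "mdrun -ntomp 4 -pin on -pinoffset 0 -deffnm job1"
echo "mdrun -ntomp 4 -pin on -pinoffset 4 -deffnm job2"
```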
>>> Best wishes
>>>> Hi everyone,
>>>> I have a question concerning running gromacs in parallel. I have read
>>>> up on it, but I still don't quite understand how to run it efficiently.
>>>> My gromacs version is 4.5.4.
>>>> The cluster I am using has 108 CPUs in total and 4 hosts up.
>>>> The node I am using:
>>>> Architecture: x86_64
>>>> CPU op-mode(s): 32-bit, 64-bit
>>>> Byte Order: Little Endian
>>>> CPU(s): 12
>>>> On-line CPU(s) list: 0-11
>>>> Thread(s) per core: 2
>>>> Core(s) per socket: 6
>>>> Socket(s): 1
>>>> NUMA node(s): 1
>>>> Vendor ID: AuthenticAMD
>>>> CPU family: 21
>>>> Model: 2
>>>> Stepping: 0
>>>> CPU MHz: 1400.000
>>>> BogoMIPS: 5200.57
>>>> Virtualization: AMD-V
>>>> L1d cache: 16K
>>>> L1i cache: 64K
>>>> L2 cache: 2048K
>>>> L3 cache: 6144K
>>>> NUMA node0 CPU(s): 0-11
>>>> MPI is already installed. I also have permission to use the cluster as
>>>> much as I can.
>>>> My question is: how should I write my mdrun command to utilize all
>>>> possible cores and nodes?
>>>> Thanh Le
>>>> Gromacs Users mailing list
>>>> * Please search the archive at
>>>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>> * For (un)subscribe requests visit
>>>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>>>> send a mail to gmx-users-request at gromacs.org.
More information about the gromacs.org_gmx-users mailing list