[gmx-users] Hyper-threading Gromacs 5.0.1

Johnny Lu johnny.lu128 at gmail.com
Thu Sep 11 19:24:06 CEST 2014


Hyperthreading seems to give better performance:

1 MPI thread, 12 OpenMP threads.
    command: mdrun_d -deffnm npt2 -ntomp 12 -ntmpi 1 -pin on -pinoffset 0

               Core t (s)   Wall t (s)        (%)
       Time:     6638.388      553.519     1199.3
                 (ns/day)    (hour/ns)
Performance:        3.297        7.280
Finished mdrun on rank 0 Thu Sep 11 06:24:52 2014

24 thread-MPI (tMPI) ranks, 1 OpenMP thread per rank.
    command: mdrun_d -deffnm npt2
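
With no thread-count options given, mdrun chooses the layout itself; on this
machine that presumably expands to the explicit form below (my reconstruction,
not copied from the log):

    command: mdrun_d -deffnm npt2 -ntmpi 24 -ntomp 1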

NOTE: 12.1 % performance was lost because the PME ranks
      had less work to do than the PP ranks.

               Core t (s)   Wall t (s)        (%)
       Time:     7064.222      294.611     2397.8
                 (ns/day)    (hour/ns)
Performance:        4.036        5.947
Finished mdrun on rank 0 Thu Sep 11 06:39:47 2014
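
Given the PME/PP imbalance note above, it may be worth setting the number of
dedicated PME ranks by hand; the count below is only a guess to illustrate the
flag, not a tested value (gmx tune_pme can search for the optimum, though it
generally needs an MPI-enabled build):

    command: mdrun_d -deffnm npt2 -ntmpi 24 -npme 6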

I tried 12 tMPI threads, with 1 OpenMP thread each and with pinning on;
     performance was 3.5 ns/day.
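The command for that run was presumably of this form (reconstructed from the
description above, not copied from a log):

    command: mdrun_d -deffnm npt2 -ntmpi 12 -ntomp 1 -pin on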

I used to compile GROMACS 4.6.5 in single precision with the Intel compiler
and MKL.
This GROMACS 5.0.1 double-precision build was compiled with gcc 4.4.7 because
installing the Intel compiler now requires root access.



On Thu, Sep 11, 2014 at 9:48 AM, Johnny Lu <johnny.lu128 at gmail.com> wrote:

> This mailing list thread talks about it:
> https://www.mail-archive.com/gromacs.org_gmx-users@maillist.sys.kth.se/msg06331.html
>
>
> On Thu, Sep 11, 2014 at 9:45 AM, Johnny Lu <johnny.lu128 at gmail.com> wrote:
>
>> The GROMACS wiki also says that mixing MPI and OpenMP is bad on small
>> computers.
>>
>> On Thu, Sep 11, 2014 at 9:44 AM, Johnny Lu <johnny.lu128 at gmail.com>
>> wrote:
>>
>>> Ah. Thanks a lot.
>>> As suggested by (
>>> https://www.ibm.com/developerworks/community/blogs/brian/entry/linux_show_the_number_of_cpu_cores_on_your_system17?lang=en),
>>>
>>> $ cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
>>> 2
>>> $ cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" \
>>>     | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
>>> 12
>>>
>>> There are 12 real cores.
>>> Typing "top" and then pressing 1 sometimes shows double the number of real
>>> cores, but sometimes doesn't (tested on different machines).
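>>>
>>> If lscpu is available, it reports the same topology in one step. Given the
>>> counts above, it should print something like the following on this machine
>>> (real util-linux field names, but the values are inferred, not captured):
>>>
>>> $ lscpu | egrep "Thread|Core|Socket"
>>> Thread(s) per core:    2
>>> Core(s) per socket:    6
>>> Socket(s):             2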
>>>
>>> How do I run "an MPI rank per core"? Like this: "OMP_NUM_THREADS=12
>>> mdrun" on a 12-core machine?
>>>
>>> I tried OpenMP threads instead of MPI threads because the GROMACS wiki
>>> says OpenMP threads are faster than MPI-based parallelization.
>>>
>>> From the GROMACS wiki (
>>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Multi-level_parallelization.3a_MPI_and_OpenMP
>>> ):
>>>
>>> In GROMACS 4.6 compiled with thread-MPI, OpenMP-only parallelization is
>>> the default with Verlet scheme when using up to 8 cores on AMD platforms
>>> and up to 12 and 16 cores on Intel Nehalem and Sandy Bridge, respectively.
>>> Note that even when running across two CPUs (in different sockets) on Intel
>>> platforms, OpenMP multithreading is, in the majority of the cases,
>>> significantly faster than MPI-based parallelization.
>>>
>>> ...
>>>
>>> Assuming that there are N cores available, the following commands are
>>> equivalent:
>>>
>>> mdrun -ntomp N -ntmpi 1
>>> OMP_NUM_THREADS=N mdrun
>>> mdrun #assuming that N <= 8 on AMD or N <= 12/16 on Intel Nehalem/Sandy Bridge
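>>>
>>> For the hybrid (MPI + OpenMP) case the wiki cautions about, a layout can
>>> still be requested explicitly; on a 12-core box the following should give
>>> 2 ranks with 6 OpenMP threads each (my illustration, untested here):
>>>
>>> mdrun -ntmpi 2 -ntomp 6 -pin on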
>>>
>>
>

