[gmx-users] scalability of Gromacs with MPI

Berk Hess gmx3 at hotmail.com
Tue Jan 24 08:17:14 CET 2006

>From: Jan Thorbecke <janth at xs4all.nl>
>Reply-To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>To: gmx-users at gromacs.org
>Subject: [gmx-users] scalability of Gromacs with MPI
>Date: Mon, 23 Jan 2006 16:00:59 +0100
>
>
>Dear Users,
>
>At this moment I'm working on a benchmark for Gromacs. The benchmark is
>set up to run from 32 to 128 CPUs. The scalability is fine up to 64 CPUs;
>beyond that the code does not scale anymore (see table below). What
>prevents it from scaling are the (ring) communication parts move_x and
>move_f. Together those parts take about 20 s on 128 CPUs.
>
>CPUs  | GROMACS 3.3 + FFTW3 |
>------|---------------------|
>32    |  142 s              |
>64    |   88 s              |
>128   |   70 s              |
>
>
>I have no background in Molecular Dynamics and just look at the code
>from a performance point of view. My questions are:
>
>- Has anybody scaled Gromacs up to more than 64 CPUs? My guess is that,
>inherent to the MD problem solved by Gromacs, there is a limit on the
>number of processors that can be used efficiently. At some point the
>communication of the forces to all other CPUs will dominate the
>wallclock time.
>

No, this is not inherent to the MD problem, but inherent to particle
decomposition (as opposed to domain decomposition).
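
A rough back-of-envelope with the numbers in your table (my arithmetic,
not a measurement): with ideal scaling the 142 s on 32 CPUs would become
142 * 32/128 = 35.5 s on 128 CPUs, but you measure 70 s, and you report
that about 20 s of the difference sits in move_x/move_f alone. With
particle decomposition each node needs coordinates and forces for a
roughly constant fraction of the whole system every step, so that
communication does not shrink as you add CPUs, while the compute part
does; beyond some CPU count the communication therefore dominates. With
domain decomposition a node only needs to talk to its spatial neighbors.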

>- I tried to change the ring communication in move_x and move_f to
>collective communication, but that does not help the scalability. Has
>anybody tried other communication schemes?
>

We tried several, but the ring communication turned out to be the most
efficient, better for instance than the dedicated MPI collective calls.
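
For reference, here is a minimal sketch of the kind of ring scheme being
discussed (a ring all-gather of coordinate blocks, which is roughly what
move_x does). This is not the actual GROMACS code; the function name and
buffer layout are made up for the illustration. The collective
equivalent would be MPI_Allgather, which is essentially what the
comparison above is against.

#include <mpi.h>

/* Pass each node's block of coordinates around the ring until every
   node has all blocks. x holds nprocs blocks of nblock floats; on
   entry only block 'rank' is valid, on exit all blocks are. */
void ring_gather(float x[], int nblock, int rank, int nprocs,
                 MPI_Comm comm)
{
    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;
    int p, cur = rank;
    MPI_Status stat;

    for (p = 0; p < nprocs - 1; p++)
    {
        int next = (cur - 1 + nprocs) % nprocs;
        /* Send the block we received last to the right neighbour,
           receive the next block from the left neighbour. */
        MPI_Sendrecv(&x[cur*nblock],  nblock, MPI_FLOAT, right, 0,
                     &x[next*nblock], nblock, MPI_FLOAT, left,  0,
                     comm, &stat);
        cur = next;
    }
}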

>- Are there options to try with grompp to set up a different domain
>decomposition (for example blocks in x,y,z instead of lines in x) or
>other parallelisation strategies?

No, but we are working on domain decomposition.

There is one point where the communication can be improved, and that is
in gmx_sumf and gmx_sumd. If one replaces the ring in those calls with
MPI_Allreduce, one can get a small performance improvement on many CPUs.
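
As a rough sketch of what that replacement could look like (this is not
the actual gmx_sumf code; the function and buffer names are invented for
the example):

#include <mpi.h>

/* Sum an array of floats over all nodes so that every node ends up
   with the total, using MPI_Allreduce instead of a ring. rbuf is a
   scratch buffer of at least nr floats. */
void sum_floats_allreduce(int nr, float r[], float rbuf[], MPI_Comm comm)
{
    int i;

    MPI_Allreduce(r, rbuf, nr, MPI_FLOAT, MPI_SUM, comm);
    for (i = 0; i < nr; i++)
    {
        r[i] = rbuf[i];
    }
}

Whether this actually wins over the ring depends on the MPI
implementation and the interconnect, so it is worth timing both.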

Berk.