[gmx-users] GROMACS not scaling well with Core4 Quad technology CPUs

Erik Lindahl
Mon May 28 08:57:22 CEST 2007


On May 28, 2007, at 1:59 AM, Trevor Marshall wrote:

> Erik,
> I also have older systems which use Opteron 165 CPUs. I have run  
> tests of the AMD Opteron 165 CPUs (2.18GHz) against the Intel Core2  
> Duos (3GHz). Twelve concurrent AutoDock jobs on each machine show
> the Core2 Duos outperforming the Opterons by a factor of two.

Yes, but are those AutoDock jobs MPI-parallel or just multiple  
independent scalar jobs not communicating between the cores?

Gromacs also provides beautiful performance (close to 100% scaling)  
if you run e.g. 8 independent jobs on a dual quad-core box.
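
To make "close to 100% scaling" concrete, here is a minimal sketch of
how parallel efficiency is commonly computed: measured throughput on N
workers divided by N times the single-worker throughput. The function
name and the sample figures are illustrative assumptions, not
measurements from this thread.

```python
def parallel_efficiency(throughput_one, throughput_n, n_workers):
    """Return throughput on n_workers as a fraction of ideal linear scaling.

    throughput_one: throughput of a single worker (any unit, e.g. Gflops)
    throughput_n:   aggregate throughput of n_workers (same unit)
    """
    if throughput_one <= 0 or n_workers <= 0:
        raise ValueError("throughput_one and n_workers must be positive")
    return throughput_n / (throughput_one * n_workers)


# Hypothetical figures, chosen only to illustrate the calculation.
efficiency = parallel_efficiency(throughput_one=1.0, throughput_n=7.8, n_workers=8)
print(f"parallel efficiency: {efficiency:.1%}")
```

Independent jobs exchange no data, so their efficiency mainly reflects
contention for shared resources rather than communication cost.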

> The data I posted showed inconsistencies which have nothing to do  
> with memory bandwidth, and I was rather hoping for an analysis  
> based upon the manner in which GROMACS mdrun distributes its  
> computing tasks.

Gromacs isn't doing the distribution. That's entirely up to the MPI  
library and the OS.

> I don't believe my data shows memory bandwidth-limiting effects.  
> For example, three 'local' CPUs on the quad core are faster  
> (6.65Gflops) than one of the Quads (5.02 Gflops) and two from the  
> cluster. How does that support the memory bandwidth hypothesis?

As far as I understand, you're using gigabit Ethernet. Even with GAMMA,
that's going to have much higher latency and lower bandwidth than the
shared-memory communication on a quad-core machine.
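
The latency/bandwidth argument can be sketched with the usual
first-order cost model, time = latency + size / bandwidth. This is only
a rough illustration: the 125e6 bytes/s figure is the theoretical peak
of a 1 Gbit/s link, while the latencies and the shared-memory bandwidth
are placeholder assumptions, not measurements of GAMMA or of any
machine discussed here.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate the time to send one message: fixed latency plus size / bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_bytes / bandwidth_bytes_per_s


# 1 Gbit/s is at most 125e6 bytes/s; every other number is an assumed placeholder.
ETHERNET = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 125e6}
SHARED_MEMORY = {"latency_s": 1e-6, "bandwidth_bytes_per_s": 1e9}

for size in (1_000, 100_000, 10_000_000):
    t_eth = transfer_time(size, **ETHERNET)
    t_shm = transfer_time(size, **SHARED_MEMORY)
    print(f"{size:>10} bytes: ethernet {t_eth:.2e} s, shared memory {t_shm:.2e} s")
```

For small messages the latency term dominates, which is why frequent
small exchanges suffer most when they cross the network.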

> I figured that it might be possible that the GAMMA MP software is  
> causing overhead, but when I examined the distribution of tasks by  
> GROMACS (in the log I provided) it would seem that the tasks which  
> mdrun distributed to GAMMA actually were distributed well, but that
> the manner in which CPU0 hogged most of the mdrun calculations
> might be a bottleneck. It was insight into GROMACS' mdrun  
> distribution methodology which I was seeking. Is there any  
> quantitative data available for me to review?

If you're interested in comparing the scaling performance of quad-core
machines with other hardware, I would start with the benchmarks on the
GROMACS website.

If it's about getting the highest possible performance, you could
either play with the "-load" option to grompp, or check out the CVS
development tree, which has full domain decomposition and dynamic load
balancing implemented (warning: there could still be bugs).



