[gmx-users] Performance of Gromacs-4.6.1 on BlueGene/Q

Mark Abraham mark.j.abraham at gmail.com
Tue Jun 4 17:48:16 CEST 2013


On Tue, Jun 4, 2013 at 4:50 PM, Jianguo Li <ljggmx at yahoo.com.sg> wrote:

>
>
> Thank you, Mark and Xavier.
>
> The thing is that the cluster administrators set the minimum number of
> cores per job on BlueGene/Q to 128, so I cannot use 64 cores. But judging
> by the performance, 512 cores on BlueGene/Q are roughly equivalent to 64
> cores on another cluster. Since there are 16 cores on each compute card,
> the total number of cores I use on BlueGene/Q is num_cards times 16. So in
> my tests I actually ran simulations with different numbers of cards, from
> 8 to 256.
>
> The following is the script I submitted to BlueGene/Q using 128 compute
> cards:
>
> #!/bin/sh
> #SBATCH --nodes=128
> # Use 128 compute cards (1 compute card = 16 cores, 128 x 16 = 2048 cores)
> #SBATCH --job-name="128x16x2"
> # Job name
> #SBATCH --output="first-job-sample"
> # Output file
> #SBATCH --partition="training"
>
> srun --ntasks-per-node=32 --overcommit \
>     /scratch/home/biilijg/package/gromacs-461/bin/mdrun -s box_md1.tpr \
>     -c box_md1.gro -x box_md1.xtc -g md1.log >& job_md1
>
> Since BlueGene/Q accepts up to 4 tasks per core, I used 32 MPI tasks per
> card (2 tasks per core). I tried --ntasks-per-node=64, but the simulations
> got much slower. Is there an optimal number for --ntasks-per-node?
>

Running multiple hardware threads per core will surely be useless for
GROMACS. Even our unoptimized kernels saturate the available flops, so there
is nothing to overlap and the extra overhead is a net loss. You should aim
for 16 threads per node, one for each A2 core. Each of those 16 threads need
not be an MPI process, however.
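
For example, with your 128 nodes you could keep 16 threads per node but
split them between MPI ranks and OpenMP threads. Here is a minimal sketch of
such a job script, reusing the file names from your script above; the
4 ranks x 4 threads split is only an assumed starting point, not a
recommendation, and it assumes your SLURM installation supports
--cpus-per-task in the usual way:

#!/bin/sh
#SBATCH --nodes=128
#SBATCH --job-name="128x4x4"
#SBATCH --output="md1-4x4"
#SBATCH --partition="training"

# 4 MPI ranks per node x 4 OpenMP threads per rank = 16 threads per node,
# i.e. one thread per A2 core, with no hardware-thread overcommit.
export OMP_NUM_THREADS=4
srun --ntasks-per-node=4 --cpus-per-task=4 \
    /scratch/home/biilijg/package/gromacs-461/bin/mdrun -ntomp 4 \
    -s box_md1.tpr -c box_md1.gro -x box_md1.xtc -g md1.log > job_md1 2>&1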

There's some general background information here:
http://www.gromacs.org/Documentation/Acceleration_and_parallelization
What is relevant to BG/Q is that you will be using real MPI and should use
OpenMP and the Verlet kernels (see
http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Multi-level_parallelization.3a_MPI.2fthread-MPI_.2b_OpenMP).
Finding the right balance of OpenMP threads per MPI process is hardware-
and problem-dependent, so you will need to experiment there.
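
One way to do that experiment is a short benchmarking sweep over the
possible rank/thread splits while keeping 16 threads per node, along these
lines (a sketch only: it assumes the same box_md1.tpr as above, generated
with cutoff-scheme = Verlet in the .mdp file, and uses -maxh and -resethway
just to keep each test run short and exclude start-up cost from the
timings):

# Sweep MPI ranks per node vs. OpenMP threads per rank at 16 threads/node.
for ranks in 16 8 4 2 1; do
    threads=$((16 / ranks))
    export OMP_NUM_THREADS=${threads}
    srun --ntasks-per-node=${ranks} --cpus-per-task=${threads} \
        /scratch/home/biilijg/package/gromacs-461/bin/mdrun -ntomp ${threads} \
        -s box_md1.tpr -deffnm bench_${ranks}x${threads} -maxh 0.1 -resethway
done

Then compare the ns/day reported at the end of each bench_*.log file.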

Mark
