[gmx-users] Performance of Gromacs-4.6.1 on BlueGene/Q
mark.j.abraham at gmail.com
Tue Jun 4 22:06:53 CEST 2013
On Tue, Jun 4, 2013 at 5:48 PM, Mark Abraham <mark.j.abraham at gmail.com>wrote:
> On Tue, Jun 4, 2013 at 4:50 PM, Jianguo Li <ljggmx at yahoo.com.sg> wrote:
>> Thank you, Mark and Xavier.
>> The thing is that the cluster manager set the
>> minimum number of cores for each job on BlueGene/Q to 128, so I cannot
>> use 64 cores. But according to the performance, 512 cores on BlueGene/Q are
>> roughly equivalent to 64 cores on another cluster. Since there are 16
>> cores on each compute card, the total number of cores I used on
>> BlueGene/Q is num_cards times 16. So in my test, I actually ran
>> simulations using different numbers of cards, from 8 to 256.
>> The following is the script I submitted to BlueGene/Q using 128
>> compute cards:
>> #SBATCH --nodes=128
>> # set: use 128 compute cards (1x compute card = 16 cores, 128x16 = 2048 cores)
>> #SBATCH --job-name="128x16x2"
>> # set: job name
>> #SBATCH --output="first-job-sample"
>> # set: output file
>> #SBATCH --partition="training"
>> --ntasks-per-node=32 --overcommit /scratch/home/biilijg/package/gromacs-461/bin/mdrun -s box_md1.tpr -c box_md1.gro -x box_md1.xtc -g md1.log >& job_md1
>> Since BlueGene/Q accepts up to 4 tasks per
>> core, I used 32 MPI tasks for each card (2 tasks per core). I tried
>> --ntasks-per-node=64, but the simulations got much slower.
>> Is there an optimal number for --ntasks-per-node?
> The threads-per-core thing will surely be useless for GROMACS. Even our
> unoptimized kernels will saturate the available flops. There is simply
> nothing to overlap, so the extra overhead is a net loss. You should aim
> for 16 threads per node, one for each A2 core. Each of those 16 need not
> be an MPI process, however.
> There's some general background info here. Relevant to BG/Q is that you
> will be using real MPI and should use OpenMP and the Verlet kernels.
> Finding the right balance of OpenMP threads per MPI process is hardware-
> and problem-dependent, so you will need to experiment there.
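To make that concrete, here is a sketch of the kind of thing to try. The 4x4
split of MPI ranks and OpenMP threads is only illustrative, and I am assuming
srun is the MPI launcher since you are using SLURM; substitute whatever your
site actually provides. You also need cutoff-scheme = Verlet in the .mdp to
get the Verlet kernels.

#SBATCH --nodes=128
#SBATCH --ntasks-per-node=4
# 4 MPI ranks per node x 4 OpenMP threads each = 16 threads per node,
# one per A2 core; splits like 8x2 and 16x1 are also worth trying
export OMP_NUM_THREADS=4
srun /scratch/home/biilijg/package/gromacs-461/bin/mdrun -ntomp 4 \
    -s box_md1.tpr -c box_md1.gro -x box_md1.xtc -g md1.log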
Thought I'd clarify further. A BG/Q node has 16 A2 cores. Some mix of MPI
and OpenMP threads across those will be right for GROMACS. Each core is
capable of running up to four "hardware threads." The processor in each core
can issue only two instructions per cycle, one floating-point and one
non-floating-point, and those two must come from two different hardware
threads. There is a theoretical speedup
from using more than one hardware thread, since you get to take advantage
of more instruction-issue opportunities. But doing so with more MPI
processes will incur other overhead (e.g. from PME global communication, as
well as pure-MPI overhead). Even if you can map the extra hardware threads
to OpenMP threads, you will only be able to get some fraction of the
speedup depending on available registers and bandwidth from cache (and you
still pay some extra overhead for the OpenMP). How big these effects are
depends on whether you are running PME, and which of the kernels you are
actually executing. So it might be worth investigating 2 hardware threads
per core using OpenMP, but don't expect the result to be anything to write
home about.
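One way to check whether the second hardware thread buys you anything is a
small sweep like the following. This is only a rough sketch: the particular
rank/thread combinations are illustrative, and launching several srun steps
inside one allocation is an assumption about your site; you may need to
submit separate jobs instead. Compare the ns/day reported at the end of each
log file.

# 16x1 and 8x2 use 1 hardware thread per core; 16x2 uses 2 per core
for cfg in "16 1" "8 2" "16 2"; do
    set -- $cfg
    ranks=$1; threads=$2
    export OMP_NUM_THREADS=$threads
    srun --ntasks-per-node=$ranks \
        /scratch/home/biilijg/package/gromacs-461/bin/mdrun -ntomp $threads \
        -s box_md1.tpr -g md1_${ranks}x${threads}.log
done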