[gmx-users] Re: How to use multiple nodes, each with 2 CPUs and 3 GPUs

Thu Apr 25 17:19:34 CEST 2013

Dear Szilárd:

Thank you for your assistance. I understand the importance of reading the documentation and I read it about 5 times before I posted to this list. In fact, it's kind of buried in my initial post, but I did run MPI gromacs with mpirun -np 3 the first time and it didn't work.

I have finally realized that my problems were caused by a PBS resource request. Previously, I was using
#PBS -l walltime=00:10:00,nodes=2:ppn=12:gpus=3:shared
I chose that because it was what was recommended on the webpage of the new cluster that I am using.

However, I can only get things to work when I use ppn=3 instead:
#PBS -l walltime=00:10:00,nodes=2:ppn=3:gpus=3:shared

That surprises me, because I thought that the mpirun -np option would take care of things. In any event, this has solved my general problem, which was that I could not get it running across multiple nodes at all. Now that this works, I will take a look at your great suggestions about ranks, which I hadn't had the chance to explore because my #PBS ppn setting was stopping these runs from ever getting going.
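
One possible explanation, which is an assumption and is not confirmed anywhere in this thread: MPI launchers built with Torque/PBS support often take their slot list from the file named by $PBS_NODEFILE, which contains one line per granted ppn slot, so the ppn value can decide where ranks are placed regardless of -np. A minimal sketch for inspecting that file (the function name slots_per_host is hypothetical):

```shell
# Sketch: summarize how many MPI slots each host contributes in a
# Torque/PBS-style nodefile (one line per granted slot).
slots_per_host() {
    # $1: path to the nodefile (inside a job this would be "$PBS_NODEFILE")
    sort "$1" | uniq -c | awk '{print $2 " " $1}'
}
```

Inside a job, slots_per_host "$PBS_NODEFILE" prints one "host count" line per node. With nodes=2:ppn=3 one would expect a count of 3 for each of the two hosts.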

Thank you again for your help.

Chris.

-- original message --

Hi,

You should really check out the documentation on how to use mdrun 4.6:
http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Running_simulations

Brief summary: when running on GPUs every domain is assigned to a set
of CPU cores and a GPU, hence you need to start as many PP MPI ranks
per node as the number of GPUs (or pass a PP-GPU mapping manually).
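
As a rough sketch of that rule, the rank count and a manual mapping string can be derived from the node and GPU counts. The helper names below are hypothetical, and the sketch assumes the mapping is passed with mdrun's -gpu_id option as a string of single-digit GPU ids (so fewer than 10 GPUs per node):

```shell
# Sketch: derive the PP rank count and a GPU id string from the hardware layout.
gpu_id_string() {
    # $1: GPUs per node; prints ids 0..N-1 concatenated, e.g. 3 -> 012
    ids=""
    i=0
    while [ "$i" -lt "$1" ]; do
        ids="${ids}${i}"
        i=$((i + 1))
    done
    printf '%s\n' "$ids"
}

total_pp_ranks() {
    # $1: number of nodes, $2: GPUs per node (one PP rank per GPU)
    echo $(( $1 * $2 ))
}
```

For two nodes with three GPUs each, total_pp_ranks 2 3 gives the value for mpirun -np and gpu_id_string 3 gives the per-node mapping string.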

Now, there are some slight complications with the inconvenient
hardware setup of the machines you are using. When the number of cores
is not divisible by the number of GPUs, you'll end up wasting cores.
In your case only 3*5=15 cores per compute node will be used. What
makes things even worse is that, unless you use "-pin on" (which is
the default behavior *only* if you use all cores in a node), mdrun
will not lock threads to cores and will let the OS move them around,
which can cause severe performance degradation.

However, you can actually work around these issues and get good
performance by using separate PME ranks. You can just try using 3 PP +
1 PME per compute node with four OpenMP threads each, i.e.:
mpirun -np 4*Nnodes mdrun_mpi -npme 1 -ntomp 4
If you are lucky with the PP/PME load, this should work well and even
if you get some PP-PME imbalance, this should hurt performance way
less than the inconvenient 3x5 threads setup.
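
The core arithmetic behind the two layouts can be checked with a small sketch. It assumes 16 cores per node, which is inferred from the 3*5=15 and 4x4 figures rather than stated in the thread, and the function name cores_used is hypothetical:

```shell
# Sketch: cores actually used when ranks split a node's cores evenly.
cores_used() {
    # $1: cores per node (16 is an assumption), $2: MPI ranks per node
    threads=$(( $1 / $2 ))   # integer division: leftover cores stay idle
    echo "$2 ranks x $threads threads = $(( $2 * threads )) of $1 cores"
}

cores_used 16 3   # 3 PP ranks only
cores_used 16 4   # 3 PP + 1 PME ranks
```

Under that assumption the first call reports 15 of 16 cores and the second reports 16 of 16, which is why the 3 PP + 1 PME layout leaves no core idle.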

Cheers,
-- 
Szilárd