[gmx-users] Re: MPI tips

David Mathog mathog at caltech.edu
Tue Jan 31 18:07:37 CET 2006


> As already noted, it is due to the fact that you have very short jobs
> here, and most of the time you are seeing is spent setting up the jobs.
> Doing a scaling check of something that takes only 60 seconds on one CPU
> makes very little sense.

The part that disturbs me most is not so much that it takes longer, but
that the bottleneck is not obvious.  I've seen "too many small jobs
on compute nodes" before, and it doesn't look like this.  When that
happens the master node has one or more processes sucking up CPU
time, typically the NFS server or the master program.  Here the CPU
time goes down on the compute nodes with no compensating increase on
the master node, and it wasn't clear to me which aspect of this
process was rate limiting.
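For what it's worth, the kind of instrumentation I'd like to see is
something like the toy master/worker program below: time the blocking
receives separately from the compute so each rank can report how much
of the wall clock it spent waiting for work.  This is just a generic
sketch (the chunk size, tags, and fake workload are made up), not a
claim about how mdrun is structured internally:

    #include <mpi.h>
    #include <stdio.h>

    #define NCHUNK   50
    #define CHUNKLEN 1024

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {                       /* need a master + workers */
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {                      /* master: dole out chunks */
            double chunk[CHUNKLEN] = {0};
            int i, w;
            for (i = 0; i < NCHUNK; i++)
                MPI_Send(chunk, CHUNKLEN, MPI_DOUBLE,
                         1 + i % (size - 1), 1, MPI_COMM_WORLD);
            for (w = 1; w < size; w++)        /* tag 0 = no more work */
                MPI_Send(chunk, CHUNKLEN, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
        } else {                              /* worker: time wait vs work */
            double chunk[CHUNKLEN];
            double t_wait = 0, t_comp = 0, t0;
            MPI_Status st;
            for (;;) {
                t0 = MPI_Wtime();
                MPI_Recv(chunk, CHUNKLEN, MPI_DOUBLE, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                t_wait += MPI_Wtime() - t0;
                if (st.MPI_TAG == 0)
                    break;
                t0 = MPI_Wtime();
                {                             /* stand-in for real work */
                    volatile double s = 0;
                    long k;
                    for (k = 0; k < 2000000; k++)
                        s += k * 1e-9;
                }
                t_comp += MPI_Wtime() - t0;
            }
            printf("rank %d: waited %.2f s, computed %.2f s\n",
                   rank, t_wait, t_comp);
        }
        MPI_Finalize();
        return 0;
    }

If the "waited" numbers dwarf the "computed" numbers on the workers
while the master is also mostly idle, that would pin the blame on
synchronization/latency rather than on any one overloaded process.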

The drop-off in CPU time on the compute nodes is not the "short jobs
don't show up well in top" effect.  mdrun starts a process on each
compute node and then feeds that process work as needed.  For the 3rd
mdrun in the test script the single-node time was 46.4 s and the
20-node time was 452 s.  For all 452 s a single PID on each compute
node is associated with mdrun_mpi, and it crawls along using very
little CPU.
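
A back-of-the-envelope check makes the same point.  If the total
useful compute really is about the single-node 46.4 s, then spread
over 20 workers each one has only

\[
\frac{46.4\,\mathrm{s}/20}{452\,\mathrm{s}}
\;\approx\; \frac{2.3\,\mathrm{s}}{452\,\mathrm{s}}
\;\approx\; 0.5\%
\]

of the wall clock's worth of work to do, which is consistent with the
near-idle mdrun_mpi PIDs.  (That assumes the parallel run does
essentially the same total computation as the serial one, which may
not be exact.)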

Running this last step with the non-MPI version takes about as long
as the 1- or 2-node MPI version.  It then slows down dramatically by
4 nodes, and at 20 nodes it takes about 1.8x longer still.  It feels
to me like it spends more and more time waiting for all jobs to
finish before doling out the next chunk, so the compute nodes AND the
master node spend far too much time in non-CPU-consuming wait states.
In other words, it feels like an unbalanced-load problem.

That's just a guess, though.
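
To convince myself the guess is at least mechanically plausible,
here's a toy MPI program (again, not mdrun's actual structure) in
which every step ends with an all-ranks synchronization and the
per-rank work is deliberately uneven.  Every rank then idles until
the slowest one finishes, and the idle fraction grows as ranks are
added:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, step;
        double t_work = 0, t_idle = 0, t0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        srand(rank + 1);                       /* different load per rank */

        for (step = 0; step < 100; step++) {
            volatile double s = 0;
            long k, n = 1000000 + rand() % 1000000;  /* uneven chunk */
            t0 = MPI_Wtime();
            for (k = 0; k < n; k++)
                s += k * 1e-9;
            t_work += MPI_Wtime() - t0;
            t0 = MPI_Wtime();
            MPI_Barrier(MPI_COMM_WORLD);       /* wait for the slowest */
            t_idle += MPI_Wtime() - t0;
        }
        printf("rank %d: %.2f s working, %.2f s idle at barriers\n",
               rank, t_work, t_idle);
        MPI_Finalize();
        return 0;
    }

On the real cluster the barrier would be whatever communication ends
an MD step, but the symptom is the same: lots of processes alive,
none of them consuming much CPU.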

> 
> You will see better scaling the longer the job runs.  A larger
> simulated system also improves the scaling, i.e. there are larger
> chunks of the simulation to distribute.  Check the archives of this
> mailing list; the various scaling effects have been discussed in
> depth, and I wouldn't be surprised if they were covered in the
> manual as well.
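
That squares with a crude timing model.  If T_1 is the single-CPU
compute time and each node adds a roughly fixed communication and
synchronization overhead per run (a guess on my part; the true
N-dependence could be different), then something like

\[
T(N) \;\approx\; T_{\mathrm{setup}} + \frac{T_1}{N} + c\,N
\]

applies.  For a job with T_1 around 46 s, the T_1/N term becomes
negligible almost immediately and the overhead terms dominate,
whereas a longer run or a larger system inflates T_1 and pushes the
crossover out to more nodes.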

Is there a searchable archive of this mailing list somewhere?
Trawling through postings by hand is an awfully slow way to find anything.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


