[gmx-developers] Possible impact on load balancing

Matthieu.Dreher at imag.fr Matthieu.Dreher at imag.fr
Wed May 2 14:10:16 CEST 2012


Hi all,

I am currently working on the integration of Gromacs 4.5 with a
middleware (FlowVR) designed to enable modular programming in a
distributed context. In our application, Gromacs is a module which
produces data (the positions of the atoms) for the other modules.

To do so, we added two functionalities. The first is the construction,
on each node, of a message which includes the positions of the local
atoms, and the sending of that message over the network. The second,
the wait() function, is a function required by the middleware we use
and is the first function called in the while(!bLastStep) loop.

To summarize, we have something like this:
while (!bLastStep) {
      wait();               // Mandatory for FlowVR
      ...                   // First part of a Gromacs step
      /*********
      *  output sections (write_traj, etc.)
      **********/
      build_and_send_pos(); // Copy the home atoms into a buffer
                            // and send it over the network
      ...                   // Second part of a Gromacs step
}

In our first observations, we found that building and sending the
messages has a very low and stable cost, but the wait() function has a
more erratic cost: it can vary by a factor of 10 between iterations
and can represent up to 10% of the computation time of a step.

We tested our system with several large molecular systems (100K to
1.7M atoms), with and without our middleware, to evaluate its cost. We
found that with a small number of cores (~50), the performance
difference between Gromacs with and without the middleware can be
explained by the two functionalities we introduced into Gromacs. But
as the number of cores increases, the impact on performance becomes
severe (-30% with ~100 cores and -50% with ~400 cores). Our cluster
consists of machines with 2x quad-core Xeons, connected by InfiniBand.

When visualizing the traces, we saw that the wait() function becomes
very unstable and can cost 10% of an iteration's time. Yet this alone
is not enough to explain the severe performance loss.
I do have a hypothesis that could explain at least part of the
performance drop, and I would like to know whether you think it is
plausible.
According to the Gromacs 4 paper, the load balancing mechanism is
based on timings. I was wondering whether Gromacs could somehow
include the wait() operation in its timings. Since the cost of wait()
can vary greatly across nodes within the same iteration, Gromacs might
interpret this as a load imbalance and try to correct it by
redistributing atoms. That would be counterproductive, because the
imbalance is not related to the atom distribution, and could cause
further imbalance, leading to a snowball effect.

I'm probably not very clear on some points; feel free to ask for more
details.

Thank you for your time.

Dreher Matthieu
