[gmx-developers] Possible impact on load balancing
Matthieu.Dreher at imag.fr
Wed May 2 14:10:16 CEST 2012
Hi all,
I am currently working on the integration of Gromacs 4.5 with a
middleware (FlowVR) designed to enable modular programming in a
distributed context. In our application, Gromacs is a module which
produces data (the positions of the atoms) for the other modules.
To do so, we added two functionalities. The first is the construction,
on each node, of a message which includes the positions of the local
atoms, and its sending over the network. The second (the wait()
function) is a mandatory function of the middleware we use and is the
first function called in the while(!bLastStep) loop.
To summarize, we have something like this:

while (!bLastStep) {
    wait();                /* Mandatory for FlowVR */
    ...                    /* First steps of Gromacs */
    /*********
     * output sections (write_traj, etc.)
     *********/
    build_and_send_pos();  /* Copy the home atoms into a buffer
                              and send it over the network */
    ...                    /* Second steps of Gromacs */
}
In our first observations, we found that building and sending the
messages has a very low and stable cost, but the wait() function has a
more "random" cost, at least in time. Its cost can vary by a factor of
10 from one iteration to another and can represent up to 10% of the
computation time of a step.
We tested our system with several large molecular systems (100K to
1.7M atoms), with and without our middleware, to evaluate its cost. We
found that with a small number of cores (~50), the performance
difference between Gromacs with and without the middleware can be
explained by the two functionalities we introduced into Gromacs. But
as the number of cores increases, the impact on performance becomes
very significant (-30% with ~100 cores and -50% with ~400 cores). Our
cluster is composed of machines with two quad-core Xeons each,
connected by InfiniBand. When visualizing the traces, we saw that the
wait() function becomes very unstable and can cost 10% of an
iteration's time. Yet this alone is not sufficient to explain such a
large performance loss.
I do have a hypothesis that could explain at least part of the
performance drop, and I would like to know whether you think it is
plausible.
According to the Gromacs 4 article, the dynamic load balancing
mechanism is based on timings. I was wondering whether Gromacs could
somehow see the wait() operation in its timings. Since the cost of
wait() can vary greatly between nodes within the same iteration,
Gromacs might interpret this as a load imbalance and try to correct it
by redistributing atoms. This would not help, because the imbalance is
not atom-related, and it could in turn cause real imbalance, and so on
in a snowball effect.
I'm probably not very clear on some points; feel free to ask for more
details.
Thank you for your time.
Dreher Matthieu