[gmx-developers] How to distribute charges over parallel nodes
Mark.Abraham at anu.edu.au
Fri May 6 08:32:03 CEST 2011
On 4/05/2011 7:57 PM, Igor Leontyev wrote:
> Thank you for prompt response.
>>> To make partial charges adjustable according to the acting field, I
>>> have introduced modifications to gromacs 4.0.7. The serial (single
>>> thread) version seems to be ready and I want to implement
>>> parallelization (with particle decomposition). In my current
>>> implementation:
>>> - values of mdatoms->chargeA for local atoms are updated in "do_md"
>>> at the beginning of each timestep;
>>> - 'MPI_Sendrecv' + 'gmx_wait' are used in "do_force" (right after
>>> the call to "move_cgcm") to distribute the new charges over the
>>> parallel nodes.
>>> After this the array mdatoms->chargeA has updated values on all
>>> nodes. But a problem then arises in "gmx_pme_do" (a routine I have
>>> not modified), hanging the execution and even the PC.
>> Standard procedure is to use a debugger to see which memory access
>> from where is problematic. I'm not aware of a free parallel debugger,
>> however. Bisecting with printf() calls can work...
> I debug parallel gromacs with GDB + DDD. The problem, however, appears
> irregularly somewhere in gmx_pme_do, so I cannot locate the
> problematic line precisely. More specifically, the program executes
> fine when stepping through in the debugger, but it may hang in a
> normal run.
That suggests a memory problem. Take your serial code and pass it
through tools like valgrind. Then try the parallel version, etc. Or use
a real memory debugger like MemoryScape from TotalView.
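For the serial build, a minimal sketch of the valgrind run suggested above (the binary name, input file, and mdrun options are assumptions, not confirmed for 4.0.7):

```shell
# Run the serial mdrun under valgrind's memcheck to catch invalid
# reads/writes and use of uninitialised values near the new charge code.
valgrind --leak-check=full --track-origins=yes ./mdrun -s topol.tpr
```

--track-origins=yes is slow but reports where an uninitialised value was created, which is often exactly the missing update in newly added code.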
>>> Is it possible that the source of the problem is the use of
>>> 'MPI_Sendrecv' + 'gmx_wait' in the wrong place in the code?
>> I doubt it.
>>> Many communications are performed in "gmx_pme_do", e.g. "pmeredist"
>>> calls 'MPI_Alltoallv' for charge and coordinate redistribution over
>>> the nodes.
>>> Is there a particular reason in the gromacs code why some
>>> communications are done with 'MPI_Sendrecv' but others with
>>> 'MPI_Alltoallv'? What is the right way (or the right MPI routine)
>>> to distribute the locally updated charges over all the nodes?
>> Various parts of the code date from times when different parts of the
>> MPI standard had implementations of varying quality,
>> and some parts are throwbacks (I gather) to the way very early
>> versions of GROMACS were designed to communicate on a parallel
>> machine with ring topology.
>> These days, we should use the collective communication calls rather
>> than introduce maintenance issues by re-implementing wheels.
> I am not very experienced with MPI. Could you be more specific about
> which routines are the modern ones?
Collective communication is not "modern", but modern implementations are
of high enough quality to suggest their use.
> As for examples, which gmx routines use these modern communication calls?
Not enough of them. :-) This should get cleaned up in the C++ switch.
Unfortunately, it's not enough just to survey which MPI functions are
called. The current replica-exchange code uses collective communication,
but does so in a way that is not very scalable. The last time I remember
looking, the data structures built from the .tpr on the master node were
passed around a ring, with a separate MPI call for each of many pieces
of the data structures. That should be done with a packed MPI data type
and a broadcast, but since it's only done once it's not a big deal...
>> I can't help with clues on how a PD simulation should distribute such
>> information, except that there must be a mapping somewhere of
>> simulation atom to MPI rank that distributed the data in mdatoms
>> shortly after it was constructed from the .tpr file.
> I believe that is taken care of in my modifications.