[gmx-developers] Following forces with domain decomposition

Tue Jun 9 19:13:51 CEST 2020

Hi All,

I'm trying (once again) to get back into figuring out the lingering bugs 
with the Drude implementation when using domain decomposition. Since I 
last asked for help, I have gotten coordinate and velocity communication 
working properly. Now, I'm stuck on forces. To quickly recap the issue, 
a Drude and its parent atom can end up in different domains. This 
requires communication of coordinates, velocities, and forces by 
treating Drudes as "special atoms," as is done for virtual sites. As 
such, my implementation largely follows what happens for the virtual 
sites (communicate after any update).

I have been tracing the forces at every step of do_force, basically 
printing out the force on a Drude that I know is in a different domain 
from its parent atom. I use the OpenMP output as the reference. Following 
Berk's suggestion from a long time ago, I can reproduce the OpenMP 
forces with domain decomposition when no communication between domains 
is needed (e.g. gmx mdrun -ntmpi 2 -npme 1 -deffnm md -nb cpu). So the 
issue I'm having must be coming from the communication somewhere, but I 
can't nail it down. Here is an example of the output I'm looking at.

First, from OpenMP (my reference, the correct output):

=== Step 0 ===
DO FORCE: top f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 1271.383667 -3106.622803 2148.540283
DO FORCE: after nbnxn_atomdata_add_nbat_fshift_to_fshift f[54] = 1271.383667 -3106.622803 2148.540283
DO FORCE: after do_force_lowlevel f[54] = 82.651733 130.833740 82.218506
DO FORCE: b4 move_f f[54] = 82.651733 130.833740 82.218506
DO FORCE: after move_f f[54] = 82.651733 130.833740 82.218506
DO FORCE: after GPU use/emulate f[54] = 82.651733 130.833740 82.218506
DO FORCE: after vsite_spread f[54] = 82.651733 130.833740 82.218506
DO FORCE: b4 post f[54] = 82.651733 130.833740 82.218506
DO FORCE: end f[54] = 58.264297 16.147758 43.956337
=== Step 1 ===
DO FORCE: top f[54] = 58.264297 16.147758 43.956337
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 1205.647705 -3128.451904 2138.944580
DO FORCE: after nbnxn_atomdata_add_nbat_fshift_to_fshift f[54] = 1205.647705 -3128.451904 2138.944580
DO FORCE: after do_force_lowlevel f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: b4 move_f f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: after move_f f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: after GPU use/emulate f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: after vsite_spread f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: b4 post f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: end f[54] = 162.370026 -306.717041 -321.102356

Now, my implementation with domain decomposition:

=== Step 0 ===
DO FORCE: top f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 338.912842 -2940.618164 2357.080078
DO FORCE: after do_force_lowlevel f[54] = 1899.546387 -1663.452881 1703.655273
DO FORCE: b4 move_f f[54] = 1899.546387 -1663.452881 1703.655273
DO FORCE: after move_f f[54] = 82.647949 130.835449 82.213165
DO FORCE: after GPU use/emulate f[54] = 82.647949 130.835449 82.213165
DO FORCE: after vsite_spread f[54] = 82.647949 130.835449 82.213165
DO FORCE: b4 post f[54] = 82.647949 130.835449 82.213165
DO FORCE: end f[54] = 58.260483 16.149330 43.951458
=== Step 1 ===
DO FORCE: top f[54] = 58.260483 16.149330 43.951458
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 265.444092 -2965.024170 2346.120117
DO FORCE: after do_force_lowlevel f[54] = 1834.273926 -1685.225830 1654.119141
DO FORCE: b4 move_f f[54] = 1834.273926 -1685.225830 1654.119141
DO FORCE: after move_f f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: after GPU use/emulate f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: after vsite_spread f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: b4 post f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: end f[54] = 229.446487 -248.274734 -255.144485

From this output, I can see that communication works in step 0 and 
between steps 0 and 1, since the force at the end of step 0 is correctly 
carried over to the top of step 1. In step 1, however, the force after 
"move_f" no longer matches the OpenMP reference. I do not know to what 
extent I can expect forces to match before the "move_f" step (which is 
where I communicate non-local Drude forces, and which follows the 
existing "dd_move_f" in do_force_cutsVERLET). But the forces should 
certainly be the same after communicating, so that they are correctly 
input to post_process_forces.
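
To make "the same" concrete when comparing the two traces, a small 
stand-alone check along these lines could be used. This is only a 
sketch and not part of GROMACS: the names Vec3, maxRelDiff and 
forcesMatch, the relative tolerance of 1e-3, and the absolute floor of 
1.0 are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical helper type, not a GROMACS type.
using Vec3 = std::array<double, 3>;

// Largest per-component relative difference between two force vectors.
// absFloor (an assumed value) keeps near-zero components from
// inflating the ratio.
double maxRelDiff(const Vec3& ref, const Vec3& test, double absFloor = 1.0)
{
    double worst = 0.0;
    for (int d = 0; d < 3; ++d)
    {
        const double scale =
                std::max({ std::fabs(ref[d]), std::fabs(test[d]), absFloor });
        worst = std::max(worst, std::fabs(ref[d] - test[d]) / scale);
    }
    return worst;
}

// True when every component agrees within the assumed relative tolerance.
bool forcesMatch(const Vec3& ref, const Vec3& test, double tol = 1e-3)
{
    return maxRelDiff(ref, test) <= tol;
}
```

Applied to the "after move_f" values above, the step 0 forces 
(82.651733 130.833740 82.218506 versus 82.647949 130.835449 82.213165) 
agree under this tolerance, while the step 1 forces (200.794189 
-175.644287 -279.924072 versus 258.300781 -122.286865 -219.277039) do 
not.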

Can anyone suggest how the code paths might differ between these two 
steps? I've debugged every stage along the way that I can figure out, 
and all I can establish is that the forces end up different. I know that 
may be a big request without seeing the code, but I'm simply determining 
non-local Drudes the same way we do with vsites, and communicating their 
forces with the existing dd_move_f_specat function that vsites also use.

Any help would be greatly appreciated. I've been stuck on this forever 
and it is clear that our user community really wants this feature. I can 
give them OpenMP easily, but that's rather restrictive...

-Justin

-- 
==================================================

Justin A. Lemkul, Ph.D.
Assistant Professor
Office: 301 Fralin Hall
Lab: 303 Engel Hall

Virginia Tech Department of Biochemistry
340 West Campus Dr.
Blacksburg, VA 24061

jalemkul at vt.edu | (540) 231-3129
http://www.thelemkullab.com

==================================================