[gmx-developers] Following forces with domain decomposition
Justin Lemkul
jalemkul at vt.edu
Tue Jun 9 19:13:51 CEST 2020
Hi All,
I'm trying (once again) to get back into figuring out the lingering bugs
with the Drude implementation when using domain decomposition. Since I
last asked for help, I have gotten coordinate and velocity communication
working properly. Now, I'm stuck on forces. To quickly recap the issue,
it is possible for a Drude and its parent atom to end up in
different domains. This requires communicating coordinates,
velocities, and forces by treating Drudes as "special atoms", as is the
case with virtual sites. As such, my implementation largely follows what
happens for the virtual sites (communicate after any update).
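To make the force part concrete, here is a toy model of what force
communication for a split Drude/parent pair has to accomplish. It is
plain Python, not GROMACS code: the function name and the dict-based
data layout are hypothetical, and it assumes the force that a
neighboring rank accumulates on its copy of an atom is simply added back
onto the atom on its home rank.

```python
def reduce_special_atom_forces(home_forces, remote_forces):
    """Toy reduction: add remote-copy forces onto the home atoms.

    home_forces   -- dict atom index -> [fx, fy, fz] on the owning rank
    remote_forces -- dict atom index -> [fx, fy, fz] accumulated on the
                     copy of that atom held by another rank
    Both layouts are illustrative assumptions, not GROMACS structures.
    """
    for atom, f_remote in remote_forces.items():
        f_home = home_forces[atom]
        for dim in range(3):
            f_home[dim] += f_remote[dim]
            # Clear the copy so the contribution cannot be counted twice.
            f_remote[dim] = 0.0
    return home_forces
```

If the real code behaves like this model, the per-rank values before the
reduction are only partial sums, so a single-domain run and a
multi-domain run would be expected to agree only after the communication
step.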
I have been tracing the forces at every step of do_force - basically
printing out the force on a Drude that I know is in a different domain
from its parent atom. I use the OpenMP output as reference. I can
reproduce the OpenMP forces with domain decomposition but no
communication (e.g. gmx mdrun -ntmpi 2 -npme 1 -deffnm md -nb cpu),
based on Berk's suggestion from a long time ago. So the issue I'm having
must be coming from communicating somewhere, but I can't nail it down.
Here is an example of the output I'm looking at.
First, from OpenMP (my reference, the correct output):
=== Step 0 ===
DO FORCE: top f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 1271.383667 -3106.622803 2148.540283
DO FORCE: after nbnxn_atomdata_add_nbat_fshift_to_fshift f[54] = 1271.383667 -3106.622803 2148.540283
DO FORCE: after do_force_lowlevel f[54] = 82.651733 130.833740 82.218506
DO FORCE: b4 move_f f[54] = 82.651733 130.833740 82.218506
DO FORCE: after move_f f[54] = 82.651733 130.833740 82.218506
DO FORCE: after GPU use/emulate f[54] = 82.651733 130.833740 82.218506
DO FORCE: after vsite_spread f[54] = 82.651733 130.833740 82.218506
DO FORCE: b4 post f[54] = 82.651733 130.833740 82.218506
DO FORCE: end f[54] = 58.264297 16.147758 43.956337
=== Step 1 ===
DO FORCE: top f[54] = 58.264297 16.147758 43.956337
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 1205.647705 -3128.451904 2138.944580
DO FORCE: after nbnxn_atomdata_add_nbat_fshift_to_fshift f[54] = 1205.647705 -3128.451904 2138.944580
DO FORCE: after do_force_lowlevel f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: b4 move_f f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: after move_f f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: after GPU use/emulate f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: after vsite_spread f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: b4 post f[54] = 200.794189 -175.644287 -279.924072
DO FORCE: end f[54] = 162.370026 -306.717041 -321.102356
Now, my implementation with domain decomposition:
=== Step 0 ===
DO FORCE: top f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 338.912842 -2940.618164 2357.080078
DO FORCE: after do_force_lowlevel f[54] = 1899.546387 -1663.452881 1703.655273
DO FORCE: b4 move_f f[54] = 1899.546387 -1663.452881 1703.655273
DO FORCE: after move_f f[54] = 82.647949 130.835449 82.213165
DO FORCE: after GPU use/emulate f[54] = 82.647949 130.835449 82.213165
DO FORCE: after vsite_spread f[54] = 82.647949 130.835449 82.213165
DO FORCE: b4 post f[54] = 82.647949 130.835449 82.213165
DO FORCE: end f[54] = 58.260483 16.149330 43.951458
=== Step 1 ===
DO FORCE: top f[54] = 58.260483 16.149330 43.951458
DO FORCE: after do_nb_verlet #1 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after do_nb_verlet #2 f[54] = 0.000000 0.000000 0.000000
DO FORCE: after nbnxn_atomdata_add_nbat_f_to_f f[54] = 265.444092 -2965.024170 2346.120117
DO FORCE: after do_force_lowlevel f[54] = 1834.273926 -1685.225830 1654.119141
DO FORCE: b4 move_f f[54] = 1834.273926 -1685.225830 1654.119141
DO FORCE: after move_f f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: after GPU use/emulate f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: after vsite_spread f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: b4 post f[54] = 258.300781 -122.286865 -219.277039
DO FORCE: end f[54] = 229.446487 -248.274734 -255.144485
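From this output, I can see that communication works in step 0 and
between steps 0 and 1, since the force is correctly propagated: after
move_f in step 0, the forces agree with the OpenMP reference to within
about 0.01 per component. In step 1, however, the forces after move_f
differ (258.300781 -122.286865 -219.277039 with domain decomposition vs.
200.794189 -175.644287 -279.924072 with OpenMP). I do not know to what
extent I can expect forces to match before the "move_f" step, which is
where I communicate non-local Drude forces, following the existing
"dd_move_f" in do_force_cutsVERLET. But the forces should certainly be
the same after communicating, so that they are correctly input to
post_process_forces.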
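A comparison like this can be automated with a short script. The sketch
below is not part of GROMACS; it only assumes the "=== Step N ===" and
"DO FORCE: <label> f[<index>] = x y z" layout shown above, and it
tolerates values wrapped onto the next line. The function names, the
default list of post-communication labels, and the 0.05 tolerance are
illustrative choices.

```python
import re

# Matches either a step header or one force line. Whitespace between
# the "=" and the three components may include a line break.
TOKEN_RE = re.compile(
    r"=== Step (?P<step>\d+) ==="
    r"|DO FORCE: (?P<label>.+?) f\[\d+\]\s*=\s*"
    r"(?P<x>-?\d+\.\d+)\s+(?P<y>-?\d+\.\d+)\s+(?P<z>-?\d+\.\d+)"
)

# Stages after force communication, as labeled in the traces above.
POST_COMM_LABELS = (
    "after move_f",
    "after GPU use/emulate",
    "after vsite_spread",
    "b4 post",
    "end",
)


def parse_trace(text):
    """Return {(step, label): (fx, fy, fz)} in the order printed."""
    forces = {}
    step = None
    for match in TOKEN_RE.finditer(text):
        if match.group("step") is not None:
            step = int(match.group("step"))
        else:
            key = (step, match.group("label"))
            forces[key] = tuple(float(match.group(k)) for k in "xyz")
    return forces


def first_divergence(ref, test, labels=POST_COMM_LABELS, tol=0.05):
    """Return the first (step, label) where test differs from ref.

    Only stages listed in `labels` are compared; None means no
    compared stage differs by more than `tol` in any component.
    """
    for key, f_test in test.items():
        if key[1] not in labels or key not in ref:
            continue
        diff = max(abs(a - b) for a, b in zip(ref[key], f_test))
        if diff > tol:
            return key
    return None
```

Calling parse_trace on the text of each trace and passing the two
results to first_divergence gives the first post-communication stage at
which the runs disagree.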
Can anyone suggest how the code paths might differ between steps 0 and
1? I've debugged every stage along the way that I can think of, and
all I can establish is that the forces end up different. I know that
may be a big request without seeing the code, but I'm simply determining
non-local Drudes the same way we do with vsites, and communicating their
forces with the existing dd_move_f_specat function that vsites also use.
Any help would be greatly appreciated. I've been stuck on this forever
and it is clear that our user community really wants this feature. I can
give them OpenMP easily, but that's rather restrictive...
-Justin
--
==================================================
Justin A. Lemkul, Ph.D.
Assistant Professor
Office: 301 Fralin Hall
Lab: 303 Engel Hall
Virginia Tech Department of Biochemistry
340 West Campus Dr.
Blacksburg, VA 24061
jalemkul at vt.edu | (540) 231-3129
http://www.thelemkullab.com
==================================================