[gmx-users] mdrun -rerun does not reproduce itself

Christopher Neale chris.neale at alum.utoronto.ca
Tue Apr 26 19:44:00 CEST 2016


Dear Users:

I find that running mdrun -rerun many times with GROMACS 5.1.2 gives different energies for some components. On GPUs, it is the SR interactions (both LJ and q) that are inconsistent. On CPU only, it is the Coul. recip. that is inconsistent. On CPU only with NPME=1 I find consistency, but I am still surprised by how much the coulombic SR depends on the approach (see the differences in average value between the 3 approaches). Note that the issue here is not that the rerun energies differ from the runtime energies, but that mdrun -rerun is not entirely reproducible compared to itself.

Also note that, in the GPU test, changes in the Coul. SR are reflected in the last column, which is the total potential energy, but changes in the LJ SR are not... strange, and possibly off-topic to this post, but I note it in case some of these differences might just be output related. There's also the difference between rows 5 and 6 of the GPU test, where none of the potential energy components changes, but the total does change.

### [GPU TEST] Note the differences down column 8 (LJ SR) and column 9 (q SR)

$ for((i=1;i<=10;i++)); do ~/exec/GROMACS/exec/gromacs-5.1.2/gpu_serial/bin/gmx mdrun -notunepme -dlb yes -npme 0 -cpt 60 -gpu_id 0123 -ntmpi 4 -ntomp 6 -rerun TEMP.xtc -s MD2.tpr -deffnm SAME >/dev/null 2>&1; echo "1 2 3 4 5 6 7 8 9 10" | gmx energy -f SAME.edr -o epot_SAME.xvg -xvg none >/dev/null 2>&1; tail -n 1 epot_SAME.xvg; done

 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.625000  4761.907227  -819692.500000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.947266  -932386.750000  4761.907227  -819692.625000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.750000  4761.907227  -819692.625000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.943359  -932386.750000  4761.907227  -819692.625000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.750000  4761.907227  -819692.625000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.750000  4761.907227  -819692.687500
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.750000  4761.907227  -819692.625000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.687500  4761.907227  -819692.562500
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.945312  -932386.750000  4761.907227  -819692.625000
 2000.000000  22256.701172  121366.734375  69411.359375  813.758789  12650.896484  -141897.171875  23329.943359  -932386.625000  4761.907227  -819692.500000
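A by-eye comparison like the one above could also be scripted. The following is a minimal sketch and not part of the original test: `varying_columns` is a hypothetical helper name, and it assumes the final lines from the repeated reruns arrive as one whitespace-separated text stream in which column 1 is the time.

```shell
#!/usr/bin/env bash
# varying_columns: read whitespace-separated rows on stdin and print the
# 1-based indices of the columns whose value differs from the first row.
# Hypothetical helper for illustration; it is not a GROMACS tool.
varying_columns() {
  awk '
    {
      if (NF > ncol) ncol = NF
      for (i = 1; i <= NF; i++) {
        if (NR == 1) first[i] = $i            # remember the reference row
        else if ($i != first[i]) differs[i] = 1
      }
    }
    END {
      out = ""
      for (i = 1; i <= ncol; i++) {
        if (i in differs) {
          sep = (out == "" ? "" : " ")
          out = out sep i
        }
      }
      print out
    }'
}

# Demo with the first two GPU rows quoted above; prints: 8 9 11
printf '%s\n' \
  "2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.625000 4761.907227 -819692.500000" \
  "2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.947266 -932386.750000 4761.907227 -819692.625000" \
  | varying_columns
```

In practice the `done` of the loop above could be piped straight into `varying_columns`, so that only the indices of the inconsistent columns are reported.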


### [CPU TEST] Note the differences down column 10 (Coul. recip.). However, note that Coul. recip. changes are again not reflected in the last column, which is the total potential energy.

$ for((i=1;i<=10;i++)); do ~/exec/GROMACS/exec/gromacs-5.1.2/serial/bin/gmx mdrun -notunepme -dlb yes -npme 0 -cpt 60 -ntmpi 4 -ntomp 6 -rerun TEMP.xtc -s MD2.tpr -deffnm SAME >/dev/null 2>&1; echo "1 2 3 4 5 6 7 8 9 10" | gmx energy -f SAME.edr -o epot_SAME.xvg -xvg none >/dev/null 2>&1; tail -n 1 epot_SAME.xvg; done

 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.906738  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.906738  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
 2000.000000  22256.705078  121366.718750  69411.343750  813.758789  12650.895508  -141897.234375  23329.876953  -932379.437500  4761.907227  -819685.437500
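One possible reading of the run-to-run differences above, offered as an assumption rather than something stated in this post: if the energies are held in single precision (as in a mixed-precision build), then neighbouring representable values near a number x are 2^(floor(log2|x|) - 23) apart. The sketch below computes that spacing with a hypothetical helper, `ulp32`, so that the observed jumps can be compared against it.

```shell
#!/usr/bin/env bash
# ulp32 X: print the gap between adjacent IEEE-754 single-precision values
# near |X| (normal range only). Hypothetical helper for illustration.
# ASSUMPTION: the energies are stored as 32-bit floats; the post does not say so.
ulp32() {
  awk -v x="$1" 'BEGIN {
    if (x < 0) x = -x
    if (x == 0) { print "undefined for 0"; exit 1 }
    e = 0
    while (x >= 2) { x /= 2; e++ }   # find e such that 2^e <= |X| < 2^(e+1)
    while (x < 1)  { x *= 2; e-- }
    printf "%.10g\n", 2^(e - 23)     # 23 explicit mantissa bits in float32
  }'
}

ulp32 -932386.625   # Coul. SR magnitude
ulp32 23329.945312  # LJ SR magnitude
ulp32 4761.907227   # Coul. recip. magnitude
```

Near -932386 the spacing comes out as 0.0625, and near 4761.9 as about 0.000488, which is the size of the jumps seen between repeats (for example 4761.906738 versus 4761.907227). That would be consistent with differences of one or two units in the last place, but the cause is not confirmed here.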


### One way that seems to keep things identical is to run on CPUs only with NPME=1 (see below), though it is again inconsistent with NPME=2 (not shown).

$ for((i=1;i<=10;i++)); do ~/exec/GROMACS/exec/gromacs-5.1.2/serial/bin/gmx mdrun -notunepme -dlb yes -npme 1 -cpt 60 -ntmpi 4 -ntomp 6 -rerun TEMP.xtc -s MD2.tpr -deffnm SAME >/dev/null 2>&1; echo "1 2 3 4 5 6 7 8 9 10" | gmx energy -f SAME.edr -o epot_SAME.xvg -xvg none >/dev/null 2>&1; tail -n 1 epot_SAME.xvg; done

 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500
 2000.000000  22256.707031  121366.617188  69411.320312  813.758850  12650.896484  -141897.250000  23329.882812  -932365.500000  4761.785156  -819671.812500

So I guess that the solution to getting reproducible energies from mdrun -rerun is to avoid GPUs and to use NPME=1.

However, I am quite surprised by how much the coulombic SR energies vary with the architecture and the NPME setting (and these differences are reflected in the total potential energy):

Coul. SR on GPUs:             -932386
Coul. SR on CPUs with NPME=0: -932379
Coul. SR on CPUs with NPME=1: -932365

I certainly hope that a 21 kJ/mol difference in potential energy is not actually occurring during runs on CPU vs. GPU (or a 14 kJ/mol difference even staying on CPUs depending on whether NPME=1 or 0).
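The arithmetic behind these figures can be spelled out. The sketch below takes the three rounded Coul. SR values quoted above and prints each pairwise difference together with its size relative to the magnitude of the first value; `pairwise_diff` is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# pairwise_diff LABEL_A VALUE_A LABEL_B VALUE_B
# Print |A - B| in kJ/mol and the same difference relative to |A|.
# Hypothetical helper; VALUE_A must be non-zero.
pairwise_diff() {
  awk -v la="$1" -v a="$2" -v lb="$3" -v b="$4" 'BEGIN {
    d = a - b; if (d < 0) d = -d     # absolute difference
    m = (a < 0 ? -a : a)             # magnitude used for the relative size
    printf "%s vs %s: %.0f kJ/mol (relative %.1e)\n", la, lb, d, d / m
  }'
}

pairwise_diff "GPU"        -932386 "CPU NPME=1" -932365
pairwise_diff "CPU NPME=0" -932379 "CPU NPME=1" -932365
pairwise_diff "GPU"        -932386 "CPU NPME=0" -932379
```

This gives 21, 14 and 7 kJ/mol respectively, each a few parts in 10^5 of the Coul. SR energy. Whether a discrepancy of that size is expected is exactly the open question here; the sketch only makes the numbers explicit.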

Hopefully I've just missed something obvious here.
I appreciate any insight.

Thank you,
Chris.


