[gmx-users] mdrun -rerun does not reproduce itself
Christopher Neale
chris.neale at alum.utoronto.ca
Tue Apr 26 19:44:00 CEST 2016
Dear Users:
I find that running gromacs 5.1.2 mdrun -rerun many times gives different energies for some components. On GPUs, it's the SR interactions (both LJ and q) that are inconsistent. On CPU only, it's the Coul. recip. that is inconsistent. On CPU only with NPME=1 I find consistency, but I am still surprised by how much the coulombic SR depends on the approach (see the differences in average value between the 3 approaches).
Note that the issue here is not that the rerun energies differ from the runtime energies, but that mdrun -rerun is not entirely reproducible compared to itself.
Also note that in the GPU test the changes in Coul. SR, but not the changes in LJ SR, are reflected in the last column, which is the total potential energy... strange, and possibly off-topic to this post, but I note it in case some of these differences might just be output related. There's also the difference between rows 5 and 6 of the GPU test, where none of the potential energy components changes, but the total does change.
### [GPU TEST] Note the differences down column 8 (LJ SR) and column 9 (q SR)
$ for((i=1;i<=10;i++)); do ~/exec/GROMACS/exec/gromacs-5.1.2/gpu_serial/bin/gmx mdrun -notunepme -dlb yes -npme 0 -cpt 60 -gpu_id 0123 -ntmpi 4 -ntomp 6 -rerun TEMP.xtc -s MD2.tpr -deffnm SAME >/dev/null 2>&1; echo "1 2 3 4 5 6 7 8 9 10" | gmx energy -f SAME.edr -o epot_SAME.xvg -xvg none >/dev/null 2>&1; tail -n 1 epot_SAME.xvg; done
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.625000 4761.907227 -819692.500000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.947266 -932386.750000 4761.907227 -819692.625000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.750000 4761.907227 -819692.625000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.943359 -932386.750000 4761.907227 -819692.625000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.750000 4761.907227 -819692.625000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.750000 4761.907227 -819692.687500
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.750000 4761.907227 -819692.625000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.687500 4761.907227 -819692.562500
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.945312 -932386.750000 4761.907227 -819692.625000
2000.000000 22256.701172 121366.734375 69411.359375 813.758789 12650.896484 -141897.171875 23329.943359 -932386.625000 4761.907227 -819692.500000
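For scale, the run-to-run steps in the output above (0.0625 or 0.125 in column 9, about 0.002 in column 8) can be compared with the spacing between adjacent single-precision numbers at those magnitudes. The sketch below assumes the energies are stored as IEEE 754 single-precision values, which has not been confirmed for this build; the helper name `float32_spacing` is hypothetical.

```shell
# Print the gap between adjacent IEEE 754 single-precision values near x.
# Assumption: |x| >= 1 and x is a normal number (true for the energies above).
float32_spacing() {
    awk -v x="$1" 'BEGIN {
        if (x < 0) x = -x
        e = int(log(x) / log(2))         # exponent of the leading binary digit
        printf "%.10g\n", 2 ^ (e - 23)   # 23 explicit mantissa bits in single precision
    }'
}

float32_spacing -932386.75     # magnitude of column 9 (q SR)
float32_spacing 23329.945312   # magnitude of column 8 (LJ SR)
```

The two calls print 0.0625 and 0.001953125, which match the step sizes seen down columns 9 and 8. That would be consistent with the repeat-to-repeat variation being one or two units in the last place of a single-precision number, though it does not by itself explain where the variation comes from.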
### [CPU TEST] Note the differences down column 10 (Coul. recip.). However, note that Coul. recip. changes are again not reflected in the last column, which is the total potential energy.
$ for((i=1;i<=10;i++)); do ~/exec/GROMACS/exec/gromacs-5.1.2/serial/bin/gmx mdrun -notunepme -dlb yes -npme 0 -cpt 60 -ntmpi 4 -ntomp 6 -rerun TEMP.xtc -s MD2.tpr -deffnm SAME >/dev/null 2>&1; echo "1 2 3 4 5 6 7 8 9 10" | gmx energy -f SAME.edr -o epot_SAME.xvg -xvg none >/dev/null 2>&1; tail -n 1 epot_SAME.xvg; done
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.906738 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.906738 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
2000.000000 22256.705078 121366.718750 69411.343750 813.758789 12650.895508 -141897.234375 23329.876953 -932379.437500 4761.907227 -819685.437500
### One way that seems to keep things identical is to run on CPUs only and set NPME=1 (see below), though the results are again inconsistent with NPME=2 (not shown).
$ for((i=1;i<=10;i++)); do ~/exec/GROMACS/exec/gromacs-5.1.2/serial/bin/gmx mdrun -notunepme -dlb yes -npme 1 -cpt 60 -ntmpi 4 -ntomp 6 -rerun TEMP.xtc -s MD2.tpr -deffnm SAME >/dev/null 2>&1; echo "1 2 3 4 5 6 7 8 9 10" | gmx energy -f SAME.edr -o epot_SAME.xvg -xvg none >/dev/null 2>&1; tail -n 1 epot_SAME.xvg; done
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
2000.000000 22256.707031 121366.617188 69411.320312 813.758850 12650.896484 -141897.250000 23329.882812 -932365.500000 4761.785156 -819671.812500
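Rather than comparing the columns by eye, the last lines from the repeated reruns can be checked mechanically. This is a minimal sketch, assuming the ten `tail -n 1` lines from one test have been saved to a file; the function name `varying_columns` and the file name `last_lines.txt` are hypothetical.

```shell
# Read one line per rerun on standard input and report every
# whitespace-separated column that takes more than one distinct value.
varying_columns() {
    awk '
        {
            for (c = 1; c <= NF; c++) {
                # Count each (column, value) pair only the first time it is seen.
                if (!((c, $c) in seen)) {
                    seen[c, $c] = 1
                    distinct[c]++
                }
            }
            if (NF > maxnf) maxnf = NF
        }
        END {
            for (c = 1; c <= maxnf; c++)
                if (distinct[c] > 1)
                    printf "column %d: %d distinct values\n", c, distinct[c]
        }
    '
}

# Usage (hypothetical file name):
#   varying_columns < last_lines.txt
```

The comparison is on the printed text, so two values count as different only if they differ in the digits that gmx energy wrote out. Fed the ten GPU lines above, it should flag columns 8, 9, and 11; fed the NPME=1 lines, it should print nothing.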
So I guess that the solution to getting reproducible energies from mdrun -rerun is to avoid GPUs and to use NPME=1.
However, I am quite surprised by the size of the differences in the coulombic SR energies depending on the architecture (and these differences are reflected in the total potential energy):
coul SR on GPUs: -932386
coul SR on CPUs with NPME=0: -932379
coul SR on CPUs with NPME=1: -932365
I certainly hope that a 21 kJ/mol difference in potential energy is not actually occurring during runs on GPU vs. CPU with NPME=1 (or a 14 kJ/mol difference even staying on CPUs, depending on whether NPME=1 or 0).
Hopefully I've just missed something obvious here.
I appreciate any insight.
Thank you,
Chris.