[gmx-users] Not all bonded interactions have been properly assigned to the domain decomposition cells

Cardenas, Alfredo E alfredo at ices.utexas.edu
Tue Mar 10 21:29:48 CET 2020


Hi,
I am using gromacs-2019.4. I have been running all-atom simulations of a box that contains a peptide embedded in a DOPC bilayer membrane. I have been running for weeks on a TACC machine that has 4 GPUs in a single node, so I usually run 4 trajectories on a single node using the -multidir option. My submission script is:

#!/bin/bash
#SBATCH -J sb16             # Job name
#SBATCH -o test.o%j              # Name of stdout output file
#SBATCH -e test.e%j              # Name of stderr error file
#SBATCH -N 1                  # Total number of nodes requested
#SBATCH -n 4                # Total number of mpi tasks requested
#SBATCH -p rtx # Queue (partition) name -- normal, development, etc.
#SBATCH -t 48:00:00           # Run time (hh:mm:ss)
module load cuda/10.1
module use -a /home1/01247/alfredo/Software/ForGPU/plumed-2.5.3/MyInstall/lib/plumed/ModuleFile
module load plumed_gpu
export OMP_NUM_THREADS=4
ibrun /home1/01247/alfredo/Software/gromacs-2019.4_gpu/build-gpu-mpi-plumed/My_install/bin/mdrun_mpi -s topol.tpr -plumed plumed.dat -multidir 1 2 3 4


Because that system is going to be down for a week, I want to do continuation runs on a slower computer system, also using GPUs. Because that system is slower, I want to run across two nodes. A script that I have used successfully on the old machine is:

#!/bin/bash
#SBATCH -J SB9_pi1             # Job name
#SBATCH -o test.o%j              # Name of stdout output file
#SBATCH -N 2                  # Total number of nodes requested
#SBATCH -n 2                # Total number of mpi tasks requested
#SBATCH -p gpu # Queue (partition) name -- normal, development, etc.
#SBATCH -t 24:00:00           # Run time (hh:mm:ss)
module load gcc/5.2.0
module load cray_mpich/7.7.3
module load cuda/9.0
# Launch MPI-based executable
export OMP_NUM_THREADS=6
ibrun  /home1/01247/alfredo/gromacs-2019.4/build_MPI/My_install/bin/mdrun_mpi -s topol2.tpr -pin on -cpi state.cpt -noappend

It works great if I set up a new simulation of the same molecular system (i.e., create a new tpr file). But if I attempt a continuation run coming from the other machine (which used 4 threads), I get:

Not all bonded interactions have been properly assigned to the domain decomposition cells
A list of missing interactions:
                Bond of  10801 missing     -5
                 U-B of  53187 missing     22
         Proper Dih. of  89703 missing    119
               LJ-14 of  73729 missing      3
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

And it stops. If I modify the script for the old machine to use 1 node, 1 task, and 4 threads, it runs well, but it is a lot slower.
My question is whether there is any way to avoid this error, so that I can do a continuation run from state.cpt with a different domain decomposition. I have seen it suggested on this list to use -rdd. The value printed in the log file is 1.595 nm; I increased it to 2.0 nm and got a similar error.
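
In case it helps, the modified launch line I tried with -rdd looked roughly like this (same binary and input files as in the script above; 2.0 nm is the value I tested):

# continuation attempt with an explicit bonded cut-off distance for domain decomposition
# (the log file reported 1.595 nm; 2.0 nm was my test value)
ibrun /home1/01247/alfredo/gromacs-2019.4/build_MPI/My_install/bin/mdrun_mpi \
      -s topol2.tpr -pin on -cpi state.cpt -noappend -rdd 2.0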

Thanks,

Alfredo




