[gmx-users] continuation run segmentation fault

David de Sancho daviddesancho at gmail.com
Thu Jul 24 17:29:56 CEST 2014


Dear all
I am having some trouble continuing some runs with Gromacs 4.5.5 on our
local cluster. Surprisingly, the same system ran smoothly before on the
same number of nodes and cores. Even more surprisingly, if I reduce the
number of nodes to 1, with its 12 processors, then it runs again.

The script I am using to run the simulations looks something like this:

# Set some Torque options: class name and max time for the job. Torque developed
> # from a program called OpenPBS, hence all the PBS references in this file
> #PBS -l nodes=4:ppn=12,walltime=24:00:00

source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash
> application="/home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel"
> options="-s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename"
>
> #! change the working directory (default is home directory)
> cd $PBS_O_WORKDIR
> echo Running on host `hostname`
> echo Time is `date`
> echo Directory is `pwd`
> echo PBS job ID is $PBS_JOBID
> echo This job runs on the following machines:
> echo `cat $PBS_NODEFILE | uniq`
> #! Run the parallel MPI executable
> #!export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/lib64:/usr/lib64"
> echo "Running mpiexec $application $options"
> mpiexec $application $options
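
For reference, mpiexec is not given an explicit -np here, so (if I
understand the OpenMPI/Torque integration right) it launches one process
per allocated slot. A line like the following, which is not in the
original script and is just a convenience for the log, would print what
that allocation actually is:

echo "Torque allocated `cat $PBS_NODEFILE | wc -l` slots on `cat $PBS_NODEFILE | uniq | wc -l` nodes"

With nodes=4:ppn=12 that should report 48 slots on 4 nodes, and with
nodes=1:ppn=12 it reports 12 slots on 1 node, which is the case that
still works.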


The error messages I am getting look something like this:

> [compute-0-11:09645] *** Process received signal ***
> [compute-0-11:09645] Signal: Segmentation fault (11)
> [compute-0-11:09645] Signal code: Address not mapped (1)
> [compute-0-11:09645] Failing at address: 0x10
> [compute-0-11:09643] *** Process received signal ***
> [compute-0-11:09643] Signal: Segmentation fault (11)
> [compute-0-11:09643] Signal code: Address not mapped (1)
> [compute-0-11:09643] Failing at address: 0xd0
> [compute-0-11:09645] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
> [compute-0-11:09645] [ 1]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
> [0x2af2091443f9]
> [compute-0-11:09645] [ 2]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
> [0x2af209142963]
> [compute-0-11:09645] [ 3]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so
> [0x2af20996e33c]
> [compute-0-11:09645] [ 4]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87)
> [0x2af20572cfa7]
> [compute-0-11:09645] [ 5]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0
> [0x2af205219636]
> [compute-0-11:09645] [ 6]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
> [0x2af20aa2259b]
> [compute-0-11:09645] [ 7]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
> [0x2af20aa2a04b]
> [compute-0-11:09645] [ 8]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
> [0x2af20aa22da9]
> [compute-0-11:09645] [ 9]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc)
> [0x2af205204dcc]
> [compute-0-11:09645] [10]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c)
> [0x2af205236f0c]
> [compute-0-11:09645] [11]
> /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b)
> [0x2af204b8ba6b]
> [compute-0-11:09645] [12]
> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c)
> [0x415aac]
> [compute-0-11:09645] [13]
> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928)
> [0x41d968]
> [compute-0-11:09645] [14] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x38d281d994]
> [compute-0-11:09643] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
> [compute-0-11:09643] [ 1]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
> [0x2b56aca403f9]
> [compute-0-11:09643] [ 2]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
> [0x2b56aca3e963]
> [compute-0-11:09643] [ 3]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so
> [0x2b56ad26a33c]
> [compute-0-11:09643] [ 4]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87)
> [0x2b56a9028fa7]
> [compute-0-11:09643] [ 5]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0
> [0x2b56a8b15636]
> [compute-0-11:09643] [ 6]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
> [0x2b56ae31e59b]
> [compute-0-11:09643] [ 7]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
> [0x2b56ae32604b]
> [compute-0-11:09643] [ 8]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
> [0x2b56ae31eda9]
> [compute-0-11:09643] [ 9]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc)
> [0x2b56a8b00dcc]
> [compute-0-11:09643] [10]
> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c)
> [0x2b56a8b32f0c]
> [compute-0-11:09643] [11]
> /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b)
> [0x2b56a8487a6b]
> [compute-0-11:09643] [12]
> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c)
> [0x415aac]
> [compute-0-11:09643] [13]
> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928)
> [0x41d968]
> [compute-0-11:09643] [14] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x38d281d994]
> [compute-0-11:09643] [15]
> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(do_cg+0x189)
> [0x407449]
> [compute-0-11:09643] *** End of error message ***
> [compute-0-11:09645] [15]
> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(do_cg+0x189)
> [0x407449]
> [compute-0-11:09645] *** End of error message ***
> [compute-0-13.local][[30524,1],19][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking]
> recv(15) failed: Connection reset by peer (104)
> [compute-0-13.local][[30524,1],17][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking]
> recv(15) failed: Connection reset by peer (104)
> [compute-0-12.local][[30524,1],29][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking]
> recv(15) failed: Connection reset by peer (104)


I have carried out a number of checks. The continuation runs crash right
away. The segfaults have occurred on two different nodes, so a bad compute
node can probably be ruled out. The MPI library works fine on a number of
test programs, and there are no signs of system problems. On the other
hand, signal 11 means the process tried to access memory it should not
have access to, and the backtrace above points at MPI_Comm_split called
from gmx_setup_nodecomm.
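
Given that, one more thing I could try is a minimal program that exercises
MPI_Comm_split by itself across the same four nodes. A rough sketch,
assuming 12 ranks per node as in the job script (the file names are just
placeholders):

cat > comm_split_test.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks, node_rank;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Split MPI_COMM_WORLD into groups of 12 consecutive ranks,
       roughly the kind of intra-node split gmx_setup_nodecomm does */
    MPI_Comm_split(MPI_COMM_WORLD, rank / 12, rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    printf("world rank %d of %d -> sub-communicator rank %d\n",
           rank, nranks, node_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
EOF
mpicc comm_split_test.c -o comm_split_test
mpiexec ./comm_split_test

If that also segfaults on 4 nodes but not on 1, the problem would seem to
be in the OpenMPI installation rather than in the Gromacs checkpoint; if
it runs cleanly, the problem is more likely on the Gromacs side.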

Any ideas on what may be going wrong?

Thanks


David

