[gmx-users] continuation run segmentation fault

Szilárd Páll pall.szilard at gmail.com
Fri Jul 25 00:19:24 CEST 2014


Hi,

There is a certain version of Open MPI that caused a lot of headaches until
we realized it was buggy. I'm not entirely sure which version it was, but I
suspect it was the 1.4.3 shipped by default on Ubuntu 12.04 server.

I suggest that you try:
- using a different MPI version;
- using a single rank/no MPI to continue;
- using thread-MPI to continue (rough sketches of the last two are below).
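
Assuming you also have a non-MPI (thread-MPI) build of 4.5.5 installed as
plain "mdrun" (the binary name and thread count below are guesses, adjust
them to your installation), the continuation could look something like:

  source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash

  # thread-MPI: one process, 12 threads on a single node
  mdrun -nt 12 -s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename

  # or a single rank with no parallelism at all
  mdrun -nt 1 -s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename

If either of these continues cleanly from the checkpoint, that points the
finger at the MPI installation rather than at the checkpoint files.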

Cheers,
--
Szilárd


On Thu, Jul 24, 2014 at 5:29 PM, David de Sancho
<daviddesancho at gmail.com> wrote:
> Dear all
> I am having some trouble continuing some runs with Gromacs 4.5.5 on our
> local cluster. Surprisingly, the simulations ran smoothly before on the same
> system with the same number of nodes and cores. And even more surprisingly,
> if I reduce the number of nodes to 1 with its 12 processors, then it runs
> again.
>
> And the script I am using to run the simulations looks something like this:
>
> # Set some Torque options: class name and max time for the job. Torque
>> # developed from a program called OpenPBS, hence all the PBS references
>> # in this file
>> #PBS -l nodes=4:ppn=12,walltime=24:00:00
>
> source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash
>> application="/home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel"
>> options="-s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename"
>>
>> #! change the working directory (default is home directory)
>> cd $PBS_O_WORKDIR
>> echo Running on host `hostname`
>> echo Time is `date`
>> echo Directory is `pwd`
>> echo PBS job ID is $PBS_JOBID
>> echo This job runs on the following machines:
>> echo `cat $PBS_NODEFILE | uniq`
>> #! Run the parallel MPI executable
>> #!export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/lib64:/usr/lib64"
>> echo "Running mpiexec $application $options"
>> mpiexec $application $options
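
For comparison, the single-node case that does run presumably differs only in
the resource request, i.e. something like:

  #PBS -l nodes=1:ppn=12,walltime=24:00:00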
>
>
> And the error messages I am getting look something like this
>
>> [compute-0-11:09645] *** Process received signal ***
>> [compute-0-11:09645] Signal: Segmentation fault (11)
>> [compute-0-11:09645] Signal code: Address not mapped (1)
>> [compute-0-11:09645] Failing at address: 0x10
>> [compute-0-11:09643] *** Process received signal ***
>> [compute-0-11:09643] Signal: Segmentation fault (11)
>> [compute-0-11:09643] Signal code: Address not mapped (1)
>> [compute-0-11:09643] Failing at address: 0xd0
>> [compute-0-11:09645] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09645] [ 1]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2af2091443f9]
>> [compute-0-11:09645] [ 2]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2af209142963]
>> [compute-0-11:09645] [ 3]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so
>> [0x2af20996e33c]
>> [compute-0-11:09645] [ 4]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87)
>> [0x2af20572cfa7]
>> [compute-0-11:09645] [ 5]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0
>> [0x2af205219636]
>> [compute-0-11:09645] [ 6]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2af20aa2259b]
>> [compute-0-11:09645] [ 7]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2af20aa2a04b]
>> [compute-0-11:09645] [ 8]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2af20aa22da9]
>> [compute-0-11:09645] [ 9]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc)
>> [0x2af205204dcc]
>> [compute-0-11:09645] [10]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c)
>> [0x2af205236f0c]
>> [compute-0-11:09645] [11]
>> /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b)
>> [0x2af204b8ba6b]
>> [compute-0-11:09645] [12]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c)
>> [0x415aac]
>> [compute-0-11:09645] [13]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928)
>> [0x41d968]
>> [compute-0-11:09645] [14] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x38d281d994]
>> [compute-0-11:09643] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09643] [ 1]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2b56aca403f9]
>> [compute-0-11:09643] [ 2]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2b56aca3e963]
>> [compute-0-11:09643] [ 3]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so
>> [0x2b56ad26a33c]
>> [compute-0-11:09643] [ 4]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87)
>> [0x2b56a9028fa7]
>> [compute-0-11:09643] [ 5]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0
>> [0x2b56a8b15636]
>> [compute-0-11:09643] [ 6]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2b56ae31e59b]
>> [compute-0-11:09643] [ 7]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2b56ae32604b]
>> [compute-0-11:09643] [ 8]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2b56ae31eda9]
>> [compute-0-11:09643] [ 9]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc)
>> [0x2b56a8b00dcc]
>> [compute-0-11:09643] [10]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c)
>> [0x2b56a8b32f0c]
>> [compute-0-11:09643] [11]
>> /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b)
>> [0x2b56a8487a6b]
>> [compute-0-11:09643] [12]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c)
>> [0x415aac]
>> [compute-0-11:09643] [13]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928)
>> [0x41d968]
>> [compute-0-11:09643] [14] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x38d281d994]
>> [compute-0-11:09643] [15]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(do_cg+0x189)
>> [0x407449]
>> [compute-0-11:09643] *** End of error message ***
>> [compute-0-11:09645] [15]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(do_cg+0x189)
>> [0x407449]
>> [compute-0-11:09645] *** End of error message ***
>> [compute-0-13.local][[30524,1],19][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking]
>> recv(15) failed: Connection reset by peer (104)
>> [compute-0-13.local][[30524,1],17][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking]
>> recv(15) failed: Connection reset by peer (104)
>> [compute-0-12.local][[30524,1],29][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking]
>> recv(15) failed: Connection reset by peer (104)
>
>
> A number of checks have been carried out. The continuation runs crash right
> away. The segfaults have occurred on two different nodes, so bad compute
> nodes can probably be ruled out. The MPI library works fine on a number of
> test programs, and there are no signs of system problems. On the other hand,
> signal 11 means the process is trying to access memory that it should not
> have access to.
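
Given the openmpi-1.4.3-intel paths in the backtrace above, it may also be
worth double-checking which MPI the job actually picks up at run time, for
example:

  which mpiexec                  # should point at the MPI you intend to use
  mpiexec --version              # Open MPI reports its version here
  ompi_info | grep "Open MPI:"   # more detailed Open MPI version info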
>
> Any ideas on what may be going wrong?
>
> Thanks
>
>
> David

