Subject: Re: Re: [gmx-users] Gromacs 4 bug?

Berk Hess gmx3 at hotmail.com
Tue Jan 13 11:29:30 CET 2009


Hi,

This is strange.
You run on 4 nodes and all processes hang at the same MPI call.
I see no reason why they should hang if they are all at the correct call.
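
To compare the four traces quoted below at a glance, a small script can reduce each one to the outermost MPI call and the frame that issued it. This is only a sketch: the names summarize and group_by_call_site are made up for illustration, and the TRACES strings are just the relevant frames copied from the gdb output in Patrick's message. In that output all four ranks are inside MPI_Sendrecv called from gmx_sum_qgrid_dd, but ranks 0 and 3 show address 0x4aebfd in that frame while ranks 1 and 2 show 0x4aed44. That may mean the ranks are at two different MPI_Sendrecv call sites within the function; without debug symbols the traces alone cannot confirm it.

```python
import re
from collections import defaultdict

# A gdb frame line looks like "#7  0x00000000004aebfd in gmx_sum_qgrid_dd ()".
# finditer is used so a leading "> " quote prefix does not matter.
FRAME = re.compile(r"#\d+\s+(0x[0-9a-f]+) in (\w+) \(\)")


def summarize(trace):
    """Return (outermost MPI call, innermost MPI call, caller, caller address)."""
    # gdb lists frames innermost first, so later entries are closer to main().
    frames = [(m.group(2), m.group(1)) for m in FRAME.finditer(trace)]
    mpi = [i for i, (name, _) in enumerate(frames) if name.startswith("MPI_")]
    inner, outer = mpi[0], mpi[-1]
    # The frame right after the outermost MPI_* frame is the application caller.
    caller, address = frames[outer + 1]
    return frames[outer][0], frames[inner][0], caller, int(address, 16)


# Relevant frames only, copied from the four xterm outputs (rank -> trace).
TRACES = {
    0: """#4  0x000000000073ced0 in lam_send ()
#5  0x000000000075328e in MPI_Send ()
#6  0x000000000074d7ec in MPI_Sendrecv ()
#7  0x00000000004aebfd in gmx_sum_qgrid_dd ()""",
    1: """#4  0x000000000073ea90 in MPI_Wait ()
#5  0x000000000074d800 in MPI_Sendrecv ()
#6  0x00000000004aed44 in gmx_sum_qgrid_dd ()""",
    2: """#3  0x000000000073ced0 in lam_send ()
#4  0x000000000075328e in MPI_Send ()
#5  0x000000000074d7ec in MPI_Sendrecv ()
#6  0x00000000004aed44 in gmx_sum_qgrid_dd ()""",
    3: """#3  0x000000000073ea90 in MPI_Wait ()
#4  0x000000000074d800 in MPI_Sendrecv ()
#5  0x00000000004aebfd in gmx_sum_qgrid_dd ()""",
}


def group_by_call_site(traces):
    """Group ranks by (outermost MPI call, caller, address in caller)."""
    groups = defaultdict(list)
    for rank, trace in sorted(traces.items()):
        outer, inner, caller, address = summarize(trace)
        groups[(outer, caller, hex(address))].append((rank, inner))
    return dict(groups)


if __name__ == "__main__":
    for site, ranks in group_by_call_site(TRACES).items():
        print(site, ranks)
```

Each group lists the ranks stopped at that call site together with the MPI routine they are inside (MPI_Send or MPI_Wait), which makes it easier to see who is sending and who is waiting.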

After how many steps does this happen?
If it is not many, I can check whether it also hangs on our system.
Otherwise, could you try to generate a checkpoint file from
which it hangs quickly?

What version of MPI are you using?

Berk


> Date: Tue, 13 Jan 2009 10:53:25 +0100
> From: patrick.fuchs at univ-paris-diderot.fr
> To: gmx-users at gromacs.org
> Subject: Re: Subject: Re: Re: [gmx-users] Gromacs 4 bug?
> 
> Hi Berk,
> I did a test on gromacs-4.0.2 under Fedora 10 (with fftw-3.0.1 and 
> lam-7.1.4), using a slightly upgraded version of gcc compared to my 
> previous post (gcc version 4.3.2 20081105 (Red hat 4.3.2-7)) on the same 
> hardware but it still hangs (so both FC9 and FC10 give the same problem, 
> while FC8 does not). I was finally able to run mdrun_mpi in the debugger.
> You were right: mdrun seems to hang at an MPI call. Here is the output of
> each xterm:
> 
> XTERM1
> ===================================================================
> GNU gdb Fedora (6.8-29.fc10)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (gdb) run
> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi
> [Thread debugging using libthread_db enabled]
> [New Thread 0x12df30 (LWP 8285)]
> NNODES=4, MYRANK=0, HOSTNAME=cumin.dsimb.inserm.fr
> NODEID=0 argc=1
>                           :-)  G  R  O  M  A  C  S  (-:
> 
>                 Giant Rising Ordinary Mutants for A Clerical Setup
> 
>                              :-)  VERSION 4.0.2  (-:
> 
> [snip]
> 
> starting mdrun 'Pure DLPC bilayer with 128 lipids and 3655 SPC water'
> 5000000 steps,  10000.0 ps.
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install 
> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 
> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 
> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 
> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64
> (gdb) where
> #0  0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6
> #1  0x0000000000770c83 in lam_ssi_rpi_usysv_proc_read_env ()
> #2  0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()
> #3  0x000000000074a1e0 in _mpi_req_advance ()
> #4  0x000000000073ced0 in lam_send ()
> #5  0x000000000075328e in MPI_Send ()
> #6  0x000000000074d7ec in MPI_Sendrecv ()
> #7  0x00000000004aebfd in gmx_sum_qgrid_dd ()
> #8  0x00000000004b40bb in gmx_pme_do ()
> #9  0x0000000000479a58 in do_force_lowlevel ()
> #10 0x00000000004d1d32 in do_force ()
> #11 0x00000000004214d2 in do_md ()
> #12 0x000000000041bea0 in mdrunner ()
> #13 0x0000000000422b94 in main ()
> (gdb)
> ===================================================================
> 
> 
> XTERM2
> ===================================================================
> GNU gdb Fedora (6.8-29.fc10)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (gdb) run
> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi
> [Thread debugging using libthread_db enabled]
> [New Thread 0x12df30 (LWP 8294)]
> NNODES=4, MYRANK=1, HOSTNAME=cumin.dsimb.inserm.fr
> NODEID=1 argc=1
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install 
> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 
> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 
> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 
> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64
> (gdb) where
> #0  0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6
> #1  0x0000000000770c83 in lam_ssi_rpi_usysv_proc_read_env ()
> #2  0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()
> #3  0x000000000074a1e0 in _mpi_req_advance ()
> #4  0x000000000073ea90 in MPI_Wait ()
> #5  0x000000000074d800 in MPI_Sendrecv ()
> #6  0x00000000004aed44 in gmx_sum_qgrid_dd ()
> #7  0x00000000004b40bb in gmx_pme_do ()
> #8  0x0000000000479a58 in do_force_lowlevel ()
> #9  0x00000000004d1d32 in do_force ()
> #10 0x00000000004214d2 in do_md ()
> #11 0x000000000041bea0 in mdrunner ()
> #12 0x0000000000422b94 in main ()
> (gdb)
> ===================================================================
> 
> 
> XTERM3
> ===================================================================
> GNU gdb Fedora (6.8-29.fc10)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (gdb) run
> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi
> [Thread debugging using libthread_db enabled]
> [New Thread 0x12df30 (LWP 8276)]
> NNODES=4, MYRANK=2, HOSTNAME=cumin.dsimb.inserm.fr
> NODEID=2 argc=1
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()
> Missing separate debuginfos, use: debuginfo-install 
> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 
> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 
> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 
> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64
> (gdb) where
> #0  0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()
> #1  0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()
> #2  0x000000000074a1e0 in _mpi_req_advance ()
> #3  0x000000000073ced0 in lam_send ()
> #4  0x000000000075328e in MPI_Send ()
> #5  0x000000000074d7ec in MPI_Sendrecv ()
> #6  0x00000000004aed44 in gmx_sum_qgrid_dd ()
> #7  0x00000000004b40bb in gmx_pme_do ()
> #8  0x0000000000479a58 in do_force_lowlevel ()
> #9  0x00000000004d1d32 in do_force ()
> #10 0x00000000004214d2 in do_md ()
> #11 0x000000000041bea0 in mdrunner ()
> #12 0x0000000000422b94 in main ()
> (gdb)
> ===================================================================
> 
> 
> XTERM4
> ===================================================================
> GNU gdb Fedora (6.8-29.fc10)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (gdb) run
> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi
> [Thread debugging using libthread_db enabled]
> [New Thread 0x12df30 (LWP 8267)]
> NNODES=4, MYRANK=3, HOSTNAME=cumin.dsimb.inserm.fr
> NODEID=3 argc=1
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()
> Missing separate debuginfos, use: debuginfo-install 
> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 
> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 
> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 
> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64
> (gdb) where
> #0  0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()
> #1  0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()
> #2  0x000000000074a1e0 in _mpi_req_advance ()
> #3  0x000000000073ea90 in MPI_Wait ()
> #4  0x000000000074d800 in MPI_Sendrecv ()
> #5  0x00000000004aebfd in gmx_sum_qgrid_dd ()
> #6  0x00000000004b40bb in gmx_pme_do ()
> #7  0x0000000000479a58 in do_force_lowlevel ()
> #8  0x00000000004d1d32 in do_force ()
> #9  0x00000000004214d2 in do_md ()
> #10 0x000000000041bea0 in mdrunner ()
> #11 0x0000000000422b94 in main ()
> (gdb)
> ===================================================================
> 
> 
> Cheers,
> 
> Patrick
> 

