[gmx-developers] Gromacs 2016.3 (and earlier) freezing up.

Åke Sandgren ake.sandgren at hpc2n.umu.se
Mon Sep 11 16:42:19 CEST 2017


My debugger run finally got to the lockup.

All processes are waiting on various MPI operations.

Attached a stack dump of all 56 tasks.

I'll keep the debug session running for a while in case anyone wants
some more detailed data.
This is a RelWithDebInfo build, though, so not everything is available.
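
For a run that was not started under a debugger, roughly the same kind of
information can be collected by attaching gdb to each running rank in batch
mode. The following is only a minimal sketch: the helper names are made up for
illustration, it assumes gdb is installed on the compute node and that the
user is allowed to attach to the process, and how much detail comes back
depends on the debug symbols in the build.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints a backtrace
    for every thread, and then detaches (batch mode exits when done)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid):
    """Run gdb against an already running process and return its output.

    Hypothetical helper: requires gdb on PATH and ptrace permission for
    the target process. The process is paused only while gdb is attached.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling dump_backtraces() once per mdrun PID on each node gives one
backtrace set per rank, which can then be compared by hand.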

On 09/08/2017 11:28 AM, Berk Hess wrote:
> But you should be able to get some (limited) information by attaching a
> debugger to an already running process with a release build.
> 
> If you plan on compiling and running a new case, use a release + debug
> symbols build. That should run as fast as a release build.
> 
> Cheers,
> 
> Berk
> 
> On 2017-09-08 11:23, Åke Sandgren wrote:
>> We have, at least, one case that when run over 2 nodes, or more, quite
>> often (always) hangs, i.e. no more output in md.log or otherwise while
>> mdrun still consumes cpu time. It takes a random time before it happens,
>> like 1-3 days.
>>
>> The case can be shared if someone else wants to investigate. I'm
>> planning to run it in the debugger to be able to break and look at
>> states when it happens, but since it takes so long with the production
>> build it is not something I'm looking forward to.
>>
>> On 09/08/2017 11:13 AM, Berk Hess wrote:
>>> Hi,
>>>
>>> We are far behind schedule for the 2017 release. We are working hard on
>>> it, but I don't think we can promise a date yet.
>>>
>>> We have a 2016.4 release planned for this week (might slip to next
>>> week). But if you can give us enough details to track down your hanging
>>> issue, we might be able to fix it in 2016.4.
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
-------------- next part --------------
Processes,Threads,Function
56,56,main (gmx.cpp:60)
56,56,  gmx::CommandLineModuleManager::run (cmdlinemodulemanager.cpp:583)
56,56,    gmx_mdrun (mdrun.cpp:549)
40,40,      gmx::mdrunner (runner.cpp:1341)
14,14,        gmx::do_md (md.cpp:1097)
14,14,          do_force (sim_util.cpp:1975)
5,5,            do_force_cutsVERLET (sim_util.cpp:1091)
5,5,              dd_move_x (domdec.cpp:469)
5,5,                dd_sendrecv_rvec (domdec_network.cpp:141)
5,5,                  PMPI_Sendrecv
5,5,                    mca_pml_ob1_send
5,5,                      sync_wait_mt
4,4,                        sched_yield (syscall-template.S:84)
1,1,                        opal_progress
1,1,                          opal_libevent2022_event_base_loop
1,1,                            mca_btl_tcp_component_recv_handler
1,1,                              mca_btl_tcp_proc_lookup
1,1,                                mca_btl_tcp_add_procs
1,1,                                  mca_btl_tcp_proc_create
1,1,                                    mca_btl_tcp_proc_destruct
1,1,                                      __GI___pthread_mutex_lock (pthread_mutex_lock.c:80)
1,1,                                        __lll_lock_wait (lowlevellock.S:135)
9,9,            do_force_cutsVERLET (sim_util.cpp:1380)
9,9,              dd_move_f (domdec.cpp:548)
9,9,                dd_sendrecv_rvec (domdec_network.cpp:141)
9,9,                  PMPI_Sendrecv
5,5,                    mca_pml_ob1_send
5,5,                      sync_wait_mt
4,4,                        sched_yield (syscall-template.S:84)
1,1,                        opal_progress
1,1,                          opal_libevent2022_event_base_loop
1,1,                            mca_btl_tcp_component_recv_handler
1,1,                              mca_btl_tcp_proc_lookup
1,1,                                mca_btl_tcp_add_procs
1,1,                                  mca_btl_tcp_proc_create
1,1,                                    mca_btl_tcp_proc_destruct
1,1,                                      __GI___pthread_mutex_lock (pthread_mutex_lock.c:80)
1,1,                                        __lll_lock_wait (lowlevellock.S:135)
4,4,                    ompi_request_default_wait
4,4,                      sync_wait_mt
4,4,                        sched_yield (syscall-template.S:84)
26,26,        gmx::do_md (md.cpp:1437)
26,26,          update_constraints (update.cpp:1359)
16,16,            constrain (constr.cpp:372)
16,16,              dd_move_x_constraints (domdec_constraints.cpp:96)
16,16,                dd_move_x_specat (domdec_specatomcomm.cpp:274)
16,16,                  dd_sendrecv2_rvec (domdec_network.cpp:205)
16,16,                    PMPI_Waitall
16,16,                      ompi_request_default_wait_all
16,16,                        sync_wait_mt
10,10,                          sched_yield (syscall-template.S:84)
6,6,                          opal_progress
5,5,                            opal_timer_base_get_usec_sys_timer
1,1,                            mca_btl_sm_component_progress
10,10,            constrain (constr.cpp:391)
10,10,              constrain_lincs (clincs.cpp:2423)
10,10,                GOMP_parallel (parallel.c:168)
10,10,                  constrain_lincs._omp_fn.3 (clincs.cpp:2436)
10,10,                    do_lincs (clincs.cpp:1045)
10,10,                      dd_move_x_constraints (domdec_constraints.cpp:96)
10,10,                        dd_move_x_specat (domdec_specatomcomm.cpp:266)
10,10,                          dd_sendrecv2_rvec (domdec_network.cpp:205)
10,10,                            PMPI_Waitall
10,10,                              ompi_request_default_wait_all
10,10,                                sync_wait_mt
8,8,                                  sched_yield (syscall-template.S:84)
2,2,                                  opal_progress
2,2,                                    opal_timer_base_get_usec_sys_timer
16,16,      gmx::mdrunner (runner.cpp:1359)
8,8,        gmx_pmeonly (pme-only.cpp:198)
8,8,          gmx_pme_recv_coeffs_coords (pme-pp.cpp:480)
8,8,            PMPI_Recv
8,8,              mca_pml_ob1_recv
8,8,                sync_wait_mt
6,6,                  sched_yield (syscall-template.S:84)
2,2,                  opal_progress
2,2,                    opal_timer_base_get_usec_sys_timer
2,2,        gmx_pmeonly (pme-only.cpp:238)
2,2,          gmx_pme_do (pme.cpp:1036)
2,2,            do_redist_pos_coeffs (pme-redistribute.cpp:473)
2,2,              dd_pmeredist_pos_coeffs (pme-redistribute.cpp:307)
2,2,                pme_dd_sendrecv (pme-redistribute.cpp:243)
2,2,                  PMPI_Sendrecv
2,2,                    ompi_request_default_wait
2,2,                      sync_wait_mt
2,2,                        sched_yield (syscall-template.S:84)
6,6,        gmx_pmeonly (pme-only.cpp:244)
6,6,          gmx_pme_send_force_vir_ener (pme-pp.cpp:850)
6,6,            PMPI_Waitall
6,6,              ompi_request_default_wait_all
6,6,                sync_wait_mt
4,4,                  sched_yield (syscall-template.S:84)
2,2,                  opal_progress
1,1,                    mca_btl_vader_component_progress
1,1,                    btl_openib_component_progress
1,1,                      __GI___pthread_mutex_lock (pthread_mutex_lock.c:80)
56,56,progress_engine
56,56,  opal_libevent2022_event_base_loop
56,56,    poll_dispatch
56,56,      poll (syscall-template.S:84)
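
To see at a glance which MPI calls the ranks are blocked in, a dump in the
format above can be tallied with a short script. This is a rough sketch, not
part of the original attachment: it assumes each row is
"processes,threads,frame", that nesting is shown by two spaces of indentation
per level, and that the MPI entry points are the frames starting with "PMPI_".
The function names and the SAMPLE text are invented for illustration.

```python
from collections import Counter

# Tiny made-up sample in the same layout as the dump above.
SAMPLE = """Processes,Threads,Function
3,3,main (gmx.cpp:60)
2,2,  PMPI_Sendrecv
2,2,    sync_wait_mt
1,1,  PMPI_Waitall
1,1,    sync_wait_mt
"""


def parse_stack_dump(text):
    """Return a list of (processes, depth, frame) for each data row.

    The header, blank lines, and rows with a damaged process count
    are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(",", 2)
        if len(parts) != 3 or not parts[0].strip().isdigit():
            continue
        frame = parts[2]
        indent = len(frame) - len(frame.lstrip(" "))
        rows.append((int(parts[0]), indent // 2, frame.strip()))
    return rows


def count_by_mpi_call(rows, prefix="PMPI_"):
    """Sum the process counts of every frame whose name starts with prefix."""
    counts = Counter()
    for processes, _depth, frame in rows:
        name = frame.split(" ", 1)[0]
        if name.startswith(prefix):
            counts[name] += processes
    return counts


def leaf_frames(rows):
    """Return (processes, frame) for the innermost frame of each branch,
    i.e. rows that are not followed by a more deeply indented row."""
    leaves = []
    for i, (processes, depth, frame) in enumerate(rows):
        if i + 1 == len(rows) or rows[i + 1][1] <= depth:
            leaves.append((processes, frame))
    return leaves
```

Feeding the attachment text to parse_stack_dump() and then
count_by_mpi_call() gives the number of ranks sitting in each MPI call;
leaf_frames() lists where inside the MPI library each group is.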

