[gmx-developers] Gromacs 2016.3 (and earlier) freezing up.
Åke Sandgren
ake.sandgren at hpc2n.umu.se
Mon Sep 11 16:42:19 CEST 2017
My debugger run finally got to the lockup.
All processes are waiting on various MPI operations.
Attached is a stack dump of all 56 tasks.
I'll keep the debug session running for a while in case anyone wants more
detailed data.
This is a RelWithDebInfo build, though, so not everything is available.
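For anyone who wants to reproduce this kind of dump, the workflow is roughly the following. The build directory, PID, and core count are placeholders; `-DCMAKE_BUILD_TYPE=RelWithDebInfo` is the standard CMake option. This is a sketch of the approach, not an official recipe:

```shell
# Configure a release build with debug symbols; it runs at release speed
# but keeps enough symbol information for useful backtraces.
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j 8

# Once the job hangs, attach gdb to an already-running mdrun rank
# (placeholder PID), dump backtraces of all its threads, then detach
# so the process is left running for further inspection.
gdb -p <mdrun-pid> -batch -ex "thread apply all bt" -ex detach
```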
On 09/08/2017 11:28 AM, Berk Hess wrote:
> But you should be able to get some (limited) information by attaching a
> debugger to an already running process with a release build.
>
> If you plan on compiling and running a new case, use a release + debug
> symbols build. That should run as fast as a release build.
>
> Cheers,
>
> Berk
>
> On 2017-09-08 11:23, Åke Sandgren wrote:
>> We have at least one case that, when run over two or more nodes, quite
>> often (in practice, always) hangs: no more output appears in md.log or
>> anywhere else, while mdrun still consumes CPU time. It takes a random
>> amount of time before it happens, typically 1-3 days.
>>
>> The case can be shared if someone else wants to investigate. I'm
>> planning to run it in the debugger so I can break and inspect the
>> state when it happens, but since it takes so long with the production
>> build, it is not something I'm looking forward to.
>>
>> On 09/08/2017 11:13 AM, Berk Hess wrote:
>>> Hi,
>>>
>>> We are far behind schedule for the 2017 release. We are working hard on
>>> it, but I don't think we can promise a date yet.
>>>
>>> We have a 2016.4 release planned for this week (might slip to next
>>> week). But if you can give us enough details to track down your hanging
>>> issue, we might be able to fix it in 2016.4.
>
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
-------------- next part --------------
Processes,Threads,Function
56,56,main (gmx.cpp:60)
56,56, gmx::CommandLineModuleManager::run (cmdlinemodulemanager.cpp:583)
56,56, gmx_mdrun (mdrun.cpp:549)
40,40, gmx::mdrunner (runner.cpp:1341)
14,14, gmx::do_md (md.cpp:1097)
14,14, do_force (sim_util.cpp:1975)
5,5, do_force_cutsVERLET (sim_util.cpp:1091)
5,5, dd_move_x (domdec.cpp:469)
5,5, dd_sendrecv_rvec (domdec_network.cpp:141)
5,5, PMPI_Sendrecv
5,5, mca_pml_ob1_send
5,5, sync_wait_mt
4,4, sched_yield (syscall-template.S:84)
1,1, opal_progress
1,1, opal_libevent2022_event_base_loop
1,1, mca_btl_tcp_component_recv_handler
1,1, mca_btl_tcp_proc_lookup
1,1, mca_btl_tcp_add_procs
1,1, mca_btl_tcp_proc_create
1,1, mca_btl_tcp_proc_destruct
1,1, __GI___pthread_mutex_lock (pthread_mutex_lock.c:80)
1,1, __lll_lock_wait (lowlevellock.S:135)
9,9, do_force_cutsVERLET (sim_util.cpp:1380)
9,9, dd_move_f (domdec.cpp:548)
9,9, dd_sendrecv_rvec (domdec_network.cpp:141)
9,9, PMPI_Sendrecv
5,5, mca_pml_ob1_send
5,5, sync_wait_mt
4,4, sched_yield (syscall-template.S:84)
1,1, opal_progress
1,1, opal_libevent2022_event_base_loop
1,1, mca_btl_tcp_component_recv_handler
1,1, mca_btl_tcp_proc_lookup
1,1, mca_btl_tcp_add_procs
1,1, mca_btl_tcp_proc_create
1,1, mca_btl_tcp_proc_destruct
1,1, __GI___pthread_mutex_lock (pthread_mutex_lock.c:80)
1,1, __lll_lock_wait (lowlevellock.S:135)
4,4, ompi_request_default_wait
4,4, sync_wait_mt
4,4, sched_yield (syscall-template.S:84)
26,26, gmx::do_md (md.cpp:1437)
26,26, update_constraints (update.cpp:1359)
16,16, constrain (constr.cpp:372)
16,16, dd_move_x_constraints (domdec_constraints.cpp:96)
16,16, dd_move_x_specat (domdec_specatomcomm.cpp:274)
16,16, dd_sendrecv2_rvec (domdec_network.cpp:205)
16,16, PMPI_Waitall
16,16, ompi_request_default_wait_all
16,16, sync_wait_mt
10,10, sched_yield (syscall-template.S:84)
6,6, opal_progress
5,5, opal_timer_base_get_usec_sys_timer
1,1, mca_btl_sm_component_progress
10,10, constrain (constr.cpp:391)
10,10, constrain_lincs (clincs.cpp:2423)
10,10, GOMP_parallel (parallel.c:168)
10,10, constrain_lincs._omp_fn.3 (clincs.cpp:2436)
10,10, do_lincs (clincs.cpp:1045)
10,10, dd_move_x_constraints (domdec_constraints.cpp:96)
10,10, dd_move_x_specat (domdec_specatomcomm.cpp:266)
10,10, dd_sendrecv2_rvec (domdec_network.cpp:205)
10,10, PMPI_Waitall
10,10, ompi_request_default_wait_all
10,10, sync_wait_mt
8,8, sched_yield (syscall-template.S:84)
2,2, opal_progress
2,2, opal_timer_base_get_usec_sys_timer
16,16, gmx::mdrunner (runner.cpp:1359)
8,8, gmx_pmeonly (pme-only.cpp:198)
8,8, gmx_pme_recv_coeffs_coords (pme-pp.cpp:480)
8,8, PMPI_Recv
8,8, mca_pml_ob1_recv
8,8, sync_wait_mt
6,6, sched_yield (syscall-template.S:84)
2,2, opal_progress
2,2, opal_timer_base_get_usec_sys_timer
2,2, gmx_pmeonly (pme-only.cpp:238)
2,2, gmx_pme_do (pme.cpp:1036)
2,2, do_redist_pos_coeffs (pme-redistribute.cpp:473)
2,2, dd_pmeredist_pos_coeffs (pme-redistribute.cpp:307)
2,2, pme_dd_sendrecv (pme-redistribute.cpp:243)
2,2, PMPI_Sendrecv
2,2, ompi_request_default_wait
2,2, sync_wait_mt
2,2, sched_yield (syscall-template.S:84)
6,6, gmx_pmeonly (pme-only.cpp:244)
6,6, gmx_pme_send_force_vir_ener (pme-pp.cpp:850)
6,6, PMPI_Waitall
6,6, ompi_request_default_wait_all
6,6, sync_wait_mt
4,4, sched_yield (syscall-template.S:84)
2,2, opal_progress
1,1, mca_btl_vader_component_progress
1,1, btl_openib_component_progress
1,1, __GI___pthread_mutex_lock (pthread_mutex_lock.c:80)
56,56,progress_engine
56,56, opal_libevent2022_event_base_loop
56,56, poll_dispatch
56,56, poll (syscall-template.S:84)
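To spot where ranks are stuck in a merged stack summary like the one above, it helps to pick out the leaf frames (deepest calls with no child below them) and sort them by process count. A minimal sketch, assuming the `Processes,Threads,Function` format with one space of indent per stack level; the parser and the sample data are illustrative, not a GROMACS tool:

```python
# Minimal sketch: aggregate a merged stack summary in the
# "procs,threads,<indented function>" format, where indent depth
# encodes stack nesting (hypothetical parser, not part of GROMACS).

def leaf_frames(lines):
    """Yield (procs, function) for frames with no deeper child frame."""
    parsed = []
    for line in lines:
        procs, _threads, rest = line.split(",", 2)
        indent = len(rest) - len(rest.lstrip(" "))
        parsed.append((int(procs), indent, rest.strip()))
    for i, (procs, indent, func) in enumerate(parsed):
        next_indent = parsed[i + 1][1] if i + 1 < len(parsed) else -1
        if next_indent <= indent:  # no deeper frame follows -> leaf
            yield procs, func

# Sample data in the same shape as the attached dump (abridged).
dump = [
    "56,56,main (gmx.cpp:60)",
    "56,56, gmx_mdrun (mdrun.cpp:549)",
    "40,40,  gmx::mdrunner (runner.cpp:1341)",
    "40,40,   PMPI_Sendrecv",
    "16,16,  gmx_pmeonly (pme-only.cpp:198)",
    "16,16,   PMPI_Recv",
]

if __name__ == "__main__":
    for procs, func in sorted(leaf_frames(dump), reverse=True):
        print(procs, func)
```

Run over the full attachment, this kind of summary makes it obvious that every rank is parked inside an MPI wait (Sendrecv, Waitall, or Recv), which is consistent with a communication deadlock rather than a livelock in compute code.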
More information about the gromacs.org_gmx-developers mailing list