[gmx-developers] threads are now ON by default

Berk Hess hess at cbr.su.se
Mon Feb 15 16:35:58 CET 2010


I think I have fixed all the deadlocks that occurred at checkpointing, and
at termination triggered by sending TERM or USR1 to one of the mdrun
processes.

Berk
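A minimal sketch of the general pattern behind such fixes, assuming POSIX threads. The helper names (install_handlers, agree_or, run_step) are hypothetical and are not GROMACS functions. The signal handler only records the request. All threads then combine their view of the flag at the same step before deciding, so either every thread enters the checkpoint/termination collective or none does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Set from the signal handler; volatile sig_atomic_t is the portable type for this. */
static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int sig)
{
    (void)sig;
    stop_requested = 1; /* only record the request; act on it at an agreed step */
}

/* Route TERM and USR1 to on_signal (hypothetical helper). */
void install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGUSR1, &sa, NULL);
}

/* Hypothetical all-threads OR-reduction built on a mutex and condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int nthreads, arrived, generation, accum, result;
} agree_t;

static int agree_or(agree_t *a, int local_flag)
{
    pthread_mutex_lock(&a->lock);
    int gen = a->generation;
    a->accum |= local_flag;
    if (++a->arrived == a->nthreads) {
        /* Last thread to arrive publishes the combined flag and wakes the rest. */
        a->result = a->accum;
        a->accum = 0;
        a->arrived = 0;
        a->generation++;
        pthread_cond_broadcast(&a->cond);
    } else {
        while (gen == a->generation)
            pthread_cond_wait(&a->cond, &a->lock);
    }
    int r = a->result;
    pthread_mutex_unlock(&a->lock);
    return r;
}

typedef struct {
    agree_t *agree;
    int sees_flag;  /* models a thread that sampled the flag after the signal */
    int decision;   /* 1 = enter the checkpoint collective at this step */
} worker_t;

static void *worker(void *p)
{
    worker_t *w = p;
    int local = w->sees_flag ? (int)stop_requested : 0;
    /* Deciding on `local` alone would let threads disagree;
       agree_or makes every thread return the same value. */
    w->decision = agree_or(w->agree, local);
    return NULL;
}

/* One step with nthreads threads, of which only observer_rank sees the flag.
   Returns how many threads decided to checkpoint (0 or nthreads). */
int run_step(int nthreads, int observer_rank)
{
    enum { MAXT = 64 };
    pthread_t tid[MAXT];
    worker_t w[MAXT];
    agree_t a;
    assert(nthreads > 0 && nthreads <= MAXT);
    memset(&a, 0, sizeof a);
    a.nthreads = nthreads;
    pthread_mutex_init(&a.lock, NULL);
    pthread_cond_init(&a.cond, NULL);
    for (int i = 0; i < nthreads; i++) {
        w[i] = (worker_t){ &a, i == observer_rank, 0 };
        pthread_create(&tid[i], NULL, worker, &w[i]);
    }
    int total = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].decision;
    }
    pthread_cond_destroy(&a.cond);
    pthread_mutex_destroy(&a.lock);
    return total;
}
```

Before any signal, run_step(4, 2) returns 0. After install_handlers() and raise(SIGUSR1), it returns 4. Without the agree_or step, only one of the four threads would have decided to checkpoint.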

Sander Pronk wrote:
> I've looked at it, and there were two different problems: first, my implementation of MPI_Waitall() had a potential deadlock when called with 0 MPI_Requests (I've fixed that); second, there's a deadlock in src/kernel/md.c that involves the broadcasting of when to checkpoint.
> I've talked to Berk, and he's working on fixing that.
>
> Sander
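A minimal sketch of a wait-all that cannot block on an empty request list. It assumes a hypothetical request_t whose completion flag is set by another thread, and it is not the thread_mpi implementation. The count == 0 check happens before any waiting, so a call with no requests returns immediately. MPI permits a count of 0 here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical request: `done` is set to 1 by the thread that completes it. */
typedef struct {
    atomic_int done;
} request_t;

enum { WAIT_OK = 0, WAIT_ERR_ARG = 1 };

/* Wait until all `count` requests have completed. */
int waitall(int count, request_t *reqs)
{
    if (count < 0)
        return WAIT_ERR_ARG;
    /* Check before waiting: a loop shaped like
       do { wait_for_progress(); } while (remaining > 0);
       (wait_for_progress being some blocking wait) would block forever in its
       first iteration when there is nothing to wait for. */
    if (count == 0)
        return WAIT_OK;
    assert(reqs != NULL);
    for (int i = 0; i < count; i++) {
        /* Yield while spinning so the completing thread gets CPU time. */
        while (!atomic_load(&reqs[i].done))
            sched_yield();
    }
    return WAIT_OK;
}
```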
>
>
>
> On 13 Feb 2010, at 14:08, Alexey Shvetsov wrote:
>
>   
>> Hi,
>> Yes, it's in the same place. It looks like there is some kind of race condition.
>> Input files are available via ftp://alexxy.gentoo.ru/pub/gmx/
>>
>> speptide.tar.bz2 contains the finished run with md (leap-frog)
>> speptide-vv.tar.bz2 contains the crashed run with md-vv
>> I get a deadlock near step 14k of the md run without restraints
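The backtrace further down shows a thread spinning in sched_yield() inside tMPI_Gather (TMPI_YIELD_WAIT in gather.c), called from dd_collect_state while the checkpoint is being written. The following is a minimal sketch of a yield-wait gather. It assumes C11 atomics and POSIX threads, and gather_t, gather and run_gather are hypothetical names, not thread_mpi's code. The sketch shows why such a gather can hang: the root can only leave its loop once every rank has contributed, so a rank that does not enter the collective at the same step leaves the root spinning forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

enum { MAXRANKS = 16 };

/* Hypothetical shared state for one gather (not thread_mpi's data layout). */
typedef struct {
    int nranks;
    atomic_int arrived;      /* number of ranks that have deposited a value */
    int values[MAXRANKS];
} gather_t;

/* Each rank deposits its value; the root then yield-waits until all nranks
   values are present. If some rank never calls gather(), the root never
   leaves the while loop, so a debugger typically finds it in sched_yield(). */
static void gather(gather_t *g, int rank, int root, int value, int *out)
{
    g->values[rank] = value;
    /* seq_cst RMW: the root sees values[rank] once it sees the new count */
    atomic_fetch_add(&g->arrived, 1);
    if (rank == root) {
        while (atomic_load(&g->arrived) < g->nranks)
            sched_yield();
        for (int i = 0; i < g->nranks; i++)
            out[i] = g->values[i];
    }
}

typedef struct {
    gather_t *g;
    int rank;
    int *out;
} rank_arg_t;

static void *rank_main(void *p)
{
    rank_arg_t *a = p;
    gather(a->g, a->rank, 0, 10 * a->rank, a->out);
    return NULL;
}

/* One gather over nranks threads with root 0; returns the sum seen at the root. */
int run_gather(int nranks)
{
    pthread_t tid[MAXRANKS];
    rank_arg_t args[MAXRANKS];
    int out[MAXRANKS] = {0};
    gather_t g;
    assert(nranks > 0 && nranks <= MAXRANKS);
    g.nranks = nranks;
    atomic_init(&g.arrived, 0);
    for (int i = 0; i < nranks; i++) {
        args[i] = (rank_arg_t){ &g, i, out };
        pthread_create(&tid[i], NULL, rank_main, &args[i]);
    }
    for (int i = 0; i < nranks; i++)
        pthread_join(tid[i], NULL);
    int sum = 0;
    for (int i = 0; i < nranks; i++)
        sum += out[i];
    return sum;
}
```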
>>
>> On Saturday 13 February 2010 00:23:51, Michael Shirts wrote:
>>     
>>> Hi, Alexey-
>>>
>>> Thanks for tracking this down.  md-vv is still getting the kinks
>>> worked out.  Is this in the same place as the bug you were seeing a
>>> couple of days ago, or a different place?
>>>
>>> Sander, perhaps you could check for non-thread-safety (it looks like
>>> it's in write_traj), since you're a bit more familiar with it. If you
>>> can't see it quickly, please let me know, and I'll try to track it down!
>>>
>>> Best,
>>> Michael
>>>
>>>       
>>>> Date: Fri, 12 Feb 2010 23:21:03 +0300
>>>> From: Alexey Shvetsov <alexxyum at gmail.com>
>>>> Subject: Re: [gmx-developers] threads are now ON by default
>>>> To: Discussion list for GROMACS development
>>>>       <gmx-developers at gromacs.org>
>>>>
>>>> On Friday 12 February 2010 19:32:46, Sander Pronk wrote:
>>>>         
>>>>> Now that the last issues with the threading code have been resolved,
>>>>> thread-based parallelization has been turned on by default. To disable
>>>>> all the threading code, pass
>>>>>
>>>>> --disable-threads
>>>>>
>>>>> to configure, or turn the option GMX_THREADS off with ccmake.
>>>>>
>>>>> Running mdrun with just one thread (the default) is almost exactly the
>>>>> same as running it without threading code: the only thing that's
>>>>> different is that the few remaining global variables are protected by
>>>>> mutexes.
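A minimal sketch of what "protected by mutexes" means for one of the remaining globals, assuming POSIX threads. global_counter and next_id are illustrative names, not actual GROMACS globals. Every access takes the lock. With a single thread the lock is never contended and the results are identical to the unprotected version.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical remaining global, shared by all threads in the process. */
static int global_counter = 0;
static pthread_mutex_t global_counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Every access goes through the mutex. With one thread the lock is never
   contended, so the results are the same as without it. */
int next_id(void)
{
    pthread_mutex_lock(&global_counter_lock);
    int id = global_counter++;
    pthread_mutex_unlock(&global_counter_lock);
    return id;
}

static void *take_ids(void *arg)
{
    int n = *(const int *)arg;
    for (int i = 0; i < n; i++)
        next_id();
    return NULL;
}

/* Start nthreads threads that each take per_thread ids, then return the next id. */
int run_threads(int nthreads, int per_thread)
{
    enum { MAXT = 16 };
    pthread_t tid[MAXT];
    assert(nthreads > 0 && nthreads <= MAXT);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, take_ids, &per_thread);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return next_id();
}
```

run_threads(4, 10000) returns 40000 because the increments never race. Without the mutex, the final count could come out lower.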
>>>>>
>>>>> Performance-wise, mdrun runs very slightly faster with threads than with
>>>>> OpenMPI when Nthreads<=Ncores (and there are no other processes on the
>>>>> computer). When Nthreads>Ncores (or other processes are running), the
>>>>> thread code is much faster than OpenMPI, but the total runtime is still
>>>>> longer than when Nthreads==Ncores.
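Because the comparison above favours Nthreads <= Ncores, here is a small sketch of clamping a requested thread count to the number of online cores. It assumes sysconf(_SC_NPROCESSORS_ONLN) is available, as on Linux and most Unix-like systems. online_cores and choose_nthreads are hypothetical helpers, not how mdrun chooses its thread count, and they do not account for other processes that are already running.

```c
#include <assert.h>
#include <unistd.h>

/* Number of online cores, or 1 if the system cannot report it. */
int online_cores(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n > 0)
        return (int)n;
#endif
    return 1;
}

/* Keep Nthreads <= Ncores. requested <= 0 means "one thread per core"
   (a hypothetical convention for this sketch). */
int choose_nthreads(int requested, int ncores)
{
    assert(ncores > 0);
    if (requested <= 0 || requested > ncores)
        return ncores;
    return requested;
}
```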
>>>>>
>>>>> If there are any problems in getting things running, or with performance,
>>>>> I'd very much like to hear about it.
>>>>>
>>>>> Sander
>>>>>           
>>>> Good news.
>>>> But it looks like I get a deadlock when running GROMACS with 4 threads
>>>> (Intel Core i5 750) with the md-vv integrator. I can share all input files.
>>>> The same system with almost the same input parameters, except for the
>>>> integrator (md), runs fine with 4 threads. Backtrace:
>>>> 0x00002b9c9fd9dd87 in sched_yield () at ../sysdeps/unix/syscall-template.S:82
>>>> 82      ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>         in ../sysdeps/unix/syscall-template.S
>>>> (gdb) bt
>>>> #0  0x00002b9c9fd9dd87 in sched_yield () at ../sysdeps/unix/syscall-template.S:82
>>>> #1  0x00002b9c9f5aeb5d in tMPI_Gather (sendbuf=0x4, sendcount=<value optimized out>,
>>>>     sendtype=<value optimized out>, recvbuf=<value optimized out>,
>>>>     recvcount=<value optimized out>, recvtype=<value optimized out>, root=0, comm=0xee2160)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gmxlib/thread_mpi/gather.c:98
>>>> #2  0x00002b9c9f17885e in dd_gather (dd=<value optimized out>, nbytes=1607464840,
>>>>     src=0x1290748, dest=0xffffffffffffffff)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec_network.c:233
>>>> #3  0x00002b9c9f16f841 in dd_collect_cg (dd=<value optimized out>, state_local=0x12d3600,
>>>>     lv=<value optimized out>, v=0x2b9cb4001010)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec.c:1263
>>>> #4  dd_collect_vec (dd=<value optimized out>, state_local=0x12d3600,
>>>>     lv=<value optimized out>, v=0x2b9cb4001010)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec.c:1420
>>>> #5  0x00002b9c9f170df9 in dd_collect_state (dd=0x1287df0, state_local=0x12d3600,
>>>>     state=0x2b9ca83782b0)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec.c:1474
>>>> #6  0x00002b9c9f1d0b7b in write_traj (fplog=<value optimized out>, cr=0x2b9ca8377ad0,
>>>>     fp_trn=<value optimized out>, bX=-1, bV=8, bF=0, fp_xtc=-1, bXTC=0, xtc_prec=1000,
>>>>     fn_cpt=0xee2490 "speptide.md.cpt", bCPT=1, top_global=0x2b9ca83780b0, eIntegrator=10,
>>>>     simulation_part=1, step=14690, t=29.379999999999999, state_local=0x12d3600,
>>>>     state_global=0x2b9ca83782b0, f_local=0x12d8b00, f_global=0x2b9cb4c02010,
>>>>     n_xtc=0x7fff5fd02398, x_xtc=0x7fff5fd02320)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/stat.c:473
>>>> #7  0x0000000000414318 in do_md (fplog=<value optimized out>, cr=<value optimized out>,
>>>>     nfile=<value optimized out>, fnm=<value optimized out>, oenv=<value optimized out>,
>>>>     bVerbose=<value optimized out>, bCompact=1, nstglobalcomm=1, vsite=0x0,
>>>>     constr=0x129b5d0, stepout=100, ir=0x2b9ca8377b40, top_global=0x2b9ca83780b0,
>>>>     fcd=0x1287c60, state_global=0x2b9ca83782b0, mdatoms=0x1296de0, nrnb=0x1290b90,
>>>>     wcycle=0x1290800, ed=0x0, fr=0x1291200, repl_ex_nst=0, repl_ex_seed=-1,
>>>>     cpt_period=<value optimized out>, max_hours=<value optimized out>, Flags=7168,
>>>>     runtime=0x7fff5fd02710)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/md.c:1943
>>>> #8  0x000000000040f155 in mdrunner (fplog=0xee1c50, cr=0x2b9ca8377ad0,
>>>>     nfile=<value optimized out>, fnm=<value optimized out>, oenv=<value optimized out>,
>>>>     bVerbose=<value optimized out>, bCompact=1, nstglobalcomm=-1, ddxyz=0x7fff5fd02844,
>>>>     dd_node_order=1, rdd=<value optimized out>, rconstr=<value optimized out>,
>>>>     dddlb_opt=0x41f64d "auto", dlb_scale=<value optimized out>, ddcsx=0x0, ddcsy=0x0,
>>>>     ddcsz=0x0, nstepout=100, resetstep=-1, nmultisim=0, repl_ex_nst=0, repl_ex_seed=-1,
>>>>     pforce=<value optimized out>, cpt_period=<value optimized out>,
>>>>     max_hours=<value optimized out>, Flags=7168)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/runner.c:669
>>>> #9  0x0000000000410168 in mdrunner_start_fn (arg=<value optimized out>)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/runner.c:170
>>>> #10 0x00002b9c9f5af8d4 in tMPI_Thread_starter (arg=<value optimized out>)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gmxlib/thread_mpi/tmpi_init.c:360
>>>> #11 0x00002b9c9f5afc04 in tMPI_Init_fn (N=19466056, start_function=<value optimized out>,
>>>>     arg=<value optimized out>)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gmxlib/thread_mpi/tmpi_init.c:472
>>>> #12 0x000000000040ff19 in mdrunner_threads (nthreads=4, fplog=<value optimized out>,
>>>>     cr=<value optimized out>, nfile=<value optimized out>, fnm=<value optimized out>,
>>>>     oenv=<value optimized out>, bVerbose=1, bCompact=1, nstglobalcomm=-1,
>>>>     ddxyz=0x7fff5fd04b10, dd_node_order=1, rdd=<value optimized out>,
>>>>     rconstr=<value optimized out>, dddlb_opt=0x41f64d "auto", dlb_scale=<value optimized out>,
>>>>     ddcsx=0x0, ddcsy=0x0, ddcsz=0x0, nstepout=100, resetstep=-1, nmultisim=0,
>>>>     repl_ex_nst=0, repl_ex_seed=-1, pforce=<value optimized out>,
>>>>     cpt_period=<value optimized out>, max_hours=<value optimized out>, Flags=7168)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/runner.c:238
>>>> #13 0x0000000000419a1b in main (argc=6, argv=0x7fff5fd04ce8)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/mdrun.c:519
>>>> Current language:  auto
>>>> The current source language is "auto; currently asm".
>>>> (gdb) up
>>>> #1  0x00002b9c9f5aeb5d in tMPI_Gather (sendbuf=0x4, sendcount=<value optimized out>,
>>>>     sendtype=<value optimized out>, recvbuf=<value optimized out>,
>>>>     recvcount=<value optimized out>, recvtype=<value optimized out>, root=0, comm=0xee2160)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gmxlib/thread_mpi/gather.c:98
>>>> 98                      TMPI_YIELD_WAIT(cur);
>>>> Current language:  auto
>>>> The current source language is "auto; currently c".
>>>> (gdb) up
>>>> #2  0x00002b9c9f17885e in dd_gather (dd=<value optimized out>, nbytes=1607464840,
>>>>     src=0x1290748, dest=0xffffffffffffffff)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec_network.c:233
>>>> 233         MPI_Gather(src,nbytes,MPI_BYTE,
>>>> (gdb) up
>>>> #3  0x00002b9c9f16f841 in dd_collect_cg (dd=<value optimized out>, state_local=0x12d3600,
>>>>     lv=<value optimized out>, v=0x2b9cb4001010)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec.c:1263
>>>> 1263        dd_gather(dd,2*sizeof(int),buf2,ibuf);
>>>> (gdb) up
>>>> #4  dd_collect_vec (dd=<value optimized out>, state_local=0x12d3600,
>>>>     lv=<value optimized out>, v=0x2b9cb4001010)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec.c:1420
>>>> 1420        dd_collect_cg(dd,state_local);
>>>> (gdb) up
>>>> #5  0x00002b9c9f170df9 in dd_collect_state (dd=0x1287df0, state_local=0x12d3600,
>>>>     state=0x2b9ca83782b0)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/domdec.c:1474
>>>> 1474                dd_collect_vec(dd,state_local,state_local->x,state->x);
>>>> (gdb) up
>>>> #6  0x00002b9c9f1d0b7b in write_traj (fplog=<value optimized out>, cr=0x2b9ca8377ad0,
>>>>     fp_trn=<value optimized out>, bX=-1, bV=8, bF=0, fp_xtc=-1, bXTC=0, xtc_prec=1000,
>>>>     fn_cpt=0xee2490 "speptide.md.cpt", bCPT=1, top_global=0x2b9ca83780b0, eIntegrator=10,
>>>>     simulation_part=1, step=14690, t=29.379999999999999, state_local=0x12d3600,
>>>>     state_global=0x2b9ca83782b0, f_local=0x12d8b00, f_global=0x2b9cb4c02010,
>>>>     n_xtc=0x7fff5fd02398, x_xtc=0x7fff5fd02320)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/mdlib/stat.c:473
>>>> 473                 dd_collect_state(cr->dd,state_local,state_global);
>>>> (gdb) up
>>>> #7  0x0000000000414318 in do_md (fplog=<value optimized out>, cr=<value optimized out>,
>>>>     nfile=<value optimized out>, fnm=<value optimized out>, oenv=<value optimized out>,
>>>>     bVerbose=<value optimized out>, bCompact=1, nstglobalcomm=1, vsite=0x0,
>>>>     constr=0x129b5d0, stepout=100, ir=0x2b9ca8377b40, top_global=0x2b9ca83780b0,
>>>>     fcd=0x1287c60, state_global=0x2b9ca83782b0, mdatoms=0x1296de0, nrnb=0x1290b90,
>>>>     wcycle=0x1290800, ed=0x0, fr=0x1291200, repl_ex_nst=0, repl_ex_seed=-1,
>>>>     cpt_period=<value optimized out>, max_hours=<value optimized out>, Flags=7168,
>>>>     runtime=0x7fff5fd02710)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/md.c:1943
>>>> 1943                write_traj(fplog,cr,fp_trn,bX,bV,bF,fp_xtc,bXTC,ir->xtcprec,
>>>> (gdb) up
>>>> #8  0x000000000040f155 in mdrunner (fplog=0xee1c50, cr=0x2b9ca8377ad0,
>>>>     nfile=<value optimized out>, fnm=<value optimized out>, oenv=<value optimized out>,
>>>>     bVerbose=<value optimized out>, bCompact=1, nstglobalcomm=-1, ddxyz=0x7fff5fd02844,
>>>>     dd_node_order=1, rdd=<value optimized out>, rconstr=<value optimized out>,
>>>>     dddlb_opt=0x41f64d "auto", dlb_scale=<value optimized out>, ddcsx=0x0, ddcsy=0x0,
>>>>     ddcsz=0x0, nstepout=100, resetstep=-1, nmultisim=0, repl_ex_nst=0, repl_ex_seed=-1,
>>>>     pforce=<value optimized out>, cpt_period=<value optimized out>,
>>>>     max_hours=<value optimized out>, Flags=7168)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/runner.c:669
>>>> 669             integrator[inputrec->eI].func(fplog,cr,nfile,fnm,
>>>> (gdb) up
>>>> #9  0x0000000000410168 in mdrunner_start_fn (arg=<value optimized out>)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/runner.c:170
>>>> 170         mda->ret=mdrunner(fplog, cr, mc.nfile, mc.fnm, mc.oenv, mc.bVerbose,
>>>> (gdb) up
>>>> #10 0x00002b9c9f5af8d4 in tMPI_Thread_starter (arg=<value optimized out>)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gmxlib/thread_mpi/tmpi_init.c:360
>>>> 360             th->start_fn(th->start_arg);
>>>> (gdb) up
>>>> #11 0x00002b9c9f5afc04 in tMPI_Init_fn (N=19466056, start_function=<value optimized out>,
>>>>     arg=<value optimized out>)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gmxlib/thread_mpi/tmpi_init.c:472
>>>> 472             tMPI_Start_threads(N, 0, 0, start_function, arg);
>>>> (gdb) up
>>>> #12 0x000000000040ff19 in mdrunner_threads (nthreads=4, fplog=<value optimized out>,
>>>>     cr=<value optimized out>, nfile=<value optimized out>, fnm=<value optimized out>,
>>>>     oenv=<value optimized out>, bVerbose=1, bCompact=1, nstglobalcomm=-1,
>>>>     ddxyz=0x7fff5fd04b10, dd_node_order=1, rdd=<value optimized out>,
>>>>     rconstr=<value optimized out>, dddlb_opt=0x41f64d "auto", dlb_scale=<value optimized out>,
>>>>     ddcsx=0x0, ddcsy=0x0, ddcsz=0x0, nstepout=100, resetstep=-1, nmultisim=0,
>>>>     repl_ex_nst=0, repl_ex_seed=-1, pforce=<value optimized out>,
>>>>     cpt_period=<value optimized out>, max_hours=<value optimized out>, Flags=7168)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/runner.c:238
>>>> 238             tMPI_Init_fn(nthreads, mdrunner_start_fn, (void*)(&mda));
>>>> (gdb) up
>>>> #13 0x0000000000419a1b in main (argc=6, argv=0x7fff5fd04ce8)
>>>>     at /var/tmp/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/kernel/mdrun.c:519
>>>> 519       rc = mdrunner_threads(nthreads,
>>>> (gdb) up
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Alexey 'Alexxy' Shvetsov
>>>> Petersburg Nuclear Physics Institute, Russia
>>>> Department of Molecular and Radiation Biophysics
>>>> Gentoo Team Ru
>>>> Gentoo Linux Dev
>>>> mailto:alexxyum at gmail.com
>>>> mailto:alexxy at gentoo.org
>>>> mailto:alexxy at omrb.pnpi.spb.ru
>>>>
>>>> ------------------------------
>>>>
>>>> --
>>>> gmx-developers mailing list
>>>> gmx-developers at gromacs.org
>>>> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>>>>
>>>>
>>>> End of gmx-developers Digest, Vol 70, Issue 8
>>>> *********************************************
>>>>         
>> -- 
>> Best Regards,
>> Alexey 'Alexxy' Shvetsov
>> Petersburg Nuclear Physics Institute, Russia
>> Department of Molecular and Radiation Biophysics
>> Gentoo Team Ru
>> Gentoo Linux Dev
>> mailto:alexxyum at gmail.com
>> mailto:alexxy at gentoo.org
>> mailto:alexxy at omrb.pnpi.spb.ru
>>     
>
>   



