[gmx-users] Gromacs 4.6 segmentation fault with mdrun
Raf Ponsaerts
raf.ponsaerts at med.kuleuven.be
Sat Nov 24 05:38:46 CET 2012
Hi Szilárd and Roland,
Thanks for the clear explanation!
I will compile release-4-6 (instead of the nbnxn_hybrid_acc branch) and
do some further testing in a few weeks, since I'm currently using the
machine for production runs with gmx-4.5.5.
Thanks for your time and effort!
regards,
raf
On Thu, 2012-11-22 at 00:23 +0100, Szilárd Páll wrote:
> Roland,
>
>
> He explicitly stated that he is using 20da718, which is also from the
> nbnxn_hybrid_acc branch.
>
>
> Raf, as Roland said, get release-4-6 and try again!
>
>
>
>
> There's an important thing to mention: your hardware configuration is
> probably quite imbalanced, and the default settings are certainly not
> the best to run with: two MPI processes/threads with 24 OpenMP threads
> + a GPU each. GROMACS works best with a balanced hardware
> configuration, and yours is not balanced: the GPUs will not be able to
> keep up with 48 CPU cores.
>
>
> Regarding the run configuration, most importantly: in most cases you
> should avoid running a group of OpenMP threads across sockets (except
> on Intel, with <=12-16 threads). On these Opterons it is recommended
> to run OpenMP on at most half a CPU (the CPUs are in reality two CPU
> dies bolted together), and in fact you might be better off with even
> fewer threads per MPI process/thread. This means that multiple
> processes will have to share a GPU, which is not optimal and works
> only with MPI in the current version.
>
>
> So to conclude, to get the best performance you should try a few
> combinations:
>
>
> # processes 0,1 will use GPU0, processes 2,3 GPU1
>
> # this avoids running across sockets, but for the aforementioned
> # reasons it will still be suboptimal
mpirun -np 4 mdrun_mpi -gpu_id 0011
>
>
> # processes 0,1,2,3 will use GPU0, processes 4,5,6,7 GPU1
>
> # this config will probably still be slower than the next one
> mpirun -np 8 mdrun_mpi -gpu_id 00001111
>
>
> # processes 0-7 will use GPU0, processes 8-15 GPU1
>
> # even this config will probably still be slower than the higher
> # process counts suggested below
> mpirun -np 16 mdrun_mpi -gpu_id 0000000011111111
>
>
> You should go ahead and try with 32 and 64 processes as well; I
> suspect that 2 or 3 threads/process will be the fastest. Depending on
> what system you are simulating, this could lead to load imbalance, but
> you'll have to see.
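>
>
> For completeness, a sketch of the 32-process case along the same
> lines; the explicit -ntomp 2 (2 OpenMP threads per process) is just a
> suggestion and untested on your machine:
>
> mpirun -np 32 mdrun_mpi -ntomp 2 -gpu_id 00000000000000001111111111111111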
>
>
> If it turns out that the "Wait for GPU" time is more than a few
> percent (which will probably be the case), it means that a GTX 580 is
> not fast enough to keep up with two of these Opterons. What you can
> try then is running in the "hybrid" mode with "-nb gpu_cpu", which
> might help.
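>
>
> For example, a sketch combining the hybrid mode with the 4-process
> layout from above:
>
> mpirun -np 4 mdrun_mpi -nb gpu_cpu -gpu_id 0011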
>
>
>
> Cheers,
>
> --
> Szilárd
>
>
> On Sat, Nov 17, 2012 at 3:11 AM, Roland Schulz <roland at utk.edu> wrote:
> Hi Raf,
>
> which version of Gromacs did you use? If you used branch
> nbnxn_hybrid_acc, please use branch release-4-6 instead and see
> whether that fixes your issue. If not, please open a bug and upload
> your log file and your tpr.
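>
> A minimal sketch of the steps, assuming your checkout tracks the main
> GROMACS repository:
>
> git fetch
> git checkout release-4-6
> # then re-run cmake and make as before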
>
> Roland
>
>
> On Thu, Nov 15, 2012 at 5:13 PM, Raf Ponsaerts
> <raf.ponsaerts at med.kuleuven.be> wrote:
>
> > Hi Szilárd,
> >
> > I assume I get the same segmentation fault error as Sebastian
> > (don't shoot if not so). I have 2 NVIDIA GTX 580 cards (and a
> > 4x12-core amd64 Opteron 6174 machine).
> >
> > In brief:
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0x7fffc07f8700 (LWP 32035)]
> > 0x00007ffff61de301 in nbnxn_make_pairlist.omp_fn.2 ()
> > from /usr/local/gromacs/bin/../lib/libmd.so.6
> >
> > Also -nb cpu with the Verlet cutoff-scheme results in this error...
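> >
> > For reference, the relevant .mdp line behind this tpr (the scheme is
> > chosen there, not per kernel):
> >
> > cutoff-scheme = Verlet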
> >
> > gcc 4.4.5 (Debian 4.4.5-8), Linux kernel 3.1.1
> > CMake 2.8.7
> >
> > If I attach the mdrun.debug output file to this mail, the mail to
> > the list gets bounced by the mailserver (because mdrun.debug > 50 Kb).
> >
> > Hoping this might help,
> >
> > regards,
> >
> > raf
> > ===========
> > compiled code:
> > commit 20da7188b18722adcd53088ec30e5f256af62f20
> > Author: Szilard Pall <pszilard at cbr.su.se>
> > Date: Tue Oct 2 00:29:33 2012 +0200
> >
> > ===========
> > (gdb) exec mdrun
> > (gdb) run -debug 1 -v -s test.tpr
> >
> > Reading file test.tpr, VERSION 4.6-dev-20121002-20da718 (single precision)
> > [New Thread 0x7ffff3844700 (LWP 31986)]
> > [Thread 0x7ffff3844700 (LWP 31986) exited]
> > [New Thread 0x7ffff3844700 (LWP 31987)]
> > [Thread 0x7ffff3844700 (LWP 31987) exited]
> > Changing nstlist from 10 to 50, rlist from 2 to 2.156
> >
> > Starting 2 tMPI threads
> > [New Thread 0x7ffff3844700 (LWP 31992)]
> > Using 2 MPI threads
>
> > Using 24 OpenMP threads per tMPI thread
> >
> > 2 GPUs detected:
> >   #0: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC: no, stat: compatible
> >   #1: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC: no, stat: compatible
> >
> > 2 GPUs auto-selected to be used for this run: #0, #1
> >
> >
> > Back Off! I just backed up ctab14.xvg to ./#ctab14.xvg.1#
> > Initialized GPU ID #1: GeForce GTX 580
> > [New Thread 0x7ffff3043700 (LWP 31993)]
> >
> > Back Off! I just backed up dtab14.xvg to ./#dtab14.xvg.1#
> >
> > Back Off! I just backed up rtab14.xvg to ./#rtab14.xvg.1#
> > [New Thread 0x7ffff1b3c700 (LWP 31995)]
> > [New Thread 0x7ffff133b700 (LWP 31996)]
> > [New Thread 0x7ffff0b3a700 (LWP 31997)]
> > [New Thread 0x7fffebfff700 (LWP 31998)]
> > [New Thread 0x7fffeb7fe700 (LWP 31999)]
> > [New Thread 0x7fffeaffd700 (LWP 32000)]
> > [New Thread 0x7fffea7fc700 (LWP 32001)]
> > [New Thread 0x7fffe9ffb700 (LWP 32002)]
> > [New Thread 0x7fffe97fa700 (LWP 32003)]
> > [New Thread 0x7fffe8ff9700 (LWP 32004)]
> > [New Thread 0x7fffe87f8700 (LWP 32005)]
> > [New Thread 0x7fffe7ff7700 (LWP 32006)]
> > [New Thread 0x7fffe77f6700 (LWP 32007)]
> > [New Thread 0x7fffe6ff5700 (LWP 32008)]
> > [New Thread 0x7fffe67f4700 (LWP 32009)]
> > [New Thread 0x7fffe5ff3700 (LWP 32010)]
> > [New Thread 0x7fffe57f2700 (LWP 32011)]
> > [New Thread 0x7fffe4ff1700 (LWP 32012)]
> > [New Thread 0x7fffe47f0700 (LWP 32013)]
> > [New Thread 0x7fffe3fef700 (LWP 32014)]
> > [New Thread 0x7fffe37ee700 (LWP 32015)]
> > [New Thread 0x7fffe2fed700 (LWP 32016)]
> > [New Thread 0x7fffe27ec700 (LWP 32017)]
> > Initialized GPU ID #0: GeForce GTX 580
>
> > Using CUDA 8x8x8 non-bonded kernels
>
> > [New Thread 0x7fffe1feb700 (LWP 32018)]
> > [New Thread 0x7fffe0ae4700 (LWP 32019)]
> > [New Thread 0x7fffcbfff700 (LWP 32020)]
> > [New Thread 0x7fffcb7fe700 (LWP 32021)]
> > [New Thread 0x7fffcaffd700 (LWP 32022)]
> > [New Thread 0x7fffca7fc700 (LWP 32023)]
> > [New Thread 0x7fffc9ffb700 (LWP 32024)]
> > [New Thread 0x7fffc97fa700 (LWP 32025)]
> > [New Thread 0x7fffc8ff9700 (LWP 32026)]
> > [New Thread 0x7fffc3fff700 (LWP 32027)]
> > [New Thread 0x7fffc37fe700 (LWP 32028)]
> > [New Thread 0x7fffc2ffd700 (LWP 32029)]
> > [New Thread 0x7fffc27fc700 (LWP 32031)]
> > [New Thread 0x7fffc1ffb700 (LWP 32032)]
> > [New Thread 0x7fffc17fa700 (LWP 32033)]
> > [New Thread 0x7fffc0ff9700 (LWP 32034)]
> > [New Thread 0x7fffc07f8700 (LWP 32035)]
> > [New Thread 0x7fffbfff7700 (LWP 32036)]
> > [New Thread 0x7fffbf7f6700 (LWP 32037)]
> > [New Thread 0x7fffbeff5700 (LWP 32038)]
> > [New Thread 0x7fffbe7f4700 (LWP 32039)]
> > [New Thread 0x7fffbdff3700 (LWP 32040)]
> > [New Thread 0x7fffbd7f2700 (LWP 32042)]
> > [New Thread 0x7fffbcff1700 (LWP 32043)]
> > Making 1D domain decomposition 2 x 1 x 1
> >
>
> > * WARNING * WARNING * WARNING * WARNING * WARNING * WARNING *
> > We have just committed the new CPU detection code in this branch,
> > and will commit new SSE/AVX kernels in a few days. However, this
> > means that currently only the NxN kernels are accelerated!
> >
> > In the mean time, you might want to avoid production runs in 4.6.
> >
> >
>
> > Back Off! I just backed up traj.trr to ./#traj.trr.1#
> >
> > Back Off! I just backed up traj.xtc to ./#traj.xtc.1#
> >
> > Back Off! I just backed up ener.edr to ./#ener.edr.1#
> > starting mdrun 'Protein in water'
>
> > 100000 steps, 200.0 ps.
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0x7fffc07f8700 (LWP 32035)]
> > 0x00007ffff61de301 in nbnxn_make_pairlist.omp_fn.2 ()
> > from /usr/local/gromacs/bin/../lib/libmd.so.6
> > (gdb)
> >
> > ============================================
> > Verlet, nb by cpu only:
> >
> > (gdb) run -debug 1 -nb cpu -v -s test.tpr
> >
> > Reading file test.tpr, VERSION 4.6-dev-20121002-20da718 (single precision)
> > [New Thread 0x7ffff3844700 (LWP 32050)]
> > [Thread 0x7ffff3844700 (LWP 32050) exited]
> > [New Thread 0x7ffff3844700 (LWP 32051)]
> > [Thread 0x7ffff3844700 (LWP 32051) exited]
> > Starting 48 tMPI threads
> > [New Thread 0x7ffff3844700 (LWP 32058)]
> > [New Thread 0x7ffff3043700 (LWP 32059)]
> > [New Thread 0x7ffff2842700 (LWP 32060)]
> > [New Thread 0x7ffff2041700 (LWP 32061)]
> > [New Thread 0x7ffff1840700 (LWP 32062)]
> > [New Thread 0x7ffff103f700 (LWP 32063)]
> > [New Thread 0x7ffff083e700 (LWP 32064)]
> > [New Thread 0x7fffe3fff700 (LWP 32065)]
> > [New Thread 0x7fffe37fe700 (LWP 32066)]
> > [New Thread 0x7fffe2ffd700 (LWP 32067)]
> > [New Thread 0x7fffe27fc700 (LWP 32068)]
> > [New Thread 0x7fffe1ffb700 (LWP 32069)]
> > [New Thread 0x7fffe17fa700 (LWP 32070)]
> > [New Thread 0x7fffe0ff9700 (LWP 32071)]
> > [New Thread 0x7fffdbfff700 (LWP 32072)]
> > [New Thread 0x7fffdb7fe700 (LWP 32073)]
> > [New Thread 0x7fffdaffd700 (LWP 32074)]
> > [New Thread 0x7fffda7fc700 (LWP 32075)]
> > [New Thread 0x7fffd9ffb700 (LWP 32076)]
> > [New Thread 0x7fffd97fa700 (LWP 32077)]
> > [New Thread 0x7fffd8ff9700 (LWP 32078)]
> > [New Thread 0x7fffd3fff700 (LWP 32079)]
> > [New Thread 0x7fffd37fe700 (LWP 32080)]
> > [New Thread 0x7fffd2ffd700 (LWP 32081)]
> > [New Thread 0x7fffd27fc700 (LWP 32082)]
> > [New Thread 0x7fffd1ffb700 (LWP 32083)]
> > [New Thread 0x7fffd17fa700 (LWP 32084)]
> > [New Thread 0x7fffd0ff9700 (LWP 32085)]
> > [New Thread 0x7fffd07f8700 (LWP 32086)]
> > [New Thread 0x7fffcfff7700 (LWP 32087)]
> > [New Thread 0x7fffcf7f6700 (LWP 32088)]
> > [New Thread 0x7fffceff5700 (LWP 32089)]
> > [New Thread 0x7fffce7f4700 (LWP 32090)]
> > [New Thread 0x7fffcdff3700 (LWP 32091)]
> > [New Thread 0x7fffcd7f2700 (LWP 32092)]
> > [New Thread 0x7fffccff1700 (LWP 32093)]
> > [New Thread 0x7fffcc7f0700 (LWP 32094)]
> > [New Thread 0x7fffcbfef700 (LWP 32095)]
> > [New Thread 0x7fffcb7ee700 (LWP 32096)]
> > [New Thread 0x7fffcafed700 (LWP 32097)]
> > [New Thread 0x7fffca7ec700 (LWP 32098)]
> > [New Thread 0x7fffc9feb700 (LWP 32099)]
> > [New Thread 0x7fffc97ea700 (LWP 32100)]
> > [New Thread 0x7fffc8fe9700 (LWP 32101)]
> > [New Thread 0x7fffc87e8700 (LWP 32102)]
> > [New Thread 0x7fffc7fe7700 (LWP 32103)]
> > [New Thread 0x7fffc77e6700 (LWP 32104)]
> >
> > Will use 45 particle-particle and 3 PME only nodes
> > This is a guess, check the performance at the end of the log file
> > Using 48 MPI threads
>
> > Using 1 OpenMP thread per tMPI thread
> >
> > 2 GPUs detected:
> >   #0: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC: no, stat: compatible
> >   #1: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC: no, stat: compatible
> >
> >
> > Back Off! I just backed up ctab14.xvg to ./#ctab14.xvg.2#
> >
> > Back Off! I just backed up dtab14.xvg to ./#dtab14.xvg.2#
> >
> > Back Off! I just backed up rtab14.xvg to ./#rtab14.xvg.2#
> > Using SSE2 4x4 non-bonded kernels
> > Making 3D domain decomposition 3 x 5 x 3
> >
>
> > * WARNING * WARNING * WARNING * WARNING * WARNING * WARNING *
> > We have just committed the new CPU detection code in this branch,
> > and will commit new SSE/AVX kernels in a few days. However, this
> > means that currently only the NxN kernels are accelerated!
> >
> > In the mean time, you might want to avoid production runs in 4.6.
> >
> >
>
> > Back Off! I just backed up traj.trr to ./#traj.trr.2#
> >
> > Back Off! I just backed up traj.xtc to ./#traj.xtc.2#
> >
> > Back Off! I just backed up ener.edr to ./#ener.edr.2#
> > starting mdrun 'Protein in water'
>
> > 100000 steps, 200.0 ps.
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0x7fffcd7f2700 (LWP 32092)]
> > 0x00007ffff61db499 in nbnxn_make_pairlist.omp_fn.2 ()
> > from /usr/local/gromacs/bin/../lib/libmd.so.6
> > (gdb)
> > =============================================
> >
> >
> > On Mon, 2012-11-12 at 19:37 +0100, Szilárd Páll wrote:
>
> > > Hi Sebastian,
> > >
> > > That is very likely a bug, so I'd appreciate it if you could
> > > provide a bit more information, like:
> > >
> > > - OS, compiler
> > >
> > > - results of runs with the following configurations:
> > >   - "mdrun -nb cpu" (to run CPU-only with the Verlet scheme)
> > >   - "GMX_EMULATE_GPU=1 mdrun -nb gpu" (to run GPU emulation using
> > >     plain C kernels)
> > >   - "mdrun" without any arguments (which will use 2 x (n/2 cores
> > >     + 1 GPU))
> > >   - "mdrun -ntmpi 1" without any other arguments (which will use
> > >     n cores + the first GPU)
> > >
> > > - please attach the log files of all failed and a successful
> > >   run, as well as the mdrun.debug file from a failed run, which
> > >   you can obtain with "mdrun -debug 1"
> > >
> > > Note that a backtrace would be very useful and I'd be grateful if
> > > you can get one, but for now the above should be minimum effort;
> > > I'll provide proper instructions for getting a backtrace later (if
> > > needed).
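> > >
> > > In the meantime, a minimal sketch of what that looks like
> > > (assuming gdb is installed; "your.tpr" is a placeholder):
> > >
> > > $ gdb --args mdrun -s your.tpr
> > > (gdb) run
> > > ... wait for the SIGSEGV, then:
> > > (gdb) bt
> > >
> > > and paste the output of "bt".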
> > >
> > > Thanks,
> > >
> > > --
> > > Szilárd
> > >
> > >
> > > On Mon, Nov 12, 2012 at 6:22 PM, sebastian
> > > <sebastian.waltz at physik.uni-freiburg.de> wrote:
> > >
> > > > On 11/12/2012 04:12 PM, sebastian wrote:
> > > > > Dear GROMACS user,
> > > > >
> > > > > I am running into major problems trying to use gromacs 4.6
> > > > > on my desktop with two GTX 670 GPUs and one i7 CPU. On the
> > > > > system I installed CUDA 4.2, which runs fine for many
> > > > > different test programs.
> > > > > Compiling the git version of gromacs 4.6 with hybrid
> > > > > acceleration I get one error message about a missing libxml2,
> > > > > but it compiles with no further complaints. The tools I
> > > > > tested (like g_rdf or grompp etc.) work fine as long as I
> > > > > generate the tpr files with the right gromacs version.
> > > > > Now, if I try to use mdrun (GMX_GPU_ID=1 mdrun -nt 1 -v
> > > > > -deffnm ....) the preparation seems to work fine until it
> > > > > starts the actual run. It stops with a segmentation fault:
> > > > >
> > > > > Reading file pdz_cis_ex_200ns_test.tpr, VERSION
> > > > > 4.6-dev-20121002-20da718-dirty (single precision)
> > > > >
> > > > > Using 1 MPI thread
> > > > >
> > > > > Using 1 OpenMP thread
> > > > >
> > > > >
> > > > > 2 GPUs detected:
> > > > >   #0: NVIDIA GeForce GTX 670, compute cap.: 3.0, ECC: no, stat: compatible
> > > > >   #1: NVIDIA GeForce GTX 670, compute cap.: 3.0, ECC: no, stat: compatible
> > > > >
> > > > >
> > > > > 1 GPU user-selected to be used for this run: #1
> > > > >
> > > > >
> > > > > Using CUDA 8x8x8 non-bonded kernels
> > > > >
> > > > >
> > > > > * WARNING * WARNING * WARNING * WARNING * WARNING * WARNING *
> > > > >
> > > > > We have just committed the new CPU detection code in this
> > > > > branch, and will commit new SSE/AVX kernels in a few days.
> > > > > However, this means that currently only the NxN kernels are
> > > > > accelerated!
> > > > >
> > > >
> > > > Since it does run as a pure CPU run (without the Verlet
> > > > cut-off scheme), would it maybe help to change the NxN kernels
> > > > manually in the .mdp file (and how can I do so)? Or is
> > > > something wrong with using the CUDA 4.2 version, or whatever
> > > > else? The libxml2 message should not be a problem, since the
> > > > pure CPU run works.
> > > >
> > > > > In the mean time, you might want to avoid production runs
> > > > > in 4.6.
> > > > >
> > > > >
> > > > > Back Off! I just backed up pdz_cis_ex_200ns_test.trr to
> > > > > ./#pdz_cis_ex_200ns_test.trr.4#
> > > > >
> > > > > Back Off! I just backed up pdz_cis_ex_200ns_test.xtc to
> > > > > ./#pdz_cis_ex_200ns_test.xtc.4#
> > > > >
> > > > > Back Off! I just backed up pdz_cis_ex_200ns_test.edr to
> > > > > ./#pdz_cis_ex_200ns_test.edr.4#
> > > > >
> > > > > starting mdrun 'Protein in water'
> > > > >
> > > > > 3500000 steps, 7000.0 ps.
> > > > >
> > > > > Segmentation fault
> > > > >
> > > > >
> > > > > Since I have no idea what's going wrong, any help is
> > > > > welcome. Attached you find the log file.
> > > > >
> > > >
> > > > Help is really appreciated, since I want to use my new desktop
> > > > including the GPUs.
> > > >
> > > > > Thanks a lot
> > > > >
> > > > > Sebastian
> > > > >
> > > >
> > > --
> > > gmx-users mailing list gmx-users at gromacs.org
> > > http://lists.gromacs.org/mailman/listinfo/gmx-users
> > > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/Search before
> posting!
> > > * Please don't post (un)subscribe requests to the list.
> Use the
> > > www interface or send it to gmx-users-request at gromacs.org.
> > > * Can't post? Read
> http://www.gromacs.org/Support/Mailing_Lists
> >
> >
> >
> >
>
>
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309