[gmx-users] Problem with mdrun on 8CPUs and 1GPU for gromacs 4.6.3
Mark Abraham
mark.j.abraham at gmail.com
Tue Jun 2 18:58:24 CEST 2015
Hi,
This is caused by an MPI rank failing to get access to the GPU. Maybe this
could happen if your ranks are not actually all on the same node? There
would be an error message on stderr if your infrastructure hadn't swallowed
it, but I don't know how to configure it to behave better.
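
One way to test the same-node guess is to start hostname with the same
launcher settings as mdrun and count how many ranks land on each node. The
following is only a minimal sketch: ranks_per_node and all_on_one_node are
made-up helper names (not part of GROMACS, aprun or SLURM), and the aprun
line in the comment assumes the launcher accepts hostname with the same -B
option as in your job script.

```shell
#!/bin/bash
# Hypothetical helper: read one hostname per line on stdin (one line per
# MPI rank) and print "<node> <number of ranks>" for each distinct node.
ranks_per_node() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Hypothetical helper: succeed only if every rank reported the same node.
all_on_one_node() {
    sort -u | awk 'END { exit (NR != 1) }'
}

# Assumed usage inside the job script, before the mdrun line:
#   aprun -B hostname | ranks_per_node
#   aprun -B hostname | all_on_one_node || echo "ranks span several nodes" >&2
```

With a -gpu_id string of 00000000, all 8 ranks are mapped to GPU 0 of their
own node, so the first command should print a single node with a count of 8.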
Mark
On Tue, Jun 2, 2015 at 4:46 PM Kirill Lykov <kirill.lykov at usi.ch> wrote:
> Dear Gromacs users,
>
> I'm trying to get the best performance out of a cluster where every node
> has 8 CPUs and 1 GPU. To test it, I run a Martini polarizable water
> system, but I have problems with it. While mdrun works with one MPI
> process, it crashes with 8 MPI processes and 1 GPU. Below is the whole
> sbatch script:
>
> #SBATCH --ntasks=8
> #SBATCH --ntasks-per-node=8
> #SBATCH --cpus-per-task=1
>
> export CRAY_CUDA_MPS=1
> export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
>
> time aprun -B mdrun_mpi-gpu -gpu_id 00000000 -ntomp 1 -deffnm md -v -c md.gro
>
> But it only gives me a broken pipe error:
>
> _pmiu_daemon(SIGCHLD): [NID 02124] [c1-1c0s3n0] [Tue Jun 2 16:25:48 2015] PE RANK 2 exit signal Broken pipe
> [NID 02124] 2015-06-02 16:25:48 Apid 4833499: initiated application termination
>
> I also tried one MPI task with 8 OpenMP threads, and other combinations,
> but I always get the same error.
>
> From the core file of the mdrun crash I have the following:
> > gdb mdrun core
>
> #0 0x00002aaab2969885 in read_alias_file () from /lib64/libc.so.6
> #1 0x00002aaab1612f65 in PMPI_Abort () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #2 0x00002aaaab909682 in gmx_abort (noderank=noderank@entry=4,
>     nnodes=nnodes@entry=8, errorno=errorno@entry=-1)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/network.c:518
> #3 0x00002aaaab841dec in quit_gmx (msg=<optimized out>)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/gmx_fatal.c:266
> #4 0x00002aaaab842345 in _gmx_error (key=<optimized out>, msg=<optimized out>,
>     file=0x2aaaabd6b010 <CSWTCH.6+40304>
>     "/apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/gpu_utils/gpu_utils.cu",
>     line=511)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/gmx_fatal.c:774
> #5 0x00002aaaabd274c5 in init_gpu () from /apps/daint/gromacs/4.6.3/gnu_481/lib/libgmx_mpi.so.8
> #6 0x00002aaaab1875ba in pick_nbnxn_resources (hwinfo=0x667e70,
>     bDoNonbonded=<optimized out>, bUseGPU=bUseGPU@entry=0x6be1e0,
>     bEmulateGPU=bEmulateGPU@entry=0x7fffffff3c60, cr=<optimized out>,
>     cr=<optimized out>, fp=<optimized out>)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/mdlib/forcerec.c:1686
> #7 0x00002aaaab19152b in init_nb_verlet (nbpu_opt=0x4481aa <cross_sec_h+4426> "auto",
>     cr=0x65e4d0, fr=0x6bd140, ir=0x667810, nb_verlet=0x6bd328, fp=0x0)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/mdlib/forcerec.c:1890
> #8 init_forcerec (fp=0x0, oenv=oenv@entry=0x667780, fr=fr@entry=0x6bd140,
>     fcd=fcd@entry=0xa45c70, ir=ir@entry=0x667810, mtop=mtop@entry=0x667c40,
>     cr=cr@entry=0x65e4d0, box=box@entry=0x7fffffff3f20,
>     bMolEpot=bMolEpot@entry=0, tabfn=0x6683d0 "dppc-gm1-2.xvg",
>     tabafn=tabafn@entry=0x668410 "dppc-gm1-2.xvg",
>     tabpfn=tabpfn@entry=0x668450 "dppc-gm1-2.xvg",
>     tabbfn=tabbfn@entry=0x668490 "dppc-gm1-2.xvg",
>     nbpu_opt=nbpu_opt@entry=0x4481aa <cross_sec_h+4426> "auto",
>     bNoSolvOpt=bNoSolvOpt@entry=0, print_force=print_force@entry=-1)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/mdlib/forcerec.c:2890
> #9 0x000000000040fef6 in mdrunner (hw_opt=hw_opt@entry=0x7fffffff59b0,
>     fplog=0x0, cr=cr@entry=0x65e4d0, nfile=nfile@entry=36,
>     fnm=fnm@entry=0x7fffffff5fc0, oenv=0x667780, bVerbose=bVerbose@entry=1,
>     bCompact=bCompact@entry=1, nstglobalcomm=-1, ddxyz=ddxyz@entry=0x7fffffff5900,
>     dd_node_order=dd_node_order@entry=1, rdd=<optimized out>,
>     rconstr=<optimized out>,
>     dddlb_opt=dddlb_opt@entry=0x4481aa <cross_sec_h+4426> "auto",
>     dlb_scale=0.800000012, ddcsx=0x0, ddcsy=0x0, ddcsz=0x0,
>     nbpu_opt=<optimized out>, nsteps_cmdline=-2, nstepout=100, resetstep=-1,
>     nmultisim=0, repl_ex_nst=0, repl_ex_nex=0, repl_ex_seed=-1, pforce=-1,
>     cpt_period=15, max_hours=-1,
>     deviceOptions=deviceOptions@entry=0x4481f3 <cross_sec_h+4499> "",
>     Flags=<optimized out>)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/kernel/runner.c:1404
> #10 0x000000000043b58d in cmain (argc=1, argv=0x667720)
>     at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/kernel/mdrun.c:737
>
>
>
>
> Best regards,
>
> Kirill