[gmx-users] Problem with mdrun on 8CPUs and 1GPU for gromacs 4.6.3

Kirill Lykov kirill.lykov at usi.ch
Tue Jun 2 16:45:40 CEST 2015


Dear Gromacs users,

I'm trying to get the best performance out of a cluster in which every node has 8 CPUs and 1 GPU. As a test, I run a MARTINI polarisable water system, but I have problems with it: mdrun works with one MPI process, but it crashes with 8 MPI processes sharing the 1 GPU. Below is the whole sbatch script:


#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1

export CRAY_CUDA_MPS=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

time aprun -B mdrun_mpi-gpu -gpu_id 00000000 -ntomp 1 -deffnm md -v -c md.gro
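
For reference, the -gpu_id string has one digit per MPI rank on the node, and each digit names the GPU that rank should use, so eight zeros map all eight ranks to GPU 0. A minimal sketch of building that string from the rank count, assuming a POSIX shell; the function name gpu_id_string is hypothetical and not part of GROMACS or SLURM:

```shell
# Hypothetical helper: print N zeros, i.e. map N ranks on a node to GPU 0.
gpu_id_string() {
    # printf pads an empty string to width N with spaces; tr turns them into zeros.
    printf '%*s' "$1" '' | tr ' ' '0'
}

# With 8 ranks per node, as in the script above:
gpu_id_string 8
echo
```

The result could then be passed on the command line as -gpu_id "$(gpu_id_string 8)" instead of typing the zeros by hand.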

But it only gives me an error about a broken pipe:

_pmiu_daemon(SIGCHLD): [NID 02124] [c1-1c0s3n0] [Tue Jun  2 16:25:48 2015] PE RANK 2 exit signal Broken pipe

[NID 02124] 2015-06-02 16:25:48 Apid 4833499: initiated application termination

I also tried one MPI task with 8 OpenMP threads, and other combinations, but I always get the same error.

From the core file of the mdrun crash I get the following backtrace:
> gdb mdrun core

#0  0x00002aaab2969885 in read_alias_file () from /lib64/libc.so.6
#1  0x00002aaab1612f65 in PMPI_Abort () from /opt/cray/lib64/libmpich_gnu_48.so.2
#2  0x00002aaaab909682 in gmx_abort (noderank=noderank@entry=4, nnodes=nnodes@entry=8, errorno=errorno@entry=-1) at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/network.c:518
#3  0x00002aaaab841dec in quit_gmx (msg=<optimized out>) at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/gmx_fatal.c:266
#4  0x00002aaaab842345 in _gmx_error (key=<optimized out>, msg=<optimized out>, file=0x2aaaabd6b010 <CSWTCH.6+40304> "/apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/gpu_utils/gpu_utils.cu",
    line=511) at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/gmxlib/gmx_fatal.c:774
#5  0x00002aaaabd274c5 in init_gpu () from /apps/daint/gromacs/4.6.3/gnu_481/lib/libgmx_mpi.so.8
#6  0x00002aaaab1875ba in pick_nbnxn_resources (hwinfo=0x667e70, bDoNonbonded=<optimized out>, bUseGPU=bUseGPU@entry=0x6be1e0, bEmulateGPU=bEmulateGPU@entry=0x7fffffff3c60, cr=<optimized out>,
    cr=<optimized out>, fp=<optimized out>) at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/mdlib/forcerec.c:1686
#7  0x00002aaaab19152b in init_nb_verlet (nbpu_opt=0x4481aa <cross_sec_h+4426> "auto", cr=0x65e4d0, fr=0x6bd140, ir=0x667810, nb_verlet=0x6bd328, fp=0x0)
    at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/mdlib/forcerec.c:1890
#8  init_forcerec (fp=0x0, oenv=oenv@entry=0x667780, fr=fr@entry=0x6bd140, fcd=fcd@entry=0xa45c70, ir=ir@entry=0x667810, mtop=mtop@entry=0x667c40, cr=cr@entry=0x65e4d0, box=box@entry=0x7fffffff3f20,
    bMolEpot=bMolEpot@entry=0, tabfn=0x6683d0 "dppc-gm1-2.xvg", tabafn=tabafn@entry=0x668410 "dppc-gm1-2.xvg", tabpfn=tabpfn@entry=0x668450 "dppc-gm1-2.xvg", tabbfn=tabbfn@entry=0x668490 "dppc-gm1-2.xvg",
    nbpu_opt=nbpu_opt@entry=0x4481aa <cross_sec_h+4426> "auto", bNoSolvOpt=bNoSolvOpt@entry=0, print_force=print_force@entry=-1) at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/mdlib/forcerec.c:2890
#9  0x000000000040fef6 in mdrunner (hw_opt=hw_opt@entry=0x7fffffff59b0, fplog=0x0, cr=cr@entry=0x65e4d0, nfile=nfile@entry=36, fnm=fnm@entry=0x7fffffff5fc0, oenv=0x667780, bVerbose=bVerbose@entry=1,
    bCompact=bCompact@entry=1, nstglobalcomm=-1, ddxyz=ddxyz@entry=0x7fffffff5900, dd_node_order=dd_node_order@entry=1, rdd=<optimized out>, rconstr=<optimized out>,
    dddlb_opt=dddlb_opt@entry=0x4481aa <cross_sec_h+4426> "auto", dlb_scale=0.800000012, ddcsx=0x0, ddcsy=0x0, ddcsz=0x0, nbpu_opt=<optimized out>, nsteps_cmdline=-2, nstepout=100, resetstep=-1, nmultisim=0,
    repl_ex_nst=0, repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1, deviceOptions=deviceOptions@entry=0x4481f3 <cross_sec_h+4499> "", Flags=<optimized out>)
    at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/kernel/runner.c:1404
#10 0x000000000043b58d in cmain (argc=1, argv=0x667720) at /apps/santis/sandbox/lucamar/src/gromacs-4.6.3/src/kernel/mdrun.c:737
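
Reading the trace from the bottom up, the abort comes from a fatal error raised at line 511 of gpu_utils.cu, reached through init_gpu during force-record setup, so the failure happens while a rank initialises the GPU rather than during the run itself. To pick out only the frames that point into the GROMACS source tree, something like the following could be used. It is a rough sketch: the function name gromacs_frames is hypothetical, and it assumes each frame header is on a single line and names a path containing gromacs-<version>/src/.

```shell
# Hypothetical filter: read a gdb backtrace on stdin and keep only the
# frame lines whose text mentions a path inside the GROMACS source tree.
gromacs_frames() {
    grep -E '^#[0-9]+ .*gromacs-[0-9.]+/src/'
}

# Possible usage (binary and core file names are assumptions):
#   gdb -batch -ex bt mdrun_mpi-gpu core 2>/dev/null | gromacs_frames
```

Frames that only name a shared library, such as #0, #1 and #5 above, are dropped by this filter, which leaves the GROMACS call chain from gmx_abort down to cmain.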


[1] https://www.acrc.a-star.edu.sg/docs/ASTAR%20GPU%20symposium-22th-Jan-2014.pdf


Best regards,

Kirill

