[gmx-users] shortage of shared memory

chris.neale at utoronto.ca chris.neale at utoronto.ca
Sun Jul 8 06:36:22 CEST 2007


I have a variety of systems (up to 500K atoms) that run in parallel without
ever hitting errors due to a shortage of shared memory. However, I sometimes
run into this problem with lipid bilayer systems of fewer than 30K atoms.

When a job does hit the shared memory error, it fails before any simulation
time has elapsed. What's more, if I resubmit the job it often runs fine.
However, one recent bilayer system set up by a colleague won't ever run.

I am using OpenMPI 1.2.1, and I can avoid using shared memory like this:

${OMPI}/mpirun --mca btl ^sm ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} -4
(etc...)
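
I have not yet tried tuning the sm BTL instead of turning it off entirely. In
case it helps anyone suggest settings, the tunable parameters of the
shared-memory BTL and of its memory pool can be listed with ompi_info (same
install prefix as above; I have not checked which of these parameters our
build actually honours):

${OMPI}/ompi_info --param btl sm
${OMPI}/ompi_info --param mpool sm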

That absolutely fixes the error, but when I do that the scaling to 4
processors is very poor, as judged both by walltime and by the timing output
at the end of the GROMACS .log file.

This also confuses me, since my sysadmin tells me that GROMACS doesn't use
shared memory. My guess is that mdrun itself never touches shared memory
directly, but that OpenMPI's sm BTL passes intra-node MPI messages through a
shared-memory segment, so any MPI build of mdrun ends up using it indirectly;
is that right?
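
For what it is worth, all four ranks in the failing run below land on the
same node (cn-r1-27), so presumably every MPI message is going through the sm
BTL. Placement can be checked by launching a trivial command the same way I
launch mdrun, e.g.:

${OMPI}/mpirun -np ${mynp} hostname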

I get two basic error messages. Sometimes it is just this single line on stderr:
[cn-r4-18][0,1,1][btl_sm_component.c:521:mca_btl_sm_component_progress] SM faild to send message due to shortage of shared memory.

And sometimes it is a longer error message (see the end of this email for the
full stderr from a run of that type).

I believe this to be a problem with our cluster, which probably makes this
the wrong mailing list for the question. Still, I am hoping that somebody can
help me clarify what is going on with shared-memory usage in GROMACS, and
perhaps why the error appears to be stochastic yet also correlated with
bilayer systems.

Our cluster is also having problems with random xtc or trr file corruption
(roughly 1 run in 10 to 20), in case that seems related to the shared-memory
issue. However, that is not the issue I am raising in this post.
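
(If anyone wants to check their own runs for the same thing, a minimal way to
flag a corrupt trajectory is to pass it through gmxcheck, e.g. with one of
our file names:

gmxcheck -f bilayer_popc_md219.xtc

which should complain when it fails to read a frame rather than printing the
normal frame summary.)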

Thanks,
Chris.

########## Here is the stderr
########## Following this is the x0.log file, but that doesn't appear to have error indications in it

NNODES=4, MYRANK=1, HOSTNAME=cn-r1-27
NNODES=4, MYRANK=0, HOSTNAME=cn-r1-27
NODEID=0 argc=8
                          :-)  G  R  O  M  A  C  S  (-:

NODEID=1 argc=8
NNODES=4, MYRANK=2, HOSTNAME=cn-r1-27
NODEID=2 argc=8
NNODES=4, MYRANK=3, HOSTNAME=cn-r1-27
NODEID=3 argc=8
                   Green Red Orange Magenta Azure Cyan Skyblue

                             :-)  VERSION 3.3.1  (-:


       Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
        Copyright (c) 1991-2000, University of Groningen, The Netherlands.
              Copyright (c) 2001-2006, The GROMACS development team,
             check out http://www.gromacs.org for more information.

          This program is free software; you can redistribute it and/or
           modify it under the terms of the GNU General Public License
          as published by the Free Software Foundation; either version 2
              of the License, or (at your option) any later version.

     :-)  /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2  (-:

Option     Filename  Type         Description
------------------------------------------------------------
   -s bilayer_popc_md219.tpr  Input        Generic run input: tpr tpb tpa xml
   -o bilayer_popc_md219.trr  Output       Full precision trajectory: trr trj
   -x bilayer_popc_md219.xtc  Output, Opt. Compressed trajectory (portable xdr
                                    format)
   -c bilayer_popc_md219.gro  Output       Generic structure: gro g96 pdb xml
   -e bilayer_popc_md219.edr  Output       Generic energy: edr ene
   -g bilayer_popc_md219.log  Output       Log file
 -dgdl bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-field bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-table bilayer_popc_md219.xvg  Input, Opt.  xvgr/xmgr file
-tablep bilayer_popc_md219.xvg  Input, Opt.  xvgr/xmgr file
-rerun bilayer_popc_md219.xtc  Input, Opt.  Generic trajectory: xtc trr trj gro
                                    g96 pdb
-tpi bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
  -ei bilayer_popc_md219.edi  Input, Opt.  ED sampling input
  -eo bilayer_popc_md219.edo  Output, Opt. ED sampling output
   -j bilayer_popc_md219.gct  Input, Opt.  General coupling stuff
  -jo bilayer_popc_md219.gct  Output, Opt. General coupling stuff
-ffout bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-devout bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-runav bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
  -pi bilayer_popc_md219.ppa  Input, Opt.  Pull parameters
  -po bilayer_popc_md219.ppa  Output, Opt. Pull parameters
  -pd bilayer_popc_md219.pdo  Output, Opt. Pull data output
  -pn bilayer_popc_md219.ndx  Input, Opt.  Index file
-mtx bilayer_popc_md219.mtx  Output, Opt. Hessian matrix
  -dn bilayer_popc_md219.ndx  Output, Opt. Index file

       Option   Type  Value  Description
------------------------------------------------------
       -[no]h   bool     no  Print help info and quit
       -[no]X   bool     no  Use dialog box GUI to edit command line options
        -nice    int     19  Set the nicelevel
      -deffnm string bilayer_popc_md219  Set the default filename for all file
                             options
    -[no]xvgr   bool    yes  Add specific codes (legends etc.) in the output
                             xvg files for the xmgrace program
          -np    int      4  Number of nodes, must be the same as used for
                             grompp
          -nt    int      1  Number of threads to start on each node
       -[no]v   bool    yes  Be loud and noisy
-[no]compact   bool    yes  Write a compact log file
-[no]sepdvdl   bool     no  Write separate V and dVdl terms for each
                             interaction type and node to the log file(s)
   -[no]multi   bool     no  Do multiple simulations in parallel (only with
                             -np > 1)
      -replex    int      0  Attempt replica exchange every # steps
      -reseed    int     -1  Seed for replica exchange, -1 is generate a seed
    -[no]glas   bool     no  Do glass simulation with special long range
                             corrections
  -[no]ionize   bool     no  Do a simulation including the effect of an X-Ray
                             bombardment on your system

Getting Loaded...
Reading file bilayer_popc_md219.tpr, VERSION 3.3.1 (single precision)
Loaded with Money

[cn-r1-27:26937] *** Process received signal ***
[cn-r1-27:26937] Signal: Segmentation fault (11)
[cn-r1-27:26937] Signal code: Address not mapped (1)
[cn-r1-27:26937] Failing at address: 0x18
[cn-r1-27:26937] [ 0] /lib64/tls/libpthread.so.0 [0x2a969a0730]
[cn-r1-27:26937] [ 1] /tools/openmpi/1.2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x6b) [0x2a9a488c3b]
[cn-r1-27:26937] [ 2] /tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x14d) [0x2a9a1765ed]
[cn-r1-27:26937] [ 3] /tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1c4) [0x2a9a177af4]
[cn-r1-27:26937] [ 4] /tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_match_completion_free+0x22e) [0x2a9a17827e]
[cn-r1-27:26937] [ 5] /tools/openmpi/1.2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x846) [0x2a9a489fe6]
[cn-r1-27:26937] [ 6] /tools/openmpi/1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2a) [0x2a9a27e47a]
[cn-r1-27:26937] [ 7] /tools/openmpi/1.2/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2a966405da]
[cn-r1-27:26937] [ 8] /tools/openmpi/1.2/lib/libmpi.so.0(ompi_request_wait_all+0xad) [0x2a9637adbd]
[cn-r1-27:26937] [ 9] /tools/openmpi/1.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x2ab) [0x2a9a9af39b]
[cn-r1-27:26937] [10] /tools/openmpi/1.2/lib/libmpi.so.0(ompi_comm_nextcid+0x20f) [0x2a9636b36f]
[cn-r1-27:26937] [11] /tools/openmpi/1.2/lib/libmpi.so.0(ompi_comm_dup+0x94) [0x2a96369bd4]
[cn-r1-27:26937] [12] /tools/openmpi/1.2/lib/libmpi.so.0(PMPI_Comm_dup+0x6f) [0x2a96390e0f]
[cn-r1-27:26937] [13] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(gmx_parallel_3dfft_init+0x72) [0x48f902]
[cn-r1-27:26937] [14] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(mk_fftgrid+0xd1) [0x467f11]
[cn-r1-27:26937] [15] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(init_pme+0x4c0) [0x460860]
[cn-r1-27:26937] [16] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(mdrunner+0x84f) [0x42d8ff]
[cn-r1-27:26937] [17] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(main+0x237) [0x42e357]
[cn-r1-27:26937] [18] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a96ac4aaa]
[cn-r1-27:26937] [19] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(XmCreateOptionMenu+0x42) [0x416d8a]
[cn-r1-27:26937] *** End of error message ***
mpirun noticed that job rank 0 with PID 26934 on node cn-r1-27 exited on signal 15 (Terminated).


#######################



####################### Here is the log file from the head node


$cat bilayer_popclrlj_md1910.log
Log file opened on Sun May 20 18:12:34 2007
Host: cn-r4-29  pid: 9245  nodeid: 0  nnodes:  4
The Gromacs distribution was built Mon Mar 19 11:20:43 EDT 2007 by
cneale at cn-r1-3 (Linux 2.6.5-7.282-smp x86_64)


                          :-)  G  R  O  M  A  C  S  (-:

                   Gromacs Runs On Most of All Computer Systems

                             :-)  VERSION 3.3.1  (-:


       Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
        Copyright (c) 1991-2000, University of Groningen, The Netherlands.
              Copyright (c) 2001-2006, The GROMACS development team,
             check out http://www.gromacs.org for more information.

          This program is free software; you can redistribute it and/or
           modify it under the terms of the GNU General Public License
          as published by the Free Software Foundation; either version 2
              of the License, or (at your option) any later version.

     :-)  /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2  (-:


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------

CPU=  0, lastcg= 5041, targetcg=15126, myshift=    2
CPU=  1, lastcg=10083, targetcg=20168, myshift=    2
CPU=  2, lastcg=15126, targetcg= 5042, myshift=    2
CPU=  3, lastcg=20168, targetcg=10084, myshift=    2
nsb->shift =   2, nsb->bshift=  0
Listing Scalars
nsb->nodeid:         0
nsb->nnodes:      4
nsb->cgtotal: 20169
nsb->natoms:  51236
nsb->shift:       2
nsb->bshift:      0
Nodeid   index  homenr  cgload  workload
      0       0   12808    5042      5042
      1   12808   12808   10084     10084
      2   25616   12812   15127     15127
      3   38428   12808   20169     20169

parameters of the run:
    integrator           = md
    nsteps               = 250000
    init_step            = 0
    ns_type              = Grid
    nstlist              = 10
    ndelta               = 2
    bDomDecomp           = FALSE
    decomp_dir           = 0
    nstcomm              = 1
    comm_mode            = Linear
    nstcheckpoint        = 1000
    nstlog               = 1000
    nstxout              = 250000
    nstvout              = 250000
    nstfout              = 250000
    nstenergy            = 5000
    nstxtcout            = 5000
    init_t               = 87300
    delta_t              = 0.002
    xtcprec              = 1000
    nkx                  = 84
    nky                  = 80
    nkz                  = 60
    pme_order            = 4
    ewald_rtol           = 1e-05
    ewald_geometry       = 0
    epsilon_surface      = 0
    optimize_fft         = FALSE
    ePBC                 = xyz
    bUncStart            = TRUE
    bShakeSOR            = FALSE
    etc                  = Berendsen
    epc                  = Berendsen
    epctype              = Semiisotropic
    tau_p                = 4
    ref_p (3x3):
       ref_p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
       ref_p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
       ref_p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
    compress (3x3):
       compress[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
       compress[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
       compress[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
    andersen_seed        = 815131
    rlist                = 0.9
    coulombtype          = PME
    rcoulomb_switch      = 0
    rcoulomb             = 0.9
    vdwtype              = Cut-off
    rvdw_switch          = 0
    rvdw                 = 1.4
    epsilon_r            = 1
    epsilon_rf           = 1
    tabext               = 1
    gb_algorithm         = Still
    nstgbradii           = 1
    rgbradii             = 2
    gb_saltconc          = 0
    implicit_solvent     = No
    DispCorr             = EnerPres
    fudgeQQ              = 0.5
    free_energy          = no
    init_lambda          = 0
    sc_alpha             = 0
    sc_power             = 0
    sc_sigma             = 0.3
    delta_lambda         = 0
    disre_weighting      = Conservative
    disre_mixed          = FALSE
    dr_fc                = 1000
    dr_tau               = 0
    nstdisreout          = 100
    orires_fc            = 0
    orires_tau           = 0
    nstorireout          = 100
    dihre-fc             = 1000
    dihre-tau            = 0
    nstdihreout          = 100
    em_stepsize          = 0.01
    em_tol               = 10
    niter                = 20
    fc_stepsize          = 0
    nstcgsteep           = 1000
    nbfgscorr            = 10
    ConstAlg             = Lincs
    shake_tol            = 1e-04
    lincs_order          = 4
    lincs_warnangle      = 30
    lincs_iter           = 1
    bd_fric              = 0
    ld_seed              = 1993
    cos_accel            = 0
    deform (3x3):
       deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
       deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
    userint1             = 0
    userint2             = 0
    userint3             = 0
    userint4             = 0
    userreal1            = 0
    userreal2            = 0
    userreal3            = 0
    userreal4            = 0
grpopts:
    nrdf:             33598.8     51892.2
    ref_t:                310         310
    tau_t:                0.1         0.1
anneal:                   No          No
ann_npoints:               0           0
    acc:            0           0           0
    nfreeze:           N           N           N
    energygrp_flags[  0]: 0
    efield-x:
       n = 0
    efield-xt:
       n = 0
    efield-y:
       n = 0
    efield-yt:
       n = 0
    efield-z:
       n = 0
    efield-zt:
       n = 0
    bQMMM                = FALSE
    QMconstraints        = 0
    QMMMscheme           = 0
    scalefactor          = 1
qm_opts:
    ngQM                 = 0
Max number of graph edges per atom is 4
Table routines are used for coulomb: TRUE
Table routines are used for vdw:     FALSE
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's:   NS: 0.9   Coulomb: 0.9   LJ: 1.4
System total charge: 0.000
Generated table with 1200 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1200 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1200 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Enabling TIP4p water optimization for 8649 molecules.

Will do PME sum in reciprocal space.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------




