[gmx-users] GROMACS Parallel Runs

David van der Spoel spoel at xray.bmc.uu.se
Mon Oct 2 13:57:17 CEST 2006


Sunny wrote:
>> From: David van der Spoel <spoel at xray.bmc.uu.se>
>> Reply-To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> Subject: Re: [gmx-users] GROMACS Parallel Runs
>> Date: Sun, 01 Oct 2006 19:58:48 +0200
>>
>> Sunny wrote:
>>> Hi,
>>>
>>> I am using GROMACS 3.3.1 parallel runs on an AIX supercomputing 
>>> system. My simulation can successfully run on 16 and 32 CPUs (as well 
>>> as below 16 CPUs). When running on 64 CPUs, however, segmentation 
>>> fault occurs in multiple tasks from very beginning of the simulation. 
>>> I'd like to know what causes the failure and whether there is any 
>>> way to fix it.
>>>
>>
>> please supply more details, like system size, PME details etc.
>>
>>
>>> Thanks,
>>>
>>> Sunny
>> David.
>>
> 
> Hi all,
> 
> Thanks for your replies. The following is the full configuration info 
> for my simulation, taken from md0.log, along with the error message 
> from the .err file. I'm sorry for the long listing.
> 
> Many thanks,
> 
> Sunny
> 
> CONFIGURATION INFO:
> 
> CPU=  0, lastcg=  298, targetcg= 7732, myshift=   23
> CPU=  1, lastcg=  633, targetcg= 8066, myshift=   23
> CPU=  2, lastcg=  970, targetcg= 8404, myshift=   23
> CPU=  3, lastcg= 1298, targetcg= 8732, myshift=   23
> CPU=  4, lastcg= 1629, targetcg= 9062, myshift=   24
> CPU=  5, lastcg= 1959, targetcg= 9392, myshift=   25
> CPU=  6, lastcg= 2296, targetcg= 9730, myshift=   26
> CPU=  7, lastcg= 2624, targetcg=10058, myshift=   27
> CPU=  8, lastcg= 2955, targetcg=10388, myshift=   28
> CPU=  9, lastcg= 3285, targetcg=10718, myshift=   29
> CPU= 10, lastcg= 3622, targetcg=11056, myshift=   30
> CPU= 11, lastcg= 3950, targetcg=11384, myshift=   31
> CPU= 12, lastcg= 4281, targetcg=11714, myshift=   32
> CPU= 13, lastcg= 4611, targetcg=12044, myshift=   33
> CPU= 14, lastcg= 4948, targetcg=12382, myshift=   34
> CPU= 15, lastcg= 5276, targetcg=12710, myshift=   35
> CPU= 16, lastcg= 5607, targetcg=13040, myshift=   36
> CPU= 17, lastcg= 5937, targetcg=13370, myshift=   37
> CPU= 18, lastcg= 6274, targetcg=13708, myshift=   38
> CPU= 19, lastcg= 6602, targetcg=14036, myshift=   39
> CPU= 20, lastcg= 6933, targetcg=14366, myshift=   40
> CPU= 21, lastcg= 7263, targetcg=14696, myshift=   41
> CPU= 22, lastcg= 7600, targetcg=  168, myshift=   42
> CPU= 23, lastcg= 7928, targetcg=  496, myshift=   42
> CPU= 24, lastcg= 8259, targetcg=  826, myshift=   42
> CPU= 25, lastcg= 8589, targetcg= 1156, myshift=   42
> CPU= 26, lastcg= 8840, targetcg= 1408, myshift=   42
> CPU= 27, lastcg= 9003, targetcg= 1570, myshift=   41
> CPU= 28, lastcg= 9166, targetcg= 1734, myshift=   41
> CPU= 29, lastcg= 9329, targetcg= 1896, myshift=   40
> CPU= 30, lastcg= 9492, targetcg= 2060, myshift=   40
> CPU= 31, lastcg= 9655, targetcg= 2222, myshift=   39
> CPU= 32, lastcg= 9818, targetcg= 2386, myshift=   39
> CPU= 33, lastcg= 9981, targetcg= 2548, myshift=   38
> CPU= 34, lastcg=10144, targetcg= 2712, myshift=   38
> CPU= 35, lastcg=10307, targetcg= 2874, myshift=   37
> CPU= 36, lastcg=10470, targetcg= 3038, myshift=   37
> CPU= 37, lastcg=10633, targetcg= 3200, myshift=   36
> CPU= 38, lastcg=10796, targetcg= 3364, myshift=   36
> CPU= 39, lastcg=10959, targetcg= 3526, myshift=   35
> CPU= 40, lastcg=11122, targetcg= 3690, myshift=   35
> CPU= 41, lastcg=11285, targetcg= 3852, myshift=   34
> CPU= 42, lastcg=11448, targetcg= 4016, myshift=   34
> CPU= 43, lastcg=11611, targetcg= 4178, myshift=   33
> CPU= 44, lastcg=11774, targetcg= 4342, myshift=   33
> CPU= 45, lastcg=11937, targetcg= 4504, myshift=   32
> CPU= 46, lastcg=12100, targetcg= 4668, myshift=   32
> CPU= 47, lastcg=12263, targetcg= 4830, myshift=   31
> CPU= 48, lastcg=12426, targetcg= 4994, myshift=   31
> CPU= 49, lastcg=12589, targetcg= 5156, myshift=   30
> CPU= 50, lastcg=12752, targetcg= 5320, myshift=   30
> CPU= 51, lastcg=12915, targetcg= 5482, myshift=   29
> CPU= 52, lastcg=13078, targetcg= 5646, myshift=   29
> CPU= 53, lastcg=13240, targetcg= 5808, myshift=   28
> CPU= 54, lastcg=13403, targetcg= 5970, myshift=   28
> CPU= 55, lastcg=13565, targetcg= 6132, myshift=   27
> CPU= 56, lastcg=13728, targetcg= 6296, myshift=   27
> CPU= 57, lastcg=13890, targetcg= 6458, myshift=   26
> CPU= 58, lastcg=14053, targetcg= 6620, myshift=   26
> CPU= 59, lastcg=14215, targetcg= 6782, myshift=   25
> CPU= 60, lastcg=14378, targetcg= 6946, myshift=   25
> CPU= 61, lastcg=14540, targetcg= 7108, myshift=   24
> CPU= 62, lastcg=14703, targetcg= 7270, myshift=   24
> CPU= 63, lastcg=14865, targetcg= 7432, myshift=   23
> nsb->shift =  42, nsb->bshift=  0
> Listing Scalars
> nsb->nodeid:         0
> nsb->nnodes:     64
> nsb->cgtotal: 14866
> nsb->natoms:  31242
> nsb->shift:      42
> nsb->bshift:      0
> Nodeid   index  homenr  cgload  workload
>     0       0     488     299       299
>     1     488     491     634       634
>     2     979     488     971       971
>     3    1467     488    1299      1299
>     4    1955     488    1630      1630
>     5    2443     486    1960      1960
>     6    2929     488    2297      2297
>     7    3417     488    2625      2625
>     8    3905     488    2956      2956
>     9    4393     486    3286      3286
>    10    4879     488    3623      3623
>    11    5367     488    3951      3951
>    12    5855     488    4282      4282
>    13    6343     486    4612      4612
>    14    6829     488    4949      4949
>    15    7317     488    5277      5277
>    16    7805     488    5608      5608
>    17    8293     486    5938      5938
>    18    8779     488    6275      6275
>    19    9267     488    6603      6603
>    20    9755     488    6934      6934
>    21   10243     486    7264      7264
>    22   10729     488    7601      7601
>    23   11217     488    7929      7929
>    24   11705     488    8260      8260
>    25   12193     486    8590      8590
>    26   12679     488    8841      8841
>    27   13167     489    9004      9004
>    28   13656     489    9167      9167
>    29   14145     489    9330      9330
>    30   14634     489    9493      9493
>    31   15123     489    9656      9656
>    32   15612     489    9819      9819
>    33   16101     489    9982      9982
>    34   16590     489   10145     10145
>    35   17079     489   10308     10308
>    36   17568     489   10471     10471
>    37   18057     489   10634     10634
>    38   18546     489   10797     10797
>    39   19035     489   10960     10960
>    40   19524     489   11123     11123
>    41   20013     489   11286     11286
>    42   20502     489   11449     11449
>    43   20991     489   11612     11612
>    44   21480     489   11775     11775
>    45   21969     489   11938     11938
>    46   22458     489   12101     12101
>    47   22947     489   12264     12264
>    48   23436     489   12427     12427
>    49   23925     489   12590     12590
>    50   24414     489   12753     12753
>    51   24903     489   12916     12916
>    52   25392     489   13079     13079
>    53   25881     486   13241     13241
>    54   26367     489   13404     13404
>    55   26856     486   13566     13566
>    56   27342     489   13729     13729
>    57   27831     486   13891     13891
>    58   28317     489   14054     14054
>    59   28806     486   14216     14216
>    60   29292     489   14379     14379
>    61   29781     486   14541     14541
>    62   30267     489   14704     14704
>    63   30756     486   14866     14866
> 
> parameters of the run:
>   integrator           = md
>   nsteps               = 100000
>   init_step            = 0
>   ns_type              = Grid
>   nstlist              = 10
>   ndelta               = 2
>   bDomDecomp           = FALSE
>   decomp_dir           = 0
>   nstcomm              = 1
>   comm_mode            = Linear
>   nstcheckpoint        = 1000
>   nstlog               = 100
>   nstxout              = 1000
>   nstvout              = 25000
>   nstfout              = 0
>   nstenergy            = 100
>   nstxtcout            = 500
>   init_t               = 0
>   delta_t              = 0.002
>   xtcprec              = 1000
>   nkx                  = 64
>   nky                  = 128
>   nkz                  = 64
>   pme_order            = 4
>   ewald_rtol           = 1e-05
>   ewald_geometry       = 0
>   epsilon_surface      = 0
>   optimize_fft         = TRUE
>   ePBC                 = xyz
>   bUncStart            = FALSE
>   bShakeSOR            = FALSE
>   etc                  = Nose-Hoover
>   epc                  = Parrinello-Rahman
>   epctype              = Isotropic
>   tau_p                = 5
>   ref_p (3x3):
>      ref_p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
>      ref_p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
>      ref_p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
>   compress (3x3):
>      compress[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
>      compress[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
>      compress[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
>   andersen_seed        = 815131
>   rlist                = 1
>   coulombtype          = PME
>   rcoulomb_switch      = 0
>   rcoulomb             = 1
>   vdwtype              = Cut-off
>   rvdw_switch          = 0
>   rvdw                 = 1
>   epsilon_r            = 1
>   epsilon_rf           = 1
>   tabext               = 1
>   gb_algorithm         = Still
>   nstgbradii           = 1
>   rgbradii             = 2
>   gb_saltconc          = 0
>   implicit_solvent     = No
>   DispCorr             = No
>   fudgeQQ              = 1
>   free_energy          = no
>   init_lambda          = 0
>   sc_alpha             = 0
>   sc_power             = 0
>   sc_sigma             = 0.3
>   delta_lambda         = 0
>   disre_weighting      = Conservative
>   disre_mixed          = FALSE
>   dr_fc                = 1000
>   dr_tau               = 0
>   nstdisreout          = 100
>   orires_fc            = 0
>   orires_tau           = 0
>   nstorireout          = 100
>   dihre-fc             = 1000
>   dihre-tau            = 0
>   nstdihreout          = 100
>   em_stepsize          = 0.001
>   em_tol               = 1e-06
>   niter                = 1000
>   fc_stepsize          = 0
>   nstcgsteep           = 10000
>   nbfgscorr            = 10
>   ConstAlg             = Lincs
>   shake_tol            = 0.0001
>   lincs_order          = 4
>   lincs_warnangle      = 30
>   lincs_iter           = 1
>   bd_fric              = 0
>   ld_seed              = 1993
>   cos_accel            = 0
>   deform (3x3):
>      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>   userint1             = 0
>   userint2             = 0
>   userint3             = 0
>   userint4             = 0
>   userreal1            = 0
>   userreal2            = 0
>   userreal3            = 0
>   userreal4            = 0
> grpopts:
>   nrdf:           75399
>   ref_t:             300
>   tau_t:             0.5
> anneal:                  No
> ann_npoints:               0
>   acc:               0           0           0
>   nfreeze:           N           N           N
>   energygrp_flags[  0]: 0 0 0
>   energygrp_flags[  1]: 0 0 0
>   energygrp_flags[  2]: 0 0 0
>   efield-x:
>      n = 0
>   efield-xt:
>      n = 0
>   efield-y:
>      n = 0
>   efield-yt:
>      n = 0
>   efield-z:
>      n = 0
>   efield-zt:
>      n = 0
>   bQMMM                = FALSE
>   QMconstraints        = 0
>   QMMMscheme           = 0
>   scalefactor          = 1
> qm_opts:
>   ngQM                 = 0
> Max number of graph edges per atom is 4
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw:     FALSE
> Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
> Cut-off's:   NS: 1   Coulomb: 1   LJ: 1
> System total charge: 0.000
> Generated table with 1000 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 500 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 500 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 500 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
> 
> Enabling SPC water optimization for 6108 molecules.
> 
> Will do PME sum in reciprocal space.
> [End]
> --------------------------------------------------------------------------
> ERROR MESSAGE:
> 
> Reading file topol.tpr, VERSION 3.3.1 (single precision)
> 
> Back Off! I just backed up ener.edr to ./#ener.edr.1#
> starting mdrun 'sivdppc'
> 100000 steps,    200.0 ps.
> 
> 
> Back Off! I just backed up traj.trr to ./#traj.trr.1#
> 
> Back Off! I just backed up traj.xtc to ./#traj.xtc.1#
> 
> Back Off! I just backed up step-1.pdb to ./#step-1.pdb.1#
> ERROR: 0031-250  task 62: Segmentation fault
> ERROR: 0031-250  task 54: Segmentation fault
> ERROR: 0031-250  task 58: Segmentation fault
> ERROR: 0031-250  task 50: Segmentation fault
> ERROR: 0031-250  task 51: Segmentation fault
> 
> Back Off! I just backed up step0.pdb to ./#step0.pdb.1#
> ERROR: 0031-250  task 19: Segmentation fault
> ERROR: 0031-250  task 28: Segmentation fault
> ERROR: 0031-250  task 49: Segmentation fault
> ERROR: 0031-250  task 17: Segmentation fault
> ERROR: 0031-250  task 20: Segmentation fault
> ERROR: 0031-250  task 23: Segmentation fault
> ERROR: 0031-250  task 26: Segmentation fault
> ERROR: 0031-250  task 27: Segmentation fault
> ERROR: 0031-250  task 31: Segmentation fault
> Wrote pdb files with previous and current coordinates
> ERROR: 0031-250  task 52: Segmentation fault
> ERROR: 0031-250  task 18: Segmentation fault
> ERROR: 0031-250  task 60: Segmentation fault
> ERROR: 0031-250  task 24: Segmentation fault
> ERROR: 0031-250  task 16: Segmentation fault
> ERROR: 0031-250  task 30: Segmentation fault
> ERROR: 0031-250  task 21: Segmentation fault
> ERROR: 0031-250  task 14: Segmentation fault
> ERROR: 0031-250  task 48: Segmentation fault
> ERROR: 0031-250  task 38: Segmentation fault
> ERROR: 0031-250  task 22: Segmentation fault
> ERROR: 0031-250  task 46: Segmentation fault
> ERROR: 0031-250  task 3: Segmentation fault
> ERROR: 0031-250  task 45: Segmentation fault
> ERROR: 0031-250  task 37: Segmentation fault
> ERROR: 0031-250  task 40: Segmentation fault
> ERROR: 0031-250  task 8: Segmentation fault
> ERROR: 0031-250  task 15: Segmentation fault
> ERROR: 0031-250  task 33: Segmentation fault
> ERROR: 0031-250  task 39: Segmentation fault
> ERROR: 0031-250  task 44: Segmentation fault
> ERROR: 0031-250  task 56: Segmentation fault
> ERROR: 0031-250  task 43: Segmentation fault
> ERROR: 0031-250  task 4: Segmentation fault
> ERROR: 0031-250  task 12: Segmentation fault
> ERROR: 0031-250  task 29: Segmentation fault
> ERROR: 0031-250  task 35: Segmentation fault
> ERROR: 0031-250  task 25: Segmentation fault
> ERROR: 0031-250  task 6: Segmentation fault
> ERROR: 0031-250  task 42: Segmentation fault
> ERROR: 0031-250  task 13: Segmentation fault
> ERROR: 0031-250  task 1: Segmentation fault
> ERROR: 0031-250  task 9: Segmentation fault
> ERROR: 0031-250  task 10: Segmentation fault
> ERROR: 0031-250  task 2: Segmentation fault
> ERROR: 0031-250  task 47: Segmentation fault
> ERROR: 0031-250  task 5: Segmentation fault
> ERROR: 0031-250  task 7: Segmentation fault
> ERROR: 0031-250  task 11: Segmentation fault
> ERROR: 0031-250  task 32: Segmentation fault
> ERROR: 0031-250  task 34: Segmentation fault
> ERROR: 0031-250  task 41: Segmentation fault
> ERROR: 0031-250  task 36: Segmentation fault
> ERROR: 0031-250  task 55: Terminated
> ERROR: 0031-250  task 59: Terminated
> ERROR: 0031-250  task 53: Terminated
> ERROR: 0031-250  task 57: Terminated
> ERROR: 0031-250  task 61: Terminated
> ERROR: 0031-250  task 63: Terminated
> ERROR: 0031-250  task 0: Terminated
> [End]
> 
> 
You have ncpu = nkx (the number of PME grid points in the x dimension), 
so each CPU gets only a single grid plane. IIRC a bug affecting the case 
of fewer than two grid points per CPU has been fixed in CVS, so it may 
work better with the CVS version.
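For anyone hitting the same wall: the arithmetic behind this diagnosis can be 
sketched as below. This is an illustrative check only, not GROMACS code; the 
"at least 2 planes per CPU" threshold is the rule of thumb implied above, and 
the function names are made up for the example.

```python
def pme_planes_per_cpu(nkx: int, ncpu: int) -> float:
    """Average number of PME grid planes along x owned by each task,
    assuming a 1D decomposition of the grid over the MPI tasks."""
    return nkx / ncpu

def likely_safe(nkx: int, ncpu: int) -> bool:
    """Heuristic: older GROMACS 3.x reportedly misbehaved with fewer
    than 2 grid planes per task."""
    return pme_planes_per_cpu(nkx, ncpu) >= 2

# The failing run: nkx = 64 on 64 tasks leaves 1 plane per task.
# The working runs: 16 or 32 tasks leave 4 or 2 planes per task.
```

With this run's settings (nkx = 64), 64 tasks gives exactly one plane each, 
while 16 and 32 tasks stay at or above the threshold, which matches the 
observed behavior. Increasing nkx or reducing the task count would be the 
workaround short of moving to the CVS version.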

-- 
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,  	75124 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++