[gmx-users] GROMACS Parallel Runs
David van der Spoel
spoel at xray.bmc.uu.se
Mon Oct 2 13:57:17 CEST 2006
Sunny wrote:
>> From: David van der Spoel <spoel at xray.bmc.uu.se>
>> Reply-To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> Subject: Re: [gmx-users] GROMACS Parallel Runs
>> Date: Sun, 01 Oct 2006 19:58:48 +0200
>>
>> Sunny wrote:
>>> Hi,
>>>
>>> I am running GROMACS 3.3.1 in parallel on an AIX supercomputing
>>> system. My simulation runs successfully on 16 and 32 CPUs (as well
>>> as on fewer than 16 CPUs). When running on 64 CPUs, however, a
>>> segmentation fault occurs in multiple tasks from the very beginning
>>> of the simulation. I'd like to know what causes the failure and
>>> whether there is any way to fix it.
>>>
>>
>> Please supply more details, such as the system size, PME settings, etc.
>>
>>
>>> Thanks,
>>>
>>> Sunny
>> David.
>>
>
> Hi all,
>
> Thanks for your replies. Below are the full configuration info for my
> simulation (from md0.log) and the error message from the .err file.
> I'm sorry for the tedious list.
>
> Many thanks,
>
> Sunny
>
> CONFIGURATION INFO:
>
> CPU= 0, lastcg= 298, targetcg= 7732, myshift= 23
> CPU= 1, lastcg= 633, targetcg= 8066, myshift= 23
> CPU= 2, lastcg= 970, targetcg= 8404, myshift= 23
> CPU= 3, lastcg= 1298, targetcg= 8732, myshift= 23
> CPU= 4, lastcg= 1629, targetcg= 9062, myshift= 24
> CPU= 5, lastcg= 1959, targetcg= 9392, myshift= 25
> CPU= 6, lastcg= 2296, targetcg= 9730, myshift= 26
> CPU= 7, lastcg= 2624, targetcg=10058, myshift= 27
> CPU= 8, lastcg= 2955, targetcg=10388, myshift= 28
> CPU= 9, lastcg= 3285, targetcg=10718, myshift= 29
> CPU= 10, lastcg= 3622, targetcg=11056, myshift= 30
> CPU= 11, lastcg= 3950, targetcg=11384, myshift= 31
> CPU= 12, lastcg= 4281, targetcg=11714, myshift= 32
> CPU= 13, lastcg= 4611, targetcg=12044, myshift= 33
> CPU= 14, lastcg= 4948, targetcg=12382, myshift= 34
> CPU= 15, lastcg= 5276, targetcg=12710, myshift= 35
> CPU= 16, lastcg= 5607, targetcg=13040, myshift= 36
> CPU= 17, lastcg= 5937, targetcg=13370, myshift= 37
> CPU= 18, lastcg= 6274, targetcg=13708, myshift= 38
> CPU= 19, lastcg= 6602, targetcg=14036, myshift= 39
> CPU= 20, lastcg= 6933, targetcg=14366, myshift= 40
> CPU= 21, lastcg= 7263, targetcg=14696, myshift= 41
> CPU= 22, lastcg= 7600, targetcg= 168, myshift= 42
> CPU= 23, lastcg= 7928, targetcg= 496, myshift= 42
> CPU= 24, lastcg= 8259, targetcg= 826, myshift= 42
> CPU= 25, lastcg= 8589, targetcg= 1156, myshift= 42
> CPU= 26, lastcg= 8840, targetcg= 1408, myshift= 42
> CPU= 27, lastcg= 9003, targetcg= 1570, myshift= 41
> CPU= 28, lastcg= 9166, targetcg= 1734, myshift= 41
> CPU= 29, lastcg= 9329, targetcg= 1896, myshift= 40
> CPU= 30, lastcg= 9492, targetcg= 2060, myshift= 40
> CPU= 31, lastcg= 9655, targetcg= 2222, myshift= 39
> CPU= 32, lastcg= 9818, targetcg= 2386, myshift= 39
> CPU= 33, lastcg= 9981, targetcg= 2548, myshift= 38
> CPU= 34, lastcg=10144, targetcg= 2712, myshift= 38
> CPU= 35, lastcg=10307, targetcg= 2874, myshift= 37
> CPU= 36, lastcg=10470, targetcg= 3038, myshift= 37
> CPU= 37, lastcg=10633, targetcg= 3200, myshift= 36
> CPU= 38, lastcg=10796, targetcg= 3364, myshift= 36
> CPU= 39, lastcg=10959, targetcg= 3526, myshift= 35
> CPU= 40, lastcg=11122, targetcg= 3690, myshift= 35
> CPU= 41, lastcg=11285, targetcg= 3852, myshift= 34
> CPU= 42, lastcg=11448, targetcg= 4016, myshift= 34
> CPU= 43, lastcg=11611, targetcg= 4178, myshift= 33
> CPU= 44, lastcg=11774, targetcg= 4342, myshift= 33
> CPU= 45, lastcg=11937, targetcg= 4504, myshift= 32
> CPU= 46, lastcg=12100, targetcg= 4668, myshift= 32
> CPU= 47, lastcg=12263, targetcg= 4830, myshift= 31
> CPU= 48, lastcg=12426, targetcg= 4994, myshift= 31
> CPU= 49, lastcg=12589, targetcg= 5156, myshift= 30
> CPU= 50, lastcg=12752, targetcg= 5320, myshift= 30
> CPU= 51, lastcg=12915, targetcg= 5482, myshift= 29
> CPU= 52, lastcg=13078, targetcg= 5646, myshift= 29
> CPU= 53, lastcg=13240, targetcg= 5808, myshift= 28
> CPU= 54, lastcg=13403, targetcg= 5970, myshift= 28
> CPU= 55, lastcg=13565, targetcg= 6132, myshift= 27
> CPU= 56, lastcg=13728, targetcg= 6296, myshift= 27
> CPU= 57, lastcg=13890, targetcg= 6458, myshift= 26
> CPU= 58, lastcg=14053, targetcg= 6620, myshift= 26
> CPU= 59, lastcg=14215, targetcg= 6782, myshift= 25
> CPU= 60, lastcg=14378, targetcg= 6946, myshift= 25
> CPU= 61, lastcg=14540, targetcg= 7108, myshift= 24
> CPU= 62, lastcg=14703, targetcg= 7270, myshift= 24
> CPU= 63, lastcg=14865, targetcg= 7432, myshift= 23
> nsb->shift = 42, nsb->bshift= 0
> Listing Scalars
> nsb->nodeid: 0
> nsb->nnodes: 64
> nsb->cgtotal: 14866
> nsb->natoms: 31242
> nsb->shift: 42
> nsb->bshift: 0
> Nodeid index homenr cgload workload
> 0 0 488 299 299
> 1 488 491 634 634
> 2 979 488 971 971
> 3 1467 488 1299 1299
> 4 1955 488 1630 1630
> 5 2443 486 1960 1960
> 6 2929 488 2297 2297
> 7 3417 488 2625 2625
> 8 3905 488 2956 2956
> 9 4393 486 3286 3286
> 10 4879 488 3623 3623
> 11 5367 488 3951 3951
> 12 5855 488 4282 4282
> 13 6343 486 4612 4612
> 14 6829 488 4949 4949
> 15 7317 488 5277 5277
> 16 7805 488 5608 5608
> 17 8293 486 5938 5938
> 18 8779 488 6275 6275
> 19 9267 488 6603 6603
> 20 9755 488 6934 6934
> 21 10243 486 7264 7264
> 22 10729 488 7601 7601
> 23 11217 488 7929 7929
> 24 11705 488 8260 8260
> 25 12193 486 8590 8590
> 26 12679 488 8841 8841
> 27 13167 489 9004 9004
> 28 13656 489 9167 9167
> 29 14145 489 9330 9330
> 30 14634 489 9493 9493
> 31 15123 489 9656 9656
> 32 15612 489 9819 9819
> 33 16101 489 9982 9982
> 34 16590 489 10145 10145
> 35 17079 489 10308 10308
> 36 17568 489 10471 10471
> 37 18057 489 10634 10634
> 38 18546 489 10797 10797
> 39 19035 489 10960 10960
> 40 19524 489 11123 11123
> 41 20013 489 11286 11286
> 42 20502 489 11449 11449
> 43 20991 489 11612 11612
> 44 21480 489 11775 11775
> 45 21969 489 11938 11938
> 46 22458 489 12101 12101
> 47 22947 489 12264 12264
> 48 23436 489 12427 12427
> 49 23925 489 12590 12590
> 50 24414 489 12753 12753
> 51 24903 489 12916 12916
> 52 25392 489 13079 13079
> 53 25881 486 13241 13241
> 54 26367 489 13404 13404
> 55 26856 486 13566 13566
> 56 27342 489 13729 13729
> 57 27831 486 13891 13891
> 58 28317 489 14054 14054
> 59 28806 486 14216 14216
> 60 29292 489 14379 14379
> 61 29781 486 14541 14541
> 62 30267 489 14704 14704
> 63 30756 486 14866 14866
>
> parameters of the run:
> integrator = md
> nsteps = 100000
> init_step = 0
> ns_type = Grid
> nstlist = 10
> ndelta = 2
> bDomDecomp = FALSE
> decomp_dir = 0
> nstcomm = 1
> comm_mode = Linear
> nstcheckpoint = 1000
> nstlog = 100
> nstxout = 1000
> nstvout = 25000
> nstfout = 0
> nstenergy = 100
> nstxtcout = 500
> init_t = 0
> delta_t = 0.002
> xtcprec = 1000
> nkx = 64
> nky = 128
> nkz = 64
> pme_order = 4
> ewald_rtol = 1e-05
> ewald_geometry = 0
> epsilon_surface = 0
> optimize_fft = TRUE
> ePBC = xyz
> bUncStart = FALSE
> bShakeSOR = FALSE
> etc = Nose-Hoover
> epc = Parrinello-Rahman
> epctype = Isotropic
> tau_p = 5
> ref_p (3x3):
> ref_p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
> ref_p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
> ref_p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
> compress (3x3):
> compress[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
> compress[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
> compress[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
> andersen_seed = 815131
> rlist = 1
> coulombtype = PME
> rcoulomb_switch = 0
> rcoulomb = 1
> vdwtype = Cut-off
> rvdw_switch = 0
> rvdw = 1
> epsilon_r = 1
> epsilon_rf = 1
> tabext = 1
> gb_algorithm = Still
> nstgbradii = 1
> rgbradii = 2
> gb_saltconc = 0
> implicit_solvent = No
> DispCorr = No
> fudgeQQ = 1
> free_energy = no
> init_lambda = 0
> sc_alpha = 0
> sc_power = 0
> sc_sigma = 0.3
> delta_lambda = 0
> disre_weighting = Conservative
> disre_mixed = FALSE
> dr_fc = 1000
> dr_tau = 0
> nstdisreout = 100
> orires_fc = 0
> orires_tau = 0
> nstorireout = 100
> dihre-fc = 1000
> dihre-tau = 0
> nstdihreout = 100
> em_stepsize = 0.001
> em_tol = 1e-06
> niter = 1000
> fc_stepsize = 0
> nstcgsteep = 10000
> nbfgscorr = 10
> ConstAlg = Lincs
> shake_tol = 0.0001
> lincs_order = 4
> lincs_warnangle = 30
> lincs_iter = 1
> bd_fric = 0
> ld_seed = 1993
> cos_accel = 0
> deform (3x3):
> deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
> userint1 = 0
> userint2 = 0
> userint3 = 0
> userint4 = 0
> userreal1 = 0
> userreal2 = 0
> userreal3 = 0
> userreal4 = 0
> grpopts:
> nrdf: 75399
> ref_t: 300
> tau_t: 0.5
> anneal: No
> ann_npoints: 0
> acc: 0 0 0
> nfreeze: N N N
> energygrp_flags[ 0]: 0 0 0
> energygrp_flags[ 1]: 0 0 0
> energygrp_flags[ 2]: 0 0 0
> efield-x:
> n = 0
> efield-xt:
> n = 0
> efield-y:
> n = 0
> efield-yt:
> n = 0
> efield-z:
> n = 0
> efield-zt:
> n = 0
> bQMMM = FALSE
> QMconstraints = 0
> QMMMscheme = 0
> scalefactor = 1
> qm_opts:
> ngQM = 0
> Max number of graph edges per atom is 4
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw: FALSE
> Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
> Cut-off's: NS: 1 Coulomb: 1 LJ: 1
> System total charge: 0.000
> Generated table with 1000 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 500 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 500 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 500 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
>
> Enabling SPC water optimization for 6108 molecules.
>
> Will do PME sum in reciprocal space.
> [End]
> --------------------------------------------------------------------------
> ERROR MESSAGE:
>
> Reading file topol.tpr, VERSION 3.3.1 (single precision)
>
> Back Off! I just backed up ener.edr to ./#ener.edr.1#
> starting mdrun 'sivdppc'
> 100000 steps, 200.0 ps.
>
>
> Back Off! I just backed up traj.trr to ./#traj.trr.1#
>
> Back Off! I just backed up traj.xtc to ./#traj.xtc.1#
>
> Back Off! I just backed up step-1.pdb to ./#step-1.pdb.1#
> ERROR: 0031-250 task 62: Segmentation fault
> ERROR: 0031-250 task 54: Segmentation fault
> ERROR: 0031-250 task 58: Segmentation fault
> ERROR: 0031-250 task 50: Segmentation fault
> ERROR: 0031-250 task 51: Segmentation fault
>
> Back Off! I just backed up step0.pdb to ./#step0.pdb.1#
> ERROR: 0031-250 task 19: Segmentation fault
> ERROR: 0031-250 task 28: Segmentation fault
> ERROR: 0031-250 task 49: Segmentation fault
> ERROR: 0031-250 task 17: Segmentation fault
> ERROR: 0031-250 task 20: Segmentation fault
> ERROR: 0031-250 task 23: Segmentation fault
> ERROR: 0031-250 task 26: Segmentation fault
> ERROR: 0031-250 task 27: Segmentation fault
> ERROR: 0031-250 task 31: Segmentation fault
> Wrote pdb files with previous and current coordinates
> ERROR: 0031-250 task 52: Segmentation fault
> ERROR: 0031-250 task 18: Segmentation fault
> ERROR: 0031-250 task 60: Segmentation fault
> ERROR: 0031-250 task 24: Segmentation fault
> ERROR: 0031-250 task 16: Segmentation fault
> ERROR: 0031-250 task 30: Segmentation fault
> ERROR: 0031-250 task 21: Segmentation fault
> ERROR: 0031-250 task 14: Segmentation fault
> ERROR: 0031-250 task 48: Segmentation fault
> ERROR: 0031-250 task 38: Segmentation fault
> ERROR: 0031-250 task 22: Segmentation fault
> ERROR: 0031-250 task 46: Segmentation fault
> ERROR: 0031-250 task 3: Segmentation fault
> ERROR: 0031-250 task 45: Segmentation fault
> ERROR: 0031-250 task 37: Segmentation fault
> ERROR: 0031-250 task 40: Segmentation fault
> ERROR: 0031-250 task 8: Segmentation fault
> ERROR: 0031-250 task 15: Segmentation fault
> ERROR: 0031-250 task 33: Segmentation fault
> ERROR: 0031-250 task 39: Segmentation fault
> ERROR: 0031-250 task 44: Segmentation fault
> ERROR: 0031-250 task 56: Segmentation fault
> ERROR: 0031-250 task 43: Segmentation fault
> ERROR: 0031-250 task 4: Segmentation fault
> ERROR: 0031-250 task 12: Segmentation fault
> ERROR: 0031-250 task 29: Segmentation fault
> ERROR: 0031-250 task 35: Segmentation fault
> ERROR: 0031-250 task 25: Segmentation fault
> ERROR: 0031-250 task 6: Segmentation fault
> ERROR: 0031-250 task 42: Segmentation fault
> ERROR: 0031-250 task 13: Segmentation fault
> ERROR: 0031-250 task 1: Segmentation fault
> ERROR: 0031-250 task 9: Segmentation fault
> ERROR: 0031-250 task 10: Segmentation fault
> ERROR: 0031-250 task 2: Segmentation fault
> ERROR: 0031-250 task 47: Segmentation fault
> ERROR: 0031-250 task 5: Segmentation fault
> ERROR: 0031-250 task 7: Segmentation fault
> ERROR: 0031-250 task 11: Segmentation fault
> ERROR: 0031-250 task 32: Segmentation fault
> ERROR: 0031-250 task 34: Segmentation fault
> ERROR: 0031-250 task 41: Segmentation fault
> ERROR: 0031-250 task 36: Segmentation fault
> ERROR: 0031-250 task 55: Terminated
> ERROR: 0031-250 task 59: Terminated
> ERROR: 0031-250 task 53: Terminated
> ERROR: 0031-250 task 57: Terminated
> ERROR: 0031-250 task 61: Terminated
> ERROR: 0031-250 task 63: Terminated
> ERROR: 0031-250 task 0: Terminated
> [End]
>
You have ncpu = nkx (the number of PME grid points along x), so with 64
CPUs each CPU gets only one grid plane. IIRC a bug has been fixed in CVS
for the case ngridpoint/cpu < 2, so it may work better with the CVS
version.
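
To make the arithmetic concrete, here is a minimal sketch (not part of the
original exchange) of the check described above, assuming the problematic
condition in 3.3.1 is simply fewer than two x grid planes per MPI rank. The
function names, the threshold of 2, and the workaround hints in the comments
(fewer CPUs, or a larger fourier_nx followed by regenerating the .tpr with
grompp) are illustrative suggestions, not the actual code inside mdrun:

# Illustrative check (assumption): the run is at risk whenever some rank
# receives fewer than two PME grid planes along x.
def planes_per_rank(nkx, nranks):
    """Smallest number of x grid planes any rank receives (integer division)."""
    return nkx // nranks

def check_pme_decomposition(nkx, nranks, min_planes=2):
    planes = planes_per_rank(nkx, nranks)
    if planes < min_planes:
        print("nkx=%d on %d ranks: only %d plane(s) per rank -> risky with "
              "GROMACS 3.3.1; try fewer CPUs, or a larger fourier_nx and "
              "regenerate the .tpr with grompp." % (nkx, nranks, planes))
    else:
        print("nkx=%d on %d ranks: %d planes per rank -> should be fine."
              % (nkx, nranks, planes))

check_pme_decomposition(64, 32)   # the working 32-CPU run: 2 planes per rank
check_pme_decomposition(64, 64)   # the failing 64-CPU run: 1 plane per rank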
--
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596, 75124 Uppsala, Sweden
phone: 46 18 471 4205 fax: 46 18 511 755
spoel at xray.bmc.uu.se spoel at gromacs.org http://folding.bmc.uu.se
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++