[gmx-users] Vesicle simulation crashed with dry martini force field

Szilárd Páll pall.szilard at gmail.com
Mon Jan 18 19:52:31 CET 2016


Hi,

While some details that would be useful are missing, already from this it is
clear that you're pushing the DD to its limits. The output "vol min/aver
0.108!" means that the minimum cell volume is about 9-10x smaller than the
average, and the "!" means that DD is limited, as noted in the Z direction.
Hence, I'd definitely try to increase the number of threads per rank and use
fewer domains in Z; e.g. 3 instead of 4 should already make a significant
difference.
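
For example (a sketch only; I'm assuming GROMACS 5.0.x mdrun_mpi on the same
192 cores, and the -deffnm name below is made up), you can request the DD
grid explicitly with -dd:

  # keep 192 ranks but use only 3 domains in Z: 8 x 8 x 3 = 192
  mpirun -np 192 mdrun_mpi -dd 8 8 3 -deffnm vesicle_nvt

To also increase the threads per rank (e.g. -np 96 with -ntomp 2 and
-dd 8 4 3) you would need the Verlet cutoff scheme, since OpenMP is not
supported for the non-bonded work with the group scheme you are using.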

That still does not explain the crash, but I have no experience with Martini
simulations and I'm not sure what could be the reason for such behavior.
Hopefully others can pitch in.

Cheers,
--
Szilárd

PS: You can also just upload the log to some external storage and share a
link to it. In addition to the information you pasted, the actual performance
measurements from the end of the file are also of interest.


On Sat, Jan 16, 2016 at 12:30 AM, Shule Liu <shuleliu1985 at yahoo.com> wrote:

> Hi Szilárd,
>
> I've been testing the simulation with a 10 fs timestep but fewer cores (96
> cores). It has run normally for over 1 million steps so far.
>
> My log file is very large and I cannot attach it here, but I can paste the
> last few lines before the system crashed.
>
>            Step           Time         Lambda
>         6105780    61057.80000        0.00000
>
>    Energies (kJ/mol)
>            Bond       G96Angle    Proper Dih.  Improper Dih.        LJ (SR)
>     1.37280e+06    7.70904e+05    4.43898e+04    5.11671e+04   -1.90582e+07
>    Coulomb (SR) Position Rest.      Potential    Kinetic En.   Total Energy
>    -2.75356e+05    2.14035e+05   -1.68803e+07    4.67130e+06   -1.22090e+07
>     Temperature Pressure (bar)   Constr. rmsd
>     2.94781e+02   -1.02636e+00    2.66990e-05
>
> DD  load balancing is limited by minimum cell size in dimension Z
> DD  step 6105789  vol min/aver 0.108! load imb.: force 19.4%
>
>            Step           Time         Lambda
>         6105790    61057.90000        0.00000
>
>    Energies (kJ/mol)
>            Bond       G96Angle    Proper Dih.  Improper Dih.        LJ (SR)
>     1.37244e+06    7.70961e+05    4.50497e+04    5.12852e+04   -1.90591e+07
>    Coulomb (SR) Position Rest.      Potential    Kinetic En.   Total Energy
>    -2.75450e+05    2.14950e+05   -1.68799e+07    4.66950e+06   -1.22104e+07
>     Temperature Pressure (bar)   Constr. rmsd
>     2.94668e+02   -1.07720e+00    2.64488e-05
>
> DD  load balancing is limited by minimum cell size in dimension Z
> DD  step 6105799  vol min/aver 0.112! load imb.: force 25.2%
>
>            Step           Time         Lambda
>         6105800    61058.00000        0.00000
>
>    Energies (kJ/mol)
>            Bond       G96Angle    Proper Dih.  Improper Dih.        LJ (SR)
>     1.37125e+06    7.70718e+05    4.50438e+04    5.10340e+04   -1.90598e+07
>    Coulomb (SR) Position Rest.      Potential    Kinetic En.   Total Energy
>    -2.75554e+05    2.15002e+05   -1.68823e+07    4.67354e+06   -1.22088e+07
>     Temperature Pressure (bar)   Constr. rmsd
>     2.94922e+02   -9.62789e-01    2.68380e-05
>
> DD  load balancing is limited by minimum cell size in dimension Z
> DD  step 6105809  vol min/aver 0.107! load imb.: force 28.4%
>
>            Step           Time         Lambda
>         6105810    61058.10000        0.00000
>
>    Energies (kJ/mol)
>            Bond       G96Angle    Proper Dih.  Improper Dih.        LJ (SR)
>     1.37174e+06    7.71903e+05    4.46827e+04    5.07851e+04   -1.90577e+07
>    Coulomb (SR) Position Rest.      Potential    Kinetic En.   Total Energy
>    -2.75575e+05    2.14358e+05   -1.68798e+07    4.67174e+06   -1.22081e+07
>     Temperature Pressure (bar)   Constr. rmsd
>     2.94809e+02   -9.34474e-01    3.14118e-05
>
> There is no error message in the log file when the simulation crashed. The
> error message only shows up in the job output file of the cluster.
>
> From the log file, I can see that the load imbalance is large (the force
> imbalance is around 19-28%). I think this might have something to do with
> the inhomogeneity of my system. The domain decomposition information at the
> head of the log file is
>
> Initializing Domain Decomposition on 192 ranks
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Minimum cell size due to bonded interactions: 4.000 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 2.626 nm
> Estimated maximum distance required for P-LINCS: 2.626 nm
> Domain decomposition grid 8 x 6 x 4, separate PME ranks 0
> Domain decomposition rank 0, coordinates 0 0 0
>
> Using two step summing over 8 groups of on average 24.0 ranks
>
> Using 192 MPI processes
>
> The number of constraints is 286516
> There are inter charge-group constraints,
> will communicate selected coordinates each lincs iteration
> 229975 constraints are involved in constraint triangles,
> will apply an additional matrix expansion of order 4 for couplings
> between constraints inside triangles
> Setting the maximum number of constraint warnings to -1
> maxwarn < 0, will not stop on constraint errors
>
> Linking all bonded interactions to atoms
> There are 77859 inter charge-group virtual sites,
> will use an extra communication step for selected coordinates and forces
>
> The initial number of communication pulses is: X 1 Y 1 Z 1
> The initial domain decomposition cell size is: X 15.00 nm Y 20.00 nm Z
> 30.00 nm
>
> The maximum allowed distance for charge groups involved in interactions is:
>                  non-bonded interactions           1.400 nm
>             two-body bonded interactions  (-rdd)   4.000 nm
>           multi-body bonded interactions  (-rdd)   4.000 nm
>               virtual site constructions  (-rcon) 15.000 nm
>   atoms separated by up to 5 constraints  (-rcon) 15.000 nm
>
> When dynamic load balancing gets turned on, these settings will change to:
> The maximum number of communication pulses is: X 1 Y 1 Z 1
> The minimum size for domain decomposition cells is 4.000 nm
> The requested allowed shrink of DD cells (option -dds) is: 0.80
> The allowed shrink of domain decomposition cells is: X 0.27 Y 0.20 Z 0.13
> The maximum allowed distance for charge groups involved in interactions is:
>                  non-bonded interactions           1.400 nm
>             two-body bonded interactions  (-rdd)   4.000 nm
>           multi-body bonded interactions  (-rdd)   4.000 nm
>               virtual site constructions  (-rcon)  4.000 nm
>   atoms separated by up to 5 constraints  (-rcon)  4.000 nm
>
>
> Making 3D domain decomposition grid 8 x 6 x 4, home cell index 0 0 0
>
> Thanks.
>
> Shule
>
>
>
>
>
>
> On Friday, January 15, 2016 4:41 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
>
> Why do you think that it's the domain decomposition load balancing causing
> the crash rather than the time-step? You say you ran successfully on fewer
> CPU cores with a shorter time-step. What about fewer cores with a 10 fs
> time-step?
>
> It would help if you shared less mis-formatted information, or even a full
> log file.
>
> --
> Szilárd
>
> On Fri, Jan 15, 2016 at 10:12 PM, Shule Liu <shuleliu1985 at yahoo.com>
> wrote:
>
> Hi,
> I'm trying to simulate a large lipid vesicle (~100 nm in diameter) with
> the dry martini force field. The system consists of about 1.4 million
> particles. I'm trying to equilibrate the system in the NVT ensemble in a
> simulation box of length 120 nm, using a timestep of 10 fs. The
> simulation was running on 144 cores (6 nodes with 24 cores each).
> Below is my .mdp input file.
> define                   = -DPOSRES -DPOSRES_FC=1000 -DBILAYER_LIPIDHEAD_FC=200
> integrator               = sd
> tinit                    = 0.0
> dt                       = 0.01
> nsteps                   = 8000000
> nstxout                  = 100000
> nstvout                  = 10000
> nstfout                  = 10000
> nstlog                   = 10
> nstenergy                = 10000
> nstxtcout                = 1000
> xtc_precision            = 100
> nstlist                  = 10
> ns_type                  = grid
> pbc                      = xyz
> rlist                    = 1.4
> epsilon_r                = 15
> coulombtype              = Shift
> rcoulomb                 = 1.2
> vdw_type                 = Shift
> rvdw_switch              = 0.9
> rvdw                     = 1.2
> DispCorr                 = No
> tc-grps                  = system
> tau_t                    = 4.0
> ref_t                    = 295
> ; Pressure coupling:
> Pcoupl                   = no
> ; GENERATE VELOCITIES FOR STARTUP RUN:
> ;gen_vel                  = yes
> ;gen_temp                 = 295
> ;gen_seed                 = 1452274742
> refcoord_scaling         = all
> cutoff-scheme            = group
> The simulation crashed with the following error message.
> Step 6105820:
> Atom 164932 moved more than the distance allowed by the domain
> decomposition (4.000000) in direction Z
> distance out of cell 127480.656250
> Old coordinates:   38.785   21.966  103.077
> New coordinates: -477239.938 16192.882 127588.617
> Old cell boundaries in direction Z:   60.580  107.937
> New cell boundaries in direction Z:   60.632  107.958
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 5.0.4
> Source code file:
> /scratch/build/git/chemistry-roll/BUILD/sdsc-gromacs-5.0.4/gromacs-5.0.4/src/gromacs/mdlib/domdec.c,
> line: 4390
>
> Fatal error:
> An atom moved too far between two domain decomposition steps
> This usually means that your system is not well equilibrated
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Error on rank 58, will try to stop all ranks
> Halting parallel program mdrun_mpi on CPU 58 out of 144
>
> gcq#25: "This Puke Stinks Like Beer" (LIVE)
>
> [cli_58]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 58
> I think my simulation possibly crashed due to the large load imbalance
> generated by the domain decomposition. My system (a lipid vesicle with
> implicit solvent) is highly inhomogeneous, so the domain decomposition
> algorithm generates very uneven domains, with some domains empty and some
> full of particles. I tried to run the simulation with fewer CPUs (96 cores)
> and a smaller timestep (1 fs) and there wasn't any problem for over 6
> million steps.
> However, I would still like to use more cores and a larger timestep to
> equilibrate my system. Is there any better way to control the load balance
> and domain decomposition so that I could equilibrate the system more
> efficiently? The dry martini paper says that for this kind of vesicle
> simulation the domain decomposition scheme should be chosen carefully. Is
> there any guidance for doing so?
> Thanks very much.
> Shule
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
>

