Re: [gmx-users] “Fatal error in PMPI_Bcast: Other MPI error, …..” occurs when using the ‘particle decomposition’ option.

Mark Abraham mark.abraham at anu.edu.au
Tue Jun 1 04:29:24 CEST 2010


----- Original Message -----
From: xhomes at sohu.com
Date: Tuesday, June 1, 2010 11:53
Subject: [gmx-users] “Fatal error in PMPI_Bcast: Other MPI error, …..” occurs when using the ‘particle decomposition’ option.
To: gmx-users <gmx-users at gromacs.org>

> Hi everyone on gmx-users,
> 
> I ran into a problem when using the ‘particle decomposition’ option 
> in an NPT MD simulation of the Engrailed homeodomain (En) in a 
> water box neutralized with Cl- ions. The run crashes with the error 
> “Fatal error in PMPI_Bcast: Other MPI error, error stack: …..”. 
> However, with the ‘domain decomposition’ option everything is 
> fine. I am using GROMACS 4.0.5 and 4.0.7, and the MPI library is 
> mpich2-1.2.1p1. The box is (5.386 nm)^3. The MDP file is listed 
> below:
> ########################################################
> title                    = En
> ;cpp                      = /lib/cpp
> ;include                  = -I../top
> define                   = 
> integrator               = md
> dt                       = 0.002
> nsteps                   = 3000000
> nstxout                  = 500
> nstvout                  = 500
> nstlog                   = 250
> nstenergy                = 250
> nstxtcout                 = 500
> comm-mode                = Linear
> nstcomm                  = 1
> 
> ;xtc_grps                 = Protein
> energygrps               = protein non-protein
> 
> nstlist                  = 10
> ns_type                  = grid
> pbc                      = xyz	;default xyz
> ;periodic_molecules      = yes	;default no
> rlist                    = 1.0
> 
> coulombtype              = PME
> rcoulomb                 = 1.0
> vdwtype                  = Cut-off
> rvdw                     = 1.4
> fourierspacing           = 0.12
> fourier_nx               = 0
> fourier_ny               = 0
> fourier_nz               = 0
> pme_order                = 4
> ewald_rtol               = 1e-5
> optimize_fft             = yes
> 
> tcoupl                   = v-rescale
> tc_grps                  = protein non-protein
> tau_t                    = 0.1  0.1
> ref_t                    = 298  298
> Pcoupl                   = Parrinello-Rahman
> pcoupltype               = isotropic
> tau_p                    = 0.5
> compressibility          = 4.5e-5
> ref_p                    = 1.0
> 
> gen_vel                  = yes
> gen_temp                 = 298
> gen_seed                 = 173529
> 
> constraints              = hbonds
> lincs_order              = 10
> ########################################################
> 
> When I run the MD with “nohup mpiexec -np 2 mdrun_dmpi -s 
> 11_Trun.tpr -g 12_NTPmd.log -o 12_NTPmd.trr -c 12_NTPmd.pdb -e 
> 12_NTPmd_ener.edr -cpo 12_NTPstate.cpt &”, everything is OK.
> 
> Since the system doesn’t support more than 2 processes under the 
> ‘domain decomposition’ option, it took me about 30 days to 
> calculate a 6 ns trajectory. So I decided to use the ‘particle 

Why no more than 2? What GROMACS version? Why are you using double precision with temperature coupling?

MPICH has known issues. Use OpenMPI.
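
If it helps, here is a minimal rebuild sketch along those lines, assuming the autoconf build of GROMACS 4.0.x and OpenMPI's mpicc wrapper on your PATH (the program suffix and install prefix below are illustrative choices, not from your setup):

    # rebuild against OpenMPI instead of MPICH2 (sketch)
    export CC=mpicc                              # OpenMPI compiler wrapper
    ./configure --enable-mpi \
                --program-suffix=_ompi \
                --prefix=$HOME/gromacs-4.0.7-ompi
    make && make install

Then rerun your mpiexec command with the resulting mdrun binary and see whether the PMPI_Bcast failure persists.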

> decomposition’ option. The command line is “nohup mpiexec -np 6 
> mdrun_dmpi -pd -s 11_Trun.tpr -g 12_NTPmd.log -o 12_NTPmd.trr -c 
> 12_NTPmd.pdb -e 12_NTPmd_ener.edr -cpo 12_NTPstate.cpt &”. The 
> crash reported in the nohup file is shown below:
> ####################
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1302)......................: MPI_Bcast(buf=0x8fedeb0, 
> count=60720, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast(998).......................: 
> MPIR_Bcast_scatter_ring_allgather(842): 
> MPIR_Bcast_binomial(187)..............: 
> MPIC_Send(41).........................: 
> MPIC_Wait(513)........................: 
> MPIDI_CH3I_Progress(150)..............: 
> MPID_nem_mpich2_blocking_recv(948)....: 
> MPID_nem_tcp_connpoll(1720)...........: 
> state_commrdy_handler(1561)...........: 
> MPID_nem_tcp_send_queued(127).........: writev to socket failed -
> Bad address
> rank 0 in job 25  cluster.cn_52655   caused 
> collective abort of all ranks
> exit status of rank 0: killed by signal 9
> ####################
> 
> The end of the log file is shown below:
> ####################
> ……..
>    bQMMM                = FALSE
>    QMconstraints        = 0
>    QMMMscheme           = 0
>    scalefactor          = 1
> qm_opts:
>    ngQM                 = 0
> ####################
> 
> I’ve searched the gmx-users mailing list and tried adjusting the 
> MD parameters, but found no solution. The "mpiexec -np x" 
> option doesn't work except when x=1. I did find that when the 
> whole En protein is restrained with position restraints 
> (define = -DPOSRES), the ‘particle decomposition’ option works. 
> However, this is not the kind of MD I want to run.
>  
> Could anyone help me with this problem? I would also like to know 
> how I can accelerate this kind of MD (a long simulation of a 
> small system) using GROMACS. Thanks a lot!
> 
> (Further information about the simulated system: it contains one 
> En protein (54 residues, 629 atoms), 4848 SPC/E waters, and 7 Cl- 
> ions to neutralize the system. The system was minimized first; a 
> 20 ps MD of the waters and ions was also performed before the EM.)

This should be bread-and-butter for either decomposition scheme on up to at least 16 processors, given a correctly compiled GROMACS and a working MPI library.
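
As a rough way to see how far your system actually scales before committing to another long run, you could time short jobs at several process counts; a sketch reusing your .tpr, with the loop values, the -maxh limit and the bench_np* names as illustrative choices rather than anything from your setup:

    # short timing runs at different process counts (sketch)
    for n in 2 4 8 16; do
        mpiexec -np $n mdrun_dmpi -s 11_Trun.tpr -deffnm bench_np$n -maxh 0.1
    done
    grep 'Performance' bench_np*.log

-maxh stops each job after roughly six minutes, and the Performance section at the end of each log gives you ns/day for that process count.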

Mark


