[gmx-users] mdrun_mpi seg fault if N_atoms/cpu > 4096 ?

David spoel at xray.bmc.uu.se
Wed Nov 9 20:31:46 CET 2005


On Wed, 2005-11-09 at 18:46 +0200, Atte Sillanpää wrote:
> Hi,
> 
> we have a system with 128 DPPC molecules and a layer of water. All goes 
> well with version 3.2.1 if the number of atoms per CPU is less than 4096. 
> Otherwise we get a seg fault right at the start, before any real MD is done:

This could be the old bug that appears when there is no water on the first
processor. It has been fixed in 3.3, but you can also use the shuffle option
as a workaround (it will also give you better performance).
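In 3.2.1 shuffling is done by grompp; something along these lines should do
it (file names are the grompp defaults, adjust to your own setup, and keep
-np equal to the number of nodes you actually run on):

  grompp -np 4 -shuffle -f grompp.mdp -c conf.gro -p topol.top -o topol.tpr

and then run mdrun_mpi on the same number of nodes as before. Note that
shuffling changes the atom order in the output files, so keep that in mind
when post-processing the trajectories.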

> Parallelized PME sum used.
> Using the FFTW library (Fastest Fourier Transform in the West)
> PARALLEL FFT DATA:
>     local_nx:                  16  local_x_start:                   0
>     local_ny_after_transpose:  16  local_y_start_after_transpose    0
>     total_local_size:         67584
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>    0:  rest, initial mass: 159751
> There are: 4341 Atom
> Removing pbc first time
> Done rmpbc
> Started mdrun on node 0 Wed Nov  9 17:27:13 2005
> Initial temperature: 320.01 K
>             Step           Time         Lambda
>                0        0.00000        0.00000
> 
> However, with 8 CPUs there's no problem. We see this on an Opteron 
> cluster running Rocks 3.2.0, on a Power4 and on a Sun Fire 25k (crash with 
> 16384 atoms, but not with 16381). There were also no problems with Gromacs 
> version 3.0.3.
> 
> There should not be anything special in the *.mdp file, and its parameters 
> did not seem to influence the behaviour. A hasty analysis of the Sun Fire 
> 25k core file gives the following:
> 
> dbx -f mdrun_mpi core
> 
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.3' 
> in your .dbxrc
> Reading mdrun_mpi
> dbx: internal warning: writable memory segment 0xbcc00000[21331968] of 
> size 0 in core
> core file header read successfully
> Reading ld.so.1
> Reading libmpi.so.1
> ...
> Reading libdoor.so.1
> Reading tcppm.so.2
> t@1 (l@1) program terminated by signal SEGV (no mapping at the fault 
> address)
> 0x000aa43c: pbc_rvec_sub+0x001c:        ld       [%o0 + 8], %f8
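For what it is worth, pbc_rvec_sub just computes the periodic-image corrected
difference between two atom positions, roughly like the sketch below (an
illustration only, not the actual 3.2.1 source; a rectangular box is
assumed). A SEGV there means one of the position pointers refers to memory
this node never received, which would fit a broken atom distribution rather
than anything in your mdp settings:

  typedef double rvec[3];                /* x, y, z components */

  static void pbc_diff_sketch(const rvec box, const rvec xi,
                              const rvec xj, rvec dx)
  {
      int d;

      for (d = 0; d < 3; d++) {
          /* plain difference; reading xi[d]/xj[d] is where an invalid
           * pointer gives "no mapping at the fault address"           */
          dx[d] = xi[d] - xj[d];

          /* fold back into the central box (minimum image convention) */
          while (dx[d] >  0.5 * box[d]) dx[d] -= box[d];
          while (dx[d] < -0.5 * box[d]) dx[d] += box[d];
      }
  }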
> 
> Any ideas on how to proceed? Surely people have run bigger systems per 
> CPU with gmx?
> 
> Cheers,
> 
> Atte
-- 
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,          75124 Uppsala, Sweden
phone:  46 18 471 4205          fax: 46 18 511 755
spoel at xray.bmc.uu.se    spoel at gromacs.org   http://xray.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



