[gmx-users] Gromacs slow and crashes on Leopard.

Hadas Leonov hleonov at alum.cs.huji.ac.il
Tue Nov 27 18:28:51 CET 2007


Hi everybody,

I've written here before but got no reply. This problem is critical
for me, since I cannot run Gromacs.

I have installed Gromacs 3.3.2 on Mac OS X Leopard. It did not build
with LAM/MPI, so I installed it with Open MPI instead.
The build with LAM/MPI failed at the link step:

-----
mpicc -I/sw/include -framework Accelerate -o grompp topio.o toppush.o
topcat.o topshake.o convparm.o tomorse.o sorting.o splitter.o
vsite_parm.o readir.o add_par.o topexcl.o toputil.o topdirs.o grompp.o
compute_io.o  -L/sw/lib ../mdlib/.libs/libmd_mpi.a -L/usr/X11/lib
../gmxlib/.libs/libgmx_mpi.a /usr/local/lib/libfftw3f.a -lm
/sw/lib/libXm.dylib /usr/X11/lib/libXt.6.0.0.dylib
/usr/X11/lib/libSM.6.0.0.dylib /usr/X11/lib/libICE.6.3.0.dylib
/usr/X11/lib/libXp.6.2.0.dylib /usr/X11/lib/libXext.6.4.0.dylib
/usr/X11/lib/libX11.6.2.0.dylib /usr/X11/lib/libXau.6.0.0.dylib
/usr/X11/lib/libXdmcp.6.0.0.dylib
Undefined symbols:
   "_lam_mpi_byte", referenced from:
       _lam_mpi_byte$non_lazy_ptr in libgmx_mpi.a(network.o)
   "_lam_mpi_float", referenced from:
       _lam_mpi_float$non_lazy_ptr in libgmx_mpi.a(network.o)
   "_lam_mpi_comm_world", referenced from:
       _lam_mpi_comm_world$non_lazy_ptr in libgmx_mpi.a(network.o)
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[3]: *** [grompp] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all] Error 2
make: *** [all-recursive] Error 1
---
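
One possible reading of the undefined `_lam_mpi_*` symbols (an assumption, not something the log proves) is that libgmx_mpi.a was compiled against LAM's headers while the `mpicc` found at link time belongs to a different MPI installation. Both the LAM/MPI and Open MPI compiler wrappers accept `-showme`, which prints the underlying compiler command. The sketch below guesses the implementation from that output. The function names and the library-name heuristics are illustrative assumptions, not part of Gromacs or of either MPI.

```python
import shutil
import subprocess


def classify_wrapper(showme_output):
    """Guess the MPI implementation from a wrapper's link line.

    Heuristic (assumption): LAM/MPI links against -llam, while Open MPI
    links against libraries named open-rte / open-pal.
    """
    text = showme_output.lower()
    if "-llam" in text:
        return "LAM/MPI"
    if "open-rte" in text or "open-pal" in text or "openmpi" in text:
        return "Open MPI"
    return "unknown"


def detect_mpicc():
    """Return (path, implementation) for the first mpicc on PATH, or None."""
    path = shutil.which("mpicc")
    if path is None:
        return None
    result = subprocess.run(
        [path, "-showme"], capture_output=True, text=True, timeout=30
    )
    return path, classify_wrapper(result.stdout)


if __name__ == "__main__":
    found = detect_mpicc()
    if found is None:
        print("no mpicc on PATH")
    else:
        print("mpicc at %s looks like: %s" % found)
```

If the wrapper reported here is not the LAM one, the Gromacs build may be mixing two MPI installations, and putting the intended `mpicc` first on PATH before reconfiguring would be worth trying.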

After installing with Open MPI, I ran some benchmarks on 4 processors
on a Mac Pro:

d.villin:
Leopard performance:   13714 ps/day
old OS performance:    41143 ps/day
gmx-benchmark:         48000 ps/day

d.poly-ch2:
Leopard performance:    8640 ps/day
old OS performance:    18000 ps/day
gmx-benchmark:         20571 ps/day

"Old OS" refers to OS X 10.4.9.
The slowdown also occurs when running on only one CPU: d.villin ran
6 times slower than usual. So it can't just be Open MPI's fault, can it?
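
The slowdown implied by the 4-processor figures above can be checked with a few lines of arithmetic. This is a minimal sketch. The dictionary keys are illustrative names, and the numbers are copied from the benchmark lists.

```python
# Benchmark figures in ps/day, copied from the lists above.
benchmarks = {
    "d.villin":   {"leopard": 13714, "old_os": 41143, "reference": 48000},
    "d.poly-ch2": {"leopard": 8640,  "old_os": 18000, "reference": 20571},
}


def slowdown(entry):
    """Return how many times slower Leopard is than OS X 10.4.9."""
    return entry["old_os"] / entry["leopard"]


for name, entry in benchmarks.items():
    print(f"{name}: {slowdown(entry):.2f}x slower on Leopard")
```

On 4 processors this gives roughly a 3x slowdown for d.villin and about 2.1x for d.poly-ch2, which is smaller than the 6x seen on a single CPU.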

Could it be because I compiled Gromacs with the ia32 optimizations disabled?

As for the crashes: I ran a 0.5 ns position-restraints simulation, which
usually takes 2 hours on 2 CPUs. The predicted finish time was 6 hours,
but it crashed after 40 minutes with the following errors:

--
step 23070, will finish at Tue Nov 20 23:18:28 2007
[tmdec2:69924] *** Process received signal ***
[tmdec2:69924] Signal: Segmentation fault (11)
[tmdec2:69924] Signal code: Address not mapped (1)
[tmdec2:69924] Failing at address: 0x49c78d52
[tmdec2:69925] *** Process received signal ***
[tmdec2:69925] Signal: Segmentation fault (11)
[tmdec2:69925] Signal code: Address not mapped (1)
[tmdec2:69925] Failing at address: 0x49aeac55
[tmdec2:69926] *** Process received signal ***
[tmdec2:69926] Signal: Segmentation fault (11)
[tmdec2:69926] Signal code: Address not mapped (1)
[tmdec2:69926] Failing at address: 0x48c74d8c
[tmdec2:69927] *** Process received signal ***
[tmdec2:69927] Signal: Segmentation fault (11)
[tmdec2:69927] Signal code: Address not mapped (1)
[tmdec2:69927] Failing at address: 0x49e5e700
[ 1] [0xbfffd678, 0x49aeac55] (-P-)
[ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] [ 1] [0xbfffd678, 0x48c74d8c] (-P-)
[ 2] [ 1] [0xbfffd678, 0x49c78d52] (-P-)
[ 2] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] [ 1] [0xbfffd678, 0x49e5e700] (-P-)
[ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) (mca_coll_basic_alltoallv_intra + 0x28b)  
[0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
[10] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
[10] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69925] *** End of error message ***
(do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
[10] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] [0xbfffdd78, 0x0005d652]
[10] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69926] *** End of error message ***
(mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69924] *** End of error message ***
(start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69927] *** End of error message ***
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/pls_base_orted_cmds.c
at line 275
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/pls_rsh_module.c
at line 1164
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/errmgr/hnp/errmgr_hnp.c
at line 90
mpirun noticed that job rank 1 with PID 69925 on node
tmdec2.ls.huji.ac.il exited on signal 11 (Segmentation fault).
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/pls_base_orted_cmds.c
at line 188
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/pls_rsh_module.c
at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.  
Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------
1 additional process aborted (not shown)
---
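
Because the four processes wrote their traces at the same time, the frames above are interleaved. The sketch below recovers a single call chain from such output. It assumes that the first bracketed address of a frame grows with frame depth, as it does on the intact lines of the log, and the names `FRAME` and `call_chain` are illustrative. The sample is a handful of lines copied from the trace, and the full log can be substituted for it.

```python
import re

# A few lines copied from the interleaved trace above.
SAMPLE = """\
[ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) (main + 0x463) [0xbfffeb98, 0x00018c69]
"""

# A frame is "(function + 0xoffset) [0xaddr1, 0xaddr2]". Only a function
# name that is directly followed by its address pair is trusted, which
# skips names whose addresses were lost to interleaving.
FRAME = re.compile(r"\((\w+) \+ 0x[0-9a-f]+\)\s*\[(0x[0-9a-f]+), 0x[0-9a-f]+\]")


def call_chain(log_text):
    """Return unique function names ordered from innermost to outermost."""
    frames = {}
    for name, first_addr in FRAME.findall(log_text):
        frames[int(first_addr, 16)] = name
    return [frames[addr] for addr in sorted(frames)]


for depth, name in enumerate(call_chain(SAMPLE)):
    print(depth, name)
```

Read this way, the trace suggests that every process failed inside Open MPI's `MPI_Alltoallv`, called from `pmeredist` during the PME step.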

So it looks like the problem is with Open MPI, but since I can't compile
with LAM/MPI, there's no way of knowing.

Help? Any ideas?

Thanks in advance,
Hadas Leonov.



