[gmx-users] Simulations on Leopard - slow and crashing, cannot compile with lam-mpi

Hadas Leonov hleonov at cc.huji.ac.il
Wed Nov 21 11:24:35 CET 2007


Hi everybody,

I have installed Gromacs 3.3.2 on Mac OS X Leopard, but it now runs about 3
times slower than it did before, and in addition it crashes on simulations
that take more than 40 minutes.

For example, I ran a few benchmark runs; here are the results for 4
processors on a Mac Pro:

d.villin:
  Leopard performance:   13714 ps/day
  old OS performance:    41143 ps/day
  gmx-benchmark:         48000 ps/day

d.poly-ch2:
  Leopard performance:    8640 ps/day
  old OS performance:    18000 ps/day
  gmx-benchmark:         20571 ps/day

At first I thought that open-mpi was responsible for the slow speed, but
even when running on 1 CPU with mdrun, the performance for d.villin was
3592 ps/day on Leopard, compared to 18106 ps/day on the old OS.
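
(For reference, this is roughly how I launch these runs; mdrun_mpi is just the name of my MPI-enabled binary and the input file names are the ones shipped with the benchmark set, so adjust as needed. The ps/day figures are taken from the performance summary at the end of md.log.)
---
# 4-CPU benchmark run inside the d.villin directory
grompp -np 4 -f grompp.mdp -c conf.gro -p topol.top -o topol4.tpr
mpirun -np 4 mdrun_mpi -np 4 -s topol4.tpr -v

# 1-CPU comparison (the tpr is tied to the node count in 3.3, so regenerate it)
grompp -np 1 -f grompp.mdp -c conf.gro -p topol.top -o topol1.tpr
mdrun -s topol1.tpr -v
---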

As for the crashes: I ran a 0.5 ns position-restraint simulation which
usually takes 2 hours on 2 CPUs. The predicted finish time was 6 hours,
but it crashed after 40 minutes with the following errors:
---
step 23070, will finish at Tue Nov 20 23:18:28 2007
[tmdec2:69924] *** Process received signal ***
[tmdec2:69924] Signal: Segmentation fault (11)
[tmdec2:69924] Signal code: Address not mapped (1)
[tmdec2:69924] Failing at address: 0x49c78d52
[tmdec2:69925] *** Process received signal ***
[tmdec2:69925] Signal: Segmentation fault (11)
[tmdec2:69925] Signal code: Address not mapped (1)
[tmdec2:69925] Failing at address: 0x49aeac55
[tmdec2:69926] *** Process received signal ***
[tmdec2:69926] Signal: Segmentation fault (11)
[tmdec2:69926] Signal code: Address not mapped (1)
[tmdec2:69926] Failing at address: 0x48c74d8c
[tmdec2:69927] *** Process received signal ***
[tmdec2:69927] Signal: Segmentation fault (11)
[tmdec2:69927] Signal code: Address not mapped (1)
[tmdec2:69927] Failing at address: 0x49e5e700
[ 1] [0xbfffd678, 0x49aeac55] (-P-)
[ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] [ 1] [0xbfffd678, 0x48c74d8c] (-P-)
[ 2] [ 1] [0xbfffd678, 0x49c78d52] (-P-)
[ 2] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] [ 1] [0xbfffd678, 0x49e5e700] (-P-)
[ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
[ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
[ 4] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) (mca_coll_basic_alltoallv_intra + 0x28b)  
[0xbfffd7c8, 0x00a3a65b]
[ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
[ 6] [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
[10] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
[10] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
[ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
[ 8] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69925] *** End of error message ***
(do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] [0xbfffdc88, 0x0002ee56]
[ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
[10] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] [0xbfffdd78, 0x0005d652]
[10] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
[11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69926] *** End of error message ***
(mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
[12] (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (main + 0x463) [0xbfffeb98, 0x00018c69]
[13] (start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69924] *** End of error message ***
(start + 0x36) [0xbfffebbc, 0x0000216e]
[14] [0x00000000, 0x0000000e] (FP-)
[tmdec2:69927] *** End of error message ***
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/pls_base_orted_cmds.c at line 275
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/pls_rsh_module.c at line 1164
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/errmgr/hnp/errmgr_hnp.c at line 90
mpirun noticed that job rank 1 with PID 69925 on node tmdec2.ls.huji.ac.il exited on signal 11 (Segmentation fault).
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/pls_base_orted_cmds.c at line 188
[tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.  
Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------
1 additional process aborted (not shown)
---
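
All four backtraces die inside pmeredist -> MPI_Alltoallv, i.e. in the PME redistribution that goes through Open MPI. As a sanity check I can rerun the same system on a single node with the serial mdrun, which avoids that code path entirely; a sketch of what I mean (pr.mdp and after_em.gro are placeholders for my actual input files):
---
# regenerate the run input for a single node
grompp -np 1 -f pr.mdp -c after_em.gro -p topol.top -o pr1.tpr
# serial (non-MPI) mdrun; if this runs past the 40-minute mark without
# crashing, the segfault is most likely in the Open MPI code path
mdrun -s pr1.tpr -v
---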

It looks like GROMACS has trouble with open-mpi.
Before installing Leopard I was using lam-mpi, but I can't use it now
because the compilation only worked after I installed open-mpi and used
the ia32 disable flags in the configure script (--disable-ia32-3dnow;
see my previous post about that error).

When I tried to compile GROMACS with lam-mpi installed, this is the  
'make' error I got:
---
mpicc -I/sw/include -framework Accelerate -o grompp topio.o toppush.o topcat.o topshake.o convparm.o tomorse.o sorting.o splitter.o vsite_parm.o readir.o add_par.o topexcl.o toputil.o topdirs.o grompp.o compute_io.o -L/sw/lib ../mdlib/.libs/libmd_mpi.a -L/usr/X11/lib ../gmxlib/.libs/libgmx_mpi.a /usr/local/lib/libfftw3f.a -lm /sw/lib/libXm.dylib /usr/X11/lib/libXt.6.0.0.dylib /usr/X11/lib/libSM.6.0.0.dylib /usr/X11/lib/libICE.6.3.0.dylib /usr/X11/lib/libXp.6.2.0.dylib /usr/X11/lib/libXext.6.4.0.dylib /usr/X11/lib/libX11.6.2.0.dylib /usr/X11/lib/libXau.6.0.0.dylib /usr/X11/lib/libXdmcp.6.0.0.dylib
Undefined symbols:
   "_lam_mpi_byte", referenced from:
       _lam_mpi_byte$non_lazy_ptr in libgmx_mpi.a(network.o)
   "_lam_mpi_float", referenced from:
       _lam_mpi_float$non_lazy_ptr in libgmx_mpi.a(network.o)
   "_lam_mpi_comm_world", referenced from:
       _lam_mpi_comm_world$non_lazy_ptr in libgmx_mpi.a(network.o)
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[3]: *** [grompp] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all] Error 2
make: *** [all-recursive] Error 1
---
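
The undefined _lam_mpi_* symbols look like a mixed-MPI build: network.o inside libgmx_mpi.a was apparently compiled against LAM's mpi.h, while the mpicc doing the final link belongs to a different MPI (or stale objects from the earlier build are still around). What I would try is to check which wrapper actually gets picked up and then rebuild from a completely clean tree against one MPI only; a rough sketch (the _mpi suffix and the extra configure flag are just what I use):
---
# see which mpicc is first in the PATH, and what it really expands to
which mpicc
mpicc -showme   # Open MPI's wrapper prints the underlying compile/link line;
                # LAM's wrapper has an equivalent option
# rebuild GROMACS from scratch so no objects from the other MPI remain
make distclean
./configure --enable-mpi --program-suffix=_mpi --disable-ia32-3dnow
make
---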

As you can see, I can't run any simulations for now. Any ideas? Is this a
GROMACS bug? If so, is there any chance it will be fixed soon?

Thanks in advance,
Hadas Leonov.

hleonov at cc.huji.ac.il
Department of Biological Chemistry
Alexander Silberman institute of Life Sciences
The Hebrew University,
Jerusalem, Israel



