[gmx-users] Simulations on Leopard - slow and crashing, cannot compile with lam-mpi
Hadas Leonov
hleonov at cc.huji.ac.il
Wed Nov 21 11:27:54 CET 2007
One more thing I forgot to add: I am using a gfortran build for
Leopard as my Fortran compiler (http://hpc.sourceforge.net/).
After I defined F77 and FC as environment variables pointing to
/usr/local/bin/gfortran, the configure script detected the compiler.
On Nov 21, 2007, at 12:24 PM, Hadas Leonov wrote:
> Hi everybody,
>
> I have installed GROMACS 3.3.2 on Mac OS X Leopard, but now it is
> running about 3 times slower than it did before, and in addition it
> crashes on simulations that take more than 40 minutes.
>
> For example, I've run a few benchmarks, and here are the results
> for 4 processors on a Mac Pro:
> d.villin:
>   Leopard performance: 13714 ps/day
>   old OS performance:  41143 ps/day
>   gmx-benchmark:       48000 ps/day
>
> d.poly-ch2:
>   Leopard performance: 8640 ps/day
>   old OS performance:  18000 ps/day
>   gmx-benchmark:       20571 ps/day
>
> At first I thought that open-mpi was responsible for the slow speed,
> but even when running mdrun on 1 CPU, the performance of d.villin
> was 3592 ps/day on Leopard, compared with 18106 ps/day on the
> old OS.
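
To put the figures above side by side, here is a minimal Python sketch
that computes the slowdown factors and the 4-CPU speedup on each OS.
The numbers are copied from the message; the names RESULTS, slowdown
and parallel_speedup are made up for this illustration, and the sketch
assumes the 1-CPU and 4-CPU d.villin runs are otherwise comparable.

```python
# Throughput figures (ps/day) copied from the benchmark numbers quoted above.
# Keys are (benchmark system, number of CPUs).
RESULTS = {
    ("d.villin", 4): {"leopard": 13714, "old_os": 41143},
    ("d.poly-ch2", 4): {"leopard": 8640, "old_os": 18000},
    ("d.villin", 1): {"leopard": 3592, "old_os": 18106},
}


def slowdown(entry):
    """Return how many times slower Leopard is than the old OS for one run."""
    return entry["old_os"] / entry["leopard"]


def parallel_speedup(os_name, system="d.villin"):
    """Return 4-CPU throughput divided by 1-CPU throughput on the same OS."""
    return RESULTS[(system, 4)][os_name] / RESULTS[(system, 1)][os_name]


if __name__ == "__main__":
    for (system, ncpu), entry in RESULTS.items():
        print(f"{system:11s} {ncpu} CPU: {slowdown(entry):.2f}x slower on Leopard")
    for os_name in ("leopard", "old_os"):
        print(f"d.villin 4-CPU speedup on {os_name}: {parallel_speedup(os_name):.2f}x")
```

The slowdown comes out at roughly 3x for d.villin and 2x for
d.poly-ch2 on 4 CPUs, and roughly 5x for d.villin on 1 CPU. The
largest gap is in the single-CPU run, which fits the observation that
open-mpi alone does not explain the slow speed.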
>
> As for the crashes: I ran a 0.5 ns position-restraints simulation,
> which usually takes 2 hours on 2 CPUs. The estimated run time was 6
> hours, but it crashed after 40 minutes with the following errors:
> ---
> step 23070, will finish at Tue Nov 20 23:18:28 2007
> [tmdec2:69924] *** Process received signal ***
> [tmdec2:69924] Signal: Segmentation fault (11)
> [tmdec2:69924] Signal code: Address not mapped (1)
> [tmdec2:69924] Failing at address: 0x49c78d52
> [tmdec2:69925] *** Process received signal ***
> [tmdec2:69925] Signal: Segmentation fault (11)
> [tmdec2:69925] Signal code: Address not mapped (1)
> [tmdec2:69925] Failing at address: 0x49aeac55
> [tmdec2:69926] *** Process received signal ***
> [tmdec2:69926] Signal: Segmentation fault (11)
> [tmdec2:69926] Signal code: Address not mapped (1)
> [tmdec2:69926] Failing at address: 0x48c74d8c
> [tmdec2:69927] *** Process received signal ***
> [tmdec2:69927] Signal: Segmentation fault (11)
> [tmdec2:69927] Signal code: Address not mapped (1)
> [tmdec2:69927] Failing at address: 0x49e5e700
> [ 1] [0xbfffd678, 0x49aeac55] (-P-)
> [ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
> [ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
> [ 4] [ 1] [0xbfffd678, 0x48c74d8c] (-P-)
> [ 2] [ 1] [0xbfffd678, 0x49c78d52] (-P-)
> [ 2] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
> [ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
> [ 6] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
> [ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
> [ 4] [ 1] [0xbfffd678, 0x49e5e700] (-P-)
> [ 2] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
> [ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
> [ 4] (ompi_ddt_copy_content_same_ddt + 0x7d) [0xbfffd6e8, 0x006f562d]
> [ 3] (ompi_ddt_sndrcv + 0x3bf) [0xbfffd748, 0x006fbebf]
> [ 4] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
> [ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
> [ 6] (mca_coll_basic_alltoallv_intra + 0x28b) [0xbfffd7c8, 0x00a3a65b]
> [ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
> [ 6] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
> [ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
> [ 8] (force + 0x7d9) (mca_coll_basic_alltoallv_intra + 0x28b)
> [0xbfffd7c8, 0x00a3a65b]
> [ 5] (MPI_Alltoallv + 0x20a) [0xbfffd858, 0x0070056a]
> [ 6] [0xbfffdc88, 0x0002ee56]
> [ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
> [10] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
> [ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
> [ 8] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
> [ 7] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
> [11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
> [12] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
> [ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
> [10] (pmeredist + 0x4e2) [0xbfffd8d8, 0x0004836e]
> [ 7] (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
> [ 8] (force + 0x7d9) (do_pme + 0x494) [0xbfffda38, 0x0004d62b]
> [ 8] (force + 0x7d9) [0xbfffdc88, 0x0002ee56]
> [ 9] (do_force + 0x87a) (main + 0x463) [0xbfffeb98, 0x00018c69]
> [13] (start + 0x36) [0xbfffebbc, 0x0000216e]
> [14] [0x00000000, 0x0000000e] (FP-)
> [tmdec2:69925] *** End of error message ***
> (do_md + 0x164f) [0xbfffe988, 0x0001666e]
> [11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
> [12] [0xbfffdc88, 0x0002ee56]
> [ 9] (do_force + 0x87a) [0xbfffdd78, 0x0005d652]
> [10] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
> [11] [0xbfffdd78, 0x0005d652]
> [10] (do_md + 0x164f) [0xbfffe988, 0x0001666e]
> [11] (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
> [12] (main + 0x463) [0xbfffeb98, 0x00018c69]
> [13] (start + 0x36) [0xbfffebbc, 0x0000216e]
> [14] [0x00000000, 0x0000000e] (FP-)
> [tmdec2:69926] *** End of error message ***
> (mdrunner + 0xb04) [0xbfffeb08, 0x00014abe]
> [12] (main + 0x463) [0xbfffeb98, 0x00018c69]
> [13] (main + 0x463) [0xbfffeb98, 0x00018c69]
> [13] (start + 0x36) [0xbfffebbc, 0x0000216e]
> [14] [0x00000000, 0x0000000e] (FP-)
> [tmdec2:69924] *** End of error message ***
> (start + 0x36) [0xbfffebbc, 0x0000216e]
> [14] [0x00000000, 0x0000000e] (FP-)
> [tmdec2:69927] *** End of error message ***
> [tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in
> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/
> pls_base_orted_cmds.c at line 275
> [tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in
> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/
> pls_rsh_module.c at line 1164
> [tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in
> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/errmgr/hnp/
> errmgr_hnp.c at line 90
> mpirun noticed that job rank 1 with PID 69925 on node
> tmdec2.ls.huji.ac.il exited on signal 11 (Segmentation fault).
> [tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in
> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/base/
> pls_base_orted_cmds.c at line 188
> [tmdec2.ls.huji.ac.il:69921] [0,0,0] ORTE_ERROR_LOG: Timeout in
> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/pls/rsh/
> pls_rsh_module.c at line 1196
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
>
> --------------------------------------------------------------------------
> 1 additional process aborted (not shown)
> ---
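
The trace above is the interleaved output of four processes, which
makes it hard to read. The lines that carry a [host:pid] prefix can
still be grouped per process. Below is a minimal Python sketch of
that; the function name summarize_signals and the choice of fields are
assumptions made for illustration, not part of GROMACS or Open MPI.
The stack frames themselves carry no PID, so this sketch does not try
to assign them to a process.

```python
import re
from collections import defaultdict

# Matches lines such as "[tmdec2:69924] Signal: Segmentation fault (11)".
LINE_RE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+(?P<key>[^:*]+):\s+(?P<value>.+)"
)

# Only these per-process fields are collected; everything else is skipped.
WANTED_KEYS = {"Signal", "Signal code", "Failing at address"}


def summarize_signals(log_text):
    """Group the signal report lines by PID, ignoring mail-quote prefixes."""
    per_pid = defaultdict(dict)
    for raw in log_text.splitlines():
        line = raw.lstrip("> ").strip()  # drop the "> " quoting, if present
        match = LINE_RE.match(line)
        if not match:
            continue
        key = match.group("key").strip()
        if key in WANTED_KEYS:
            per_pid[int(match.group("pid"))][key] = match.group("value").strip()
    return dict(per_pid)


if __name__ == "__main__":
    sample = """\
> [tmdec2:69924] *** Process received signal ***
> [tmdec2:69924] Signal: Segmentation fault (11)
> [tmdec2:69924] Signal code: Address not mapped (1)
> [tmdec2:69924] Failing at address: 0x49c78d52
"""
    for pid, fields in summarize_signals(sample).items():
        print(pid, fields)
```

Run on the full log, this should list PIDs 69924 to 69927, each with
the same signal and a different failing address. The frames that can
be read in the trace show MPI_Alltoallv called from pmeredist
inside do_pme.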
>
> It looks like GROMACS has trouble with open-mpi.
> Before installing Leopard I was using lam-mpi, but I can't use it
> now: the compilation only worked after I installed open-mpi and used
> the ia32 disable flags in the configure script
> (--disable-ia32-3dnow --disable-ia32-3dnow; see my previous post
> regarding that error).
>
> When I tried to compile GROMACS with lam-mpi installed, this is the
> 'make' error I got:
> ---
> mpicc -I/sw/include -framework Accelerate -o grompp topio.o
> toppush.o topcat.o topshake.o convparm.o tomorse.o sorting.o
> splitter.o vsite_parm.o readir.o add_par.o topexcl.o toputil.o
> topdirs.o grompp.o compute_io.o -L/sw/lib ../mdlib/.libs/
> libmd_mpi.a -L/usr/X11/lib ../gmxlib/.libs/libgmx_mpi.a /usr/local/
> lib/libfftw3f.a -lm /sw/lib/libXm.dylib /usr/X11/lib/libXt.
> 6.0.0.dylib /usr/X11/lib/libSM.6.0.0.dylib /usr/X11/lib/libICE.
> 6.3.0.dylib /usr/X11/lib/libXp.6.2.0.dylib /usr/X11/lib/libXext.
> 6.4.0.dylib /usr/X11/lib/libX11.6.2.0.dylib /usr/X11/lib/libXau.
> 6.0.0.dylib /usr/X11/lib/libXdmcp.6.0.0.dylib
> Undefined symbols:
> "_lam_mpi_byte", referenced from:
> _lam_mpi_byte$non_lazy_ptr in libgmx_mpi.a(network.o)
> "_lam_mpi_float", referenced from:
> _lam_mpi_float$non_lazy_ptr in libgmx_mpi.a(network.o)
> "_lam_mpi_comm_world", referenced from:
> _lam_mpi_comm_world$non_lazy_ptr in libgmx_mpi.a(network.o)
> ld: symbol(s) not found
> collect2: ld returned 1 exit status
> make[3]: *** [grompp] Error 1
> make[2]: *** [all-recursive] Error 1
> make[1]: *** [all] Error 2
> make: *** [all-recursive] Error 1
> ---
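
All of the undefined symbols in the output above start with _lam_mpi_
and are referenced from network.o in libgmx_mpi.a. One possible
reading, which is an assumption and not confirmed anywhere in this
thread, is that the objects were compiled against the LAM headers
while the mpicc that did the link did not pull in the LAM library, for
example because a different MPI's mpicc comes first in the PATH. The
Python sketch below only extracts the symbol names from the linker
output and checks that prefix; the function names are made up for
illustration.

```python
import re

# Matches the quoted symbol names that ld prints, for example:
#   "_lam_mpi_byte", referenced from:
UNDEFINED_RE = re.compile(r'"(?P<symbol>[^"]+)", referenced from:')


def undefined_symbols(linker_output):
    """Return the undefined symbol names in the order ld reported them."""
    return [m.group("symbol") for m in UNDEFINED_RE.finditer(linker_output)]


def all_from_lam(symbols):
    """True if every undefined symbol carries the _lam_mpi_ prefix."""
    return bool(symbols) and all(s.startswith("_lam_mpi_") for s in symbols)


if __name__ == "__main__":
    sample = """\
Undefined symbols:
"_lam_mpi_byte", referenced from:
_lam_mpi_byte$non_lazy_ptr in libgmx_mpi.a(network.o)
"_lam_mpi_float", referenced from:
_lam_mpi_float$non_lazy_ptr in libgmx_mpi.a(network.o)
"""
    symbols = undefined_symbols(sample)
    print(symbols, all_from_lam(symbols))
```

If the check is true, comparing the output of "which mpicc" with the
lam-mpi installation path would be a reasonable next step.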
>
> As you can see, I can't run any simulations for now. Any ideas?
> Is this a GROMACS bug? If so, is there any chance it will be fixed soon?
>
> Thanks in advance,
> Hadas Leonov.
>
> hleonov at cc.huji.ac.il
> Department of Biological Chemistry
> Alexander Silberman institute of Life Sciences
> The Hebrew University,
> Jerusalem, Israel
>