[gmx-users] Simulation runs on iMac but explodes on cluster

Wed Jul 13 10:06:59 CEST 2011

Hi Luke,

Could you give us all the necessary information about your system, so we can figure out where the problem could be?

What kind of compounds are you simulating?
What size of box are you using?
Do you run on multiple threads when you run it on your iMac?
How many CPUs are you using on the cluster?

Cheers,
Emanuel

=========================================================
Emanuel Birru
PhD Candidate

Faculty of Pharmacy and Pharmaceutical Sciences
Monash University (Parkville Campus)
381 Royal Parade, Parkville
Victoria 3052, Australia

Tel: Int + 61 3 9903 9187
E-mail: emanuel.birru at monash.edu
www.pharm.monash.edu.au

From: gmx-users-bounces at gromacs.org [mailto:gmx-users-bounces at gromacs.org] On Behalf Of Luke Goodsell
Sent: Wednesday, 13 July 2011 5:36 PM
To: GROMACS Users mailinglist
Subject: [gmx-users] Simulation runs on iMac but explodes on cluster

Hi,

As the subject suggests, I have a simulation that runs correctly on my iMac, but fails when I try to run it on a cluster, and I am hoping someone may be able to suggest which things to try first to resolve the issue.

Background:
The simulation proceeds perfectly well on the iMac (OS X 10.5) without error/warning. On the cluster, it begins producing multiple LINCS warnings at step 14555 (of 7500000) and then segfaults after step 14556 with:

[node-005:13244] *** Process received signal ***
[node-005:13244] Signal: Segmentation fault (11)
[node-005:13244] Signal code: Address not mapped (1)
[node-005:13244] Failing at address: 0x2aaab1380520
[node-005:13244] [ 0] /lib64/libpthread.so.0 [0x2aaaac402b10]
[node-005:13244] [ 1] mdrun_mpi(nb_kernel410_x86_64_sse+0xa65) [0x947e25]
[node-005:13244] [ 2] mdrun_mpi(do_nonbonded+0x780) [0x8ce890]
[node-005:13244] [ 3] mdrun_mpi(do_force_lowlevel+0x308) [0x6842b8]
[node-005:13244] [ 4] mdrun_mpi(do_force+0xc59) [0x6f7c19]
[node-005:13244] [ 5] mdrun_mpi(do_md+0x5785) [0x626f75]
[node-005:13244] [ 6] mdrun_mpi(mdrunner+0xa07) [0x61e8a7]
[node-005:13244] [ 7] mdrun_mpi(main+0x1363) [0x62c5f3]
[node-005:13244] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaac62d994]
[node-005:13244] [ 9] mdrun_mpi(__gxx_personality_v0+0x479) [0x44b659]
[node-005:13244] *** End of error message ***

Things I have tried:
* Both MPI and non-MPI versions on cluster (same result)
* Harmonising FFTW - configured and compiled fftw3 from the same source with the same configuration, and ensured the correct library was included during the configure step
* Checking the Reproducibility documentation
* Searching the archives - I didn't find anything that described a similar problem.

Things I think may be involved:
* Different architectures - i686 vs x86_64 - don't know how to test for this
* Different BLAS/LAPACK libraries - I believe GROMACS uses vecLib on OS X; maybe I could compile without external BLAS/LAPACK and see if this makes a difference
* Some other unknown problem
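
On the architecture point, a first step could be to record what each machine actually reports, so the two environments can be compared side by side. The following is only a minimal sketch: the function name describe_machine is made up for illustration, and it describes the host and the Python interpreter it runs under, not how mdrun itself was compiled.

```python
import platform
import struct
import sys


def describe_machine():
    """Collect basic architecture facts about the host running this script."""
    return {
        # Hardware name reported by the OS, e.g. "i386" or "x86_64"
        "machine": platform.machine(),
        "system": platform.system(),
        "release": platform.release(),
        # Pointer width of this Python build (not of the mdrun binary)
        "pointer_bits": struct.calcsize("P") * 8,
        "byteorder": sys.byteorder,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_machine().items()):
        print(f"{key}: {value}")
```

Running this on both the iMac and a cluster node and comparing the output would show whether the hosts really differ in word size. To check the simulation binary itself, running the standard `file` utility on the mdrun executable reports whether it is a 32-bit or 64-bit build.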

I've now spent more than 2 weeks trying to diagnose this problem and don't seem to be making progress. Could anyone suggest the most likely cause of this significant difference in output, and what I could do to test/fix it?

Any help is greatly appreciated.

Luke
