[gmx-users] possible problem with gpc-f106n004
Christopher Neale
chris.neale at mail.utoronto.ca
Sun Oct 21 14:36:45 CEST 2012
Dear SciNet:
I have resubmitted, just letting you know.
Thank you,
Chris.
starting mdrun 'title'
1000000000 steps, 2000000.0 ps (continuing from step 101920720, 203841.4 ps).
[[3105,1],27][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 10982784 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],25][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 10982016 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 11032832 opcode 32767 vendor error 129 qp_idx 2
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 20). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: gpc-f106n004
Local device: mlx4_0
Peer host: gpc-f106n003-ib0
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
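(For anyone reading this thread later: the timeout formula quoted in the help text above can be checked with a short shell sketch. `timeout_seconds` is a hypothetical helper for illustration, not part of Open MPI.)

```shell
# Effective local ACK timeout per retry, using the formula quoted in the
# Open MPI help message: 4.096 microseconds * (2^btl_openib_ib_timeout).
timeout_seconds() {
    awk -v t="$1" 'BEGIN { printf "%.2f\n", 4.096e-6 * 2^t }'
}

timeout_seconds 20   # Open MPI default: ~4.29 s per retry attempt
timeout_seconds 24   # a raised value: ~68.72 s per retry attempt
```

With the default retry count of 7, raising `btl_openib_ib_timeout` (e.g. `mpirun --mca btl_openib_ib_timeout 24 ...`) only papers over transient fabric glitches by waiting longer per retry; it does not fix a genuinely faulty link or host, which is what the help text suggests is happening here.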
--------------------------------------------------------------------------
mpirun has exited due to process rank 28 with PID 6738 on
node gpc-f106n004-ib0 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[[3105,1],5][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n001 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 12949632 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],7][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n001 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 11033216 opcode 32767 vendor error 129 qp_idx 2
[gpc-f106n001:06928] 4 more processes have sent help message help-mpi-btl-openib.txt / pp retry exceeded
[gpc-f106n001:06928] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpc-f106n001:06928] [[3105,0],0]-[[3105,0],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
=>> PBS: job killed: node 2 (gpc-f106n003-ib0) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
Terminated
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate