[gmx-users] possible problem with gpc-f106n004
Christopher Neale
chris.neale at mail.utoronto.ca
Sun Oct 21 14:36:45 CEST 2012
Dear SciNet:
I have resubmitted, just letting you know.
Thank you,
Chris.
starting mdrun 'title'
1000000000 steps, 2000000.0 ps (continuing from step 101920720, 203841.4 ps).
[[3105,1],27][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 10982784 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],25][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 10982016 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n004 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 11032832 opcode 32767 vendor error 129 qp_idx 2
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 20). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: gpc-f106n004
Local device: mlx4_0
Peer host: gpc-f106n003-ib0
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
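(For anyone reading this thread later: the timeout formula quoted in the help text above can be checked with a short shell sketch. `timeout_seconds` is a hypothetical helper for illustration, not part of Open MPI.)

```shell
# Effective local ACK timeout per retry, using the formula quoted in the
# Open MPI help message: 4.096 microseconds * (2^btl_openib_ib_timeout).
timeout_seconds() {
    awk -v t="$1" 'BEGIN { printf "%.2f\n", 4.096e-6 * 2^t }'
}

timeout_seconds 20   # Open MPI default: ~4.29 s per retry attempt
timeout_seconds 24   # a raised value: ~68.72 s per retry attempt
```

With the default retry count of 7, raising `btl_openib_ib_timeout` (e.g. `mpirun --mca btl_openib_ib_timeout 24 ...`) only papers over transient fabric glitches by waiting longer per retry; it does not fix a genuinely faulty link or host, which is what the help text suggests is happening here.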
--------------------------------------------------------------------------
mpirun has exited due to process rank 28 with PID 6738 on
node gpc-f106n004-ib0 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[[3105,1],5][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n001 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 12949632 opcode 32767 vendor error 129 qp_idx 0
[[3105,1],7][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc] from gpc-f106n001 to: gpc-f106n003-ib0 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 11033216 opcode 32767 vendor error 129 qp_idx 2
[gpc-f106n001:06928] 4 more processes have sent help message help-mpi-btl-openib.txt / pp retry exceeded
[gpc-f106n001:06928] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpc-f106n001:06928] [[3105,0],0]-[[3105,0],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
=>> PBS: job killed: node 2 (gpc-f106n003-ib0) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
Terminated
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate