(XNIO-265) Accept thread blocks forever on java.nio.channels.SelectableChannel.register - Red Hat Issue Tracker
Description
I'm running into this issue in a small Spring Boot / Undertow service that handles WebSocket and plain old HTTP POST requests. After running fine for about a week, it eventually got the XNIO accept thread blocked on AbstractSelectableChannel.register (stack trace for the stuck thread below), and it has remained stuck since.
I hit what I believe was the same issue a couple of weeks ago (but have no stack traces from that time), so this time around the application is running with the YourKit agent active, and I have a memory snapshot of the stuck state (the thread has been blocked for 24h+ at this point).
"XNIO-2 Accept" #28 prio=5 os_prio=0 tid=0x00007fd2737de000 nid=0x7e0f waiting for monitor entry [0x00007fd262999000]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.ch.SelectorImpl.register(SelectorImpl.java:131)
Issue Links
is incorporated by
JBEAP-17711 [GSS](7.2.z) Introduce alternative queued acceptor to fix XNIO-258 XNIO-286 XNIO-335 XNIO-265 VERIFIED
JBEAP-17712 [GSS](7.1.z) Introduce alternative queued acceptor to fix XNIO-258 XNIO-286 XNIO-335 XNIO-265 CLOSED
Activity
If this isn't the correct place to raise these types of issues, please feel free to point me towards the right one. Thanks.
Karsten,
We are facing exactly the same issue. Our stack traces match yours to a tee. I suspect that adding a timeout on the select will allow the registration to proceed. Did you figure out any workarounds?
Looking at the WorkerThread.java code, I believe a race condition exists that can cause this lockup. If the chosen I/O thread has just polled a null task and is on its way to select(), the selector.wakeup() in registerChannel will be a no-op. The I/O thread can then call select() before registerChannel calls channel.register, causing the lockup inside Sun's code (a sync on some collection).
Not easy to reproduce, but it can happen. With a timeout on the selects in the I/O threads, I believe the problem will not happen, as the timeout will cause the I/O thread to pick up the sync task safely once it expires.
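Roughly, the loop shape and the proposed mitigation look like this (a sketch with illustrative names, not XNIO's actual WorkerThread code):

    import java.io.IOException;
    import java.nio.channels.Selector;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch of the I/O-loop timing discussed above. A registering thread
    // enqueues a task and calls selector.wakeup(); if that wakeup is lost in
    // the window described, a select() with no timeout can block before the
    // task is seen. select(timeout) guarantees the queue is re-polled.
    final class LoopSketch implements Runnable {
        private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
        private final Selector selector;

        LoopSketch(Selector selector) { this.selector = selector; }

        void execute(Runnable task) {
            queue.add(task);
            selector.wakeup(); // intended to interrupt a blocked select()
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Runnable task;
                    while ((task = queue.poll()) != null) {
                        task.run(); // e.g. a registration SynchTask
                    }
                    // A bare selector.select() here is the risky variant; the
                    // 500 ms timeout bounds how long a lost wakeup can stall a
                    // queued registration.
                    selector.select(500L);
                    selector.selectedKeys().clear();
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }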
We have finally been able to get this to happen on a non-client-facing server. Below is what we found; it is very close to what Karsten was seeing. Using Groovy, we were able to dump the state of the WorkerThread.java instances.
All of the WorkerThreads except one had a selectorWorkQueue with a size of 16. The bad worker thread had an internal array size of one billion, yet the queue only contained 73 entries in our case. This is a non-client-facing box, so we are a bit surprised by the huge size of the array.
Dumping the contents of the queue shows the head and tail pointing to nulls. Entries follow:
Using Groovy and calling clear() on the queue fixed the problem.
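For reference, a rough plain-Java equivalent of what we did from the Groovy console (a hypothetical helper; it assumes a live reference to the stuck thread and uses the selectorWorkQueue field named above):

    import java.lang.reflect.Field;
    import java.util.Queue;

    final class QueueRescue {
        // Reflectively grabs the worker's task queue and clears it, which is
        // what un-stuck our thread. Field name taken from the dump above.
        static void clearStuckQueue(Thread workerThread) throws Exception {
            Field f = workerThread.getClass().getDeclaredField("selectorWorkQueue");
            f.setAccessible(true);
            Queue<?> queue = (Queue<?>) f.get(workerThread);
            System.out.println("queued tasks before clear: " + queue.size());
            queue.clear(); // drops the corrupted state so the loop can make progress
        }
    }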
How did the queue get into such a state? Still looking. The WorkerThread code seems to lock around the collection in all places except for two peeks.
Looking at ArrayDeque: the tail member is mutated before doubleCapacity throws the exception (integer wrap-around). Undertow catches the exception but logs it at debug level, so we do not see the error in the logs.
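To make the failure mode concrete, here is the relevant JDK 8 ArrayDeque logic, paraphrased (a trimmed sketch, not the actual JDK source). Note that ArrayDeque's capacity is always a power of two and tops out at 2^30, about 1.07 billion slots, which matches the huge array above:

    // Paraphrased from JDK 8 java.util.ArrayDeque, trimmed to the lines that
    // matter for the corruption described above.
    final class ArrayDequeSketch<E> {
        private Object[] elements = new Object[16];
        private int head;
        private int tail;

        public void addLast(E e) {
            if (e == null) throw new NullPointerException();
            elements[tail] = e;
            // tail is advanced BEFORE doubleCapacity() can throw...
            if ((tail = (tail + 1) & (elements.length - 1)) == head)
                doubleCapacity();
        }

        private void doubleCapacity() {
            int n = elements.length;
            int newCapacity = n << 1; // 2^30 << 1 wraps negative
            if (newCapacity < 0)
                throw new IllegalStateException("Sorry, deque too big");
            // ...so the throw leaves tail already advanced and the indices no
            // longer agree with the stored entries. In the corrupted state
            // observed above, head ends up on a null slot.
            Object[] a = new Object[newCapacity];
            int r = n - head; // elements to the right of head
            System.arraycopy(elements, head, a, 0, r);
            System.arraycopy(elements, 0, a, r, head);
            elements = a;
            head = 0;
            tail = n;
        }

        @SuppressWarnings("unchecked")
        public E poll() {
            E result = (E) elements[head];
            if (result == null)
                return null; // a null at head reads as "empty" -- the stuck state
            elements[head] = null;
            head = (head + 1) & (elements.length - 1);
            return result;
        }
    }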
To prove that head does become zero, we copied our own version of ArrayDeque and hard-coded the maximum to be 16 elements. We were able to get the head to return null, which explains why processing on the thread stopped until we cleared the queue with Groovy.
As to why the register failed, it is now easy to explain: the SynchTask never ran (no park was called) as head returned null. The worker thread went back into select() before channel.register was called, causing the lockup.
https://github.com/xnio/xnio/pull/126
David,
It will still occur, as the ArrayDeque is getting corrupted by the mutation of the tail field in the addLast() method, in which doubleCapacity() throws the exception (integer roll-over in our case, as we had a billion entries and the ArrayDeque code was trying to double that).
Undertow is the real culprit, as it inserts hundreds of millions of tasks due to an SSL conduit bug. They did fix that issue. Once the queue is corrupted, poll() will always return null, which causes accept to hang.
Don't know how much you can do about this, as your code does not (and should not) expect the queue to be corrupted due to memory limits triggered by a buggy client.
We were only able to consistently reproduce this by using a set of penetration tests. Pure luck. Will get you the name of the test suite if you're interested.
I encountered a similar problem in a WildFly 10.1.0 client application that uses the EJB Client API to invoke some EJBs remotely. At some point the client blocks, and the following XNIO thread causes high CPU usage:
"Remoting "config-based-ejb-client-endpoint" I/O-1" #6025 daemon prio=5 os_prio=0 tid=0x00007f2f709d3000 nid=0x30e0 runnable
[0x00007f2e7f9be000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
All the details, including how to consistently reproduce this are described under https://issues.jboss.org/browse/WFLY-9364.
Applying the fix from https://github.com/xnio/xnio/pull/126 does not fix the issue.
mariustant it doesn't look like this is the same problem; I think this is likely an already-fixed bug, XNIO-244, or a close cousin of it. You would have to check the server logs to verify the XNIO version you are using.
People
Assignee: David Lloyd
Reporter: Karsten Sperling (Inactive)
Votes: 0
Watchers: 5
Dates
Created: 2016/03/04 7:41 PM
Updated: 2019/10/02 7:54 PM
Resolved: 2019/10/02 7:54 PM