
XNIO / XNIO-265
Accept thread blocks forever on java.nio.channels.SelectableChannel.register

Details

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 3.5.7.Final, 3.6.7.Final, 3.7.4.Final, 3.8.0.Final
Affects Version/s: 3.3.4.Final
Component/s: nio-impl
Labels: None
Steps to Reproduce: Unknown

Description
I'm running into this issue in a small Spring Boot / Undertow service that handles WebSocket and plain HTTP POST requests.

After running fine for about a week, the XNIO accept thread became blocked in AbstractSelectableChannel.register (stack trace for the stuck thread below) and has remained stuck since.

I hit what I believe was the same issue a couple of weeks ago (but have no stack traces from that time), so this time the application is running with the YourKit agent active, and I have a memory snapshot of the stuck state (the thread has been blocked for 24h+ at this point).

"XNIO-2 Accept" #28 prio=5 os_prio=0 tid=0x00007fd2737de000 nid=0x7e0f waiting for monitor entry [0x00007fd262999000]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.ch.SelectorImpl.register(SelectorImpl.java:131)

waiting to lock <0x00000005cd8325d8> (a java.util.Collections$UnmodifiableSet)


at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:212)
locked <0x000000066e197560> (a java.lang.Object)
locked <0x000000066e197550> (a java.lang.Object)
at java.nio.channels.SelectableChannel.register(SelectableChannel.java:280)
at org.xnio.nio.WorkerThread.registerChannel(WorkerThread.java:696)
at org.xnio.nio.QueuedNioTcpServer.handleReady(QueuedNioTcpServer.java:465)
at org.xnio.nio.QueuedNioTcpServerHandle.handleReady(QueuedNioTcpServerHandle.java:38)
at org.xnio.nio.WorkerThread.run(WorkerThread.java:559)

The lock is held by another XNIO worker thread:

"XNIO-2 I/O-7" #26 prio=5 os_prio=0 tid=0x00007fd27126c000 nid=0x7e0d runnable [0x00007fd262b9b000]


java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.$$YJP$$epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

locked <0x00000005cd8325e8> (a sun.nio.ch.Util$2)


locked <0x00000005cd8325d8> (a java.util.Collections$UnmodifiableSet)
locked <0x00000005cd829370> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.xnio.nio.WorkerThread.run(WorkerThread.java:509)


Attachments

nio-stack.txt (2016/03/04 7:59 PM, 72 kB)
Screen Shot 2016-03-05 at 3 (2016/03/04 9:12 PM, 239 kB)
Screen Shot 2016-03-05 at 3 (2016/03/04 9:12 PM, 279 kB)

Issue Links

is incorporated by
JBEAP-17711 [GSS](7.2.z) Introduce alternative queued acceptor to fix XNIO-258 XNIO-286 XNIO-335 XNIO-265 VERIFIED

JBEAP-17712 [GSS](7.1.z) Introduce alternative queued acceptor to fix XNIO-258 XNIO-286 XNIO-335 XNIO-265 CLOSED

Activity


 Karsten Sperling (Inactive) added a comment - 2016/03/04 7:59 PM

Full thread dump

 Karsten Sperling (Inactive) added a comment - 2016/03/04 9:40 PM

Some screenshots from the memory dump, and observations:

- The SynchTask for the blocked registerChannel() is in the selectorWorkQueue of I/O-7 (see the sketch after this list)
- The SynchTask is the last task in the queue, at array offset 134217757 (tail == 134217758)
- The SynchTask has done == false, as expected
- The ArrayDeque has a gigantic backing array of 200+M entries, even though the queue only holds 22 elements at this point in time
- The delayQueue is empty
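
For context, a minimal sketch of the SynchTask handoff these observations refer to (illustrative only, not the actual XNIO source; queueTask, selector, and channel in the comments are assumed names):

import java.util.concurrent.locks.LockSupport;

// The registering thread queues a task like this to hold the selector thread
// out of select() until the registration has been performed.
final class SynchTaskSketch implements Runnable {
    volatile boolean done;

    @Override
    public void run() {
        // Runs on the selector thread: wait until the registering thread
        // signals completion, keeping the selector out of select() meanwhile.
        while (!done) {
            LockSupport.parkNanos(1_000_000L);
        }
    }

    void finished() {
        done = true;
    }
}

// Registering thread (simplified):
//   SynchTaskSketch task = new SynchTaskSketch();
//   queueTask(task);              // hand it to the selector thread's work queue
//   selector.wakeup();            // kick the selector out of select()
//   try {
//       key = channel.register(selector, 0);   // safe: selector is not selecting
//   } finally {
//       task.finished();          // let the selector thread resume
//   }
// If the queued task is never executed (done stays false, as observed above),
// the selector thread re-enters select() while holding its key-set lock, and
// register() blocks -- the state captured in the stack traces above.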

 Karsten Sperling (Inactive) added a comment - 2016/03/15 8:30 PM

If this isn't the correct place to raise these types of issues, please feel free to point me towards the correct place. Thanks.

 Tarek Hammoud (Inactive) added a comment - 2017/05/03 5:24 PM

Karsten,

We are facing exactly the same issue. Our stack traces match yours to a tee. I suspect that adding a timeout on the select will allow the registration to proceed. Did you figure out any workarounds?

 Tarek Hammoud (Inactive) added a comment - 2017/05/04 2:53 PM

Looking at the WorkerThread.java code, I believe a race condition exists that can cause this lockup. If the chosen I/O thread has just polled a null task and is on its way to select(), the selector.wakeup() in registerChannel will be a no-op. The I/O thread can then call select() before registerChannel calls channel.register, causing the lockup inside Sun's code (synchronization on a key-set collection).

Not easy to reproduce, but it can happen. With a timeout on the selects in the I/O threads, I believe the problem would not occur, as the timeout would cause the I/O thread to pick up the synch task safely.
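
A minimal sketch of the suggested mitigation (assumed names, not the actual XNIO code): bounding the select so the I/O loop re-checks its work queue periodically.

import java.io.IOException;
import java.nio.channels.Selector;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class BoundedSelectLoop implements Runnable {
    private static final long SELECT_TIMEOUT_MS = 100; // illustrative value

    private final Selector selector;
    private final Queue<Runnable> workQueue = new ConcurrentLinkedQueue<>();

    BoundedSelectLoop(Selector selector) {
        this.selector = selector;
    }

    // Called from other threads, e.g. to hand over a registration task.
    void execute(Runnable task) {
        workQueue.add(task);
        selector.wakeup(); // even if this race is lost, the timeout below bounds the stall
    }

    @Override
    public void run() {
        try {
            for (;;) {
                Runnable task;
                while ((task = workQueue.poll()) != null) {
                    task.run(); // e.g. a SynchTask / channel registration
                }
                // A plain select() could block indefinitely if the wakeup is
                // missed; a timed select guarantees the queue is re-checked.
                selector.select(SELECT_TIMEOUT_MS);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}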

 Tarek Hammoud (Inactive) added a comment - 2017/07/24 11:06 AM

We have finally been able to get this to happen on a non-client-facing server. Below is what we found; it is very close to what Karsten was seeing. Using Groovy, we were able to dump the state of the WorkerThread.java instances.

All of the WorkerThreads except one had a selectorWorkQueue of size 16. The bad worker thread had an internal array of over a billion entries, yet the queue only contained 73 entries in our case. This is a non-client-facing box, so we are a bit surprised by the huge size of the array.
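
A dump like the one below can be produced with a few lines of reflection; this is a hedged Java sketch (the field names match JDK 8's ArrayDeque, and DequeDump is an illustrative helper, not part of XNIO):

import java.lang.reflect.Field;
import java.util.ArrayDeque;

final class DequeDump {
    // Prints the backing array length, head, tail, and the live slots of an
    // ArrayDeque, analogous to the log output below.
    static void dump(ArrayDeque<?> q) throws ReflectiveOperationException {
        Field elementsField = ArrayDeque.class.getDeclaredField("elements");
        Field headField = ArrayDeque.class.getDeclaredField("head");
        Field tailField = ArrayDeque.class.getDeclaredField("tail");
        elementsField.setAccessible(true);
        headField.setAccessible(true);
        tailField.setAccessible(true);

        Object[] elements = (Object[]) elementsField.get(q);
        int head = headField.getInt(q);
        int tail = tailField.getInt(q);
        System.out.println("elements.length = " + elements.length);
        System.out.println("head = " + head);
        System.out.println("tail = " + tail);
        for (int i = head; i != tail; i = (i + 1) & (elements.length - 1)) {
            System.out.println(i + ") " + elements[i]);
        }
    }
}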

Dumping the contents of the queue shows head and tail both pointing at nulls. Entries follow:

09:46:09,818 INFO [stdout] Thread: default I/O-2


09:46:09,818 INFO [stdout] selectorWorkQueue = 73
09:46:09,819 INFO [stdout] delayWorkQueue = 0
09:46:09,819 INFO [stdout] ----------------------------------------
09:46:09,819 INFO [stdout] ArrayDeque Debug
09:46:09,820 INFO [stdout] elements.length = 1073741824
09:46:09,820 INFO [stdout] head = 536870910
09:46:09,821 INFO [stdout] tail = 536870983
09:46:09,822 INFO [stdout] 536870910) null
09:46:09,823 INFO [stdout] 536870911) org.xnio.nio.WorkerThread$SynchTask@158ac684
09:46:09,823 INFO [stdout] 536870912) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,824 INFO [stdout] 536870913) org.xnio.nio.WorkerThread$SynchTask@6fff5579
09:46:09,825 INFO [stdout] 536870914) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,825 INFO [stdout] 536870915) org.xnio.nio.WorkerThread$SynchTask@6073ebbd
09:46:09,826 INFO [stdout] 536870916) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,826 INFO [stdout] 536870917) org.xnio.nio.WorkerThread$SynchTask@238eafd1
09:46:09,827 INFO [stdout] 536870918) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,828 INFO [stdout] 536870919) org.xnio.nio.WorkerThread$SynchTask@485802b2
09:46:09,828 INFO [stdout] 536870920) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,829 INFO [stdout] 536870921) org.xnio.nio.WorkerThread$SynchTask@5d99d25d
09:46:09,829 INFO [stdout] 536870922) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,830 INFO [stdout] 536870923) org.xnio.nio.WorkerThread$SynchTask@22c8f450
09:46:09,831 INFO [stdout] 536870924) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,831 INFO [stdout] 536870925) org.xnio.nio.WorkerThread$SynchTask@74ee27ff
09:46:09,832 INFO [stdout] 536870926) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,832 INFO [stdout] 536870927) org.xnio.nio.WorkerThread$SynchTask@7cabe642
09:46:09,833 INFO [stdout] 536870928) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,833 INFO [stdout] 536870929) org.xnio.nio.WorkerThread$SynchTask@6f33c8b0
09:46:09,834 INFO [stdout] 536870930) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,835 INFO [stdout] 536870931) org.xnio.nio.WorkerThread$SynchTask@a3cdb72
09:46:09,835 INFO [stdout] 536870932) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,835 INFO [stdout] 536870933) org.xnio.nio.WorkerThread$SynchTask@f68d415
09:46:09,836 INFO [stdout] 536870934) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,837 INFO [stdout] 536870935) org.xnio.nio.WorkerThread$SynchTask@2c52aa4c
09:46:09,837 INFO [stdout] 536870936) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,838 INFO [stdout] 536870937) org.xnio.nio.WorkerThread$SynchTask@5dd5913e
09:46:09,838 INFO [stdout] 536870938) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,839 INFO [stdout] 536870939) org.xnio.nio.WorkerThread$SynchTask@31de26bd
09:46:09,839 INFO [stdout] 536870940) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,840 INFO [stdout] 536870941) org.xnio.nio.WorkerThread$SynchTask@78df94ef
09:46:09,841 INFO [stdout] 536870942) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,841 INFO [stdout] 536870943) org.xnio.nio.WorkerThread$SynchTask@41664172
09:46:09,841 INFO [stdout] 536870944) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,842 INFO [stdout] 536870945) org.xnio.nio.WorkerThread$SynchTask@304b7401
09:46:09,842 INFO [stdout] 536870946) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,843 INFO [stdout] 536870947) org.xnio.nio.WorkerThread$SynchTask@7060577b
09:46:09,843 INFO [stdout] 536870948) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,844 INFO [stdout] 536870949) org.xnio.nio.WorkerThread$SynchTask@741cdd74
09:46:09,844 INFO [stdout] 536870950) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,845 INFO [stdout] 536870951) org.xnio.nio.WorkerThread$SynchTask@7027f54
09:46:09,845 INFO [stdout] 536870952) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,846 INFO [stdout] 536870953) org.xnio.nio.WorkerThread$SynchTask@6c82f8c9
09:46:09,846 INFO [stdout] 536870954) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,847 INFO [stdout] 536870955) org.xnio.nio.WorkerThread$SynchTask@1eabb1ad
09:46:09,847 INFO [stdout] 536870956) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,848 INFO [stdout] 536870957) org.xnio.nio.WorkerThread$SynchTask@c4e3871
09:46:09,848 INFO [stdout] 536870958) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,848 INFO [stdout] 536870959) org.xnio.nio.WorkerThread$SynchTask@18221e73
09:46:09,849 INFO [stdout] 536870960) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,850 INFO [stdout] 536870961) org.xnio.nio.WorkerThread$SynchTask@639df90e
09:46:09,850 INFO [stdout] 536870962) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,851 INFO [stdout] 536870963) org.xnio.nio.WorkerThread$SynchTask@20f80a09
09:46:09,851 INFO [stdout] 536870964) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,852 INFO [stdout] 536870965) org.xnio.nio.WorkerThread$SynchTask@5d0833d7
09:46:09,852 INFO [stdout] 536870966) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,853 INFO [stdout] 536870967) org.xnio.nio.WorkerThread$SynchTask@55516f83
09:46:09,853 INFO [stdout] 536870968) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,854 INFO [stdout] 536870969) org.xnio.nio.WorkerThread$SynchTask@5988a9ae
09:46:09,854 INFO [stdout] 536870970) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,855 INFO [stdout] 536870971) org.xnio.nio.WorkerThread$SynchTask@39c048d4
09:46:09,855 INFO [stdout] 536870972) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,856 INFO [stdout] 536870973) org.xnio.nio.WorkerThread$SynchTask@254a42b0
09:46:09,856 INFO [stdout] 536870974) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,857 INFO [stdout] 536870975) org.xnio.nio.WorkerThread$SynchTask@7b390ced
09:46:09,858 INFO [stdout] 536870976) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,858 INFO [stdout] 536870977) org.xnio.nio.WorkerThread$SynchTask@67e00ffd
09:46:09,858 INFO [stdout] 536870978) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,859 INFO [stdout] 536870979) org.xnio.nio.WorkerThread$SynchTask@5cdd7d3d
09:46:09,859 INFO [stdout] 536870980) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,860 INFO [stdout] 536870981) org.xnio.nio.WorkerThread$SynchTask@2b75ebd4
09:46:09,861 INFO [stdout] 536870982) org.xnio.nio.QueuedNioTcpServer$1@61db416
09:46:09,861 INFO [stdout] 536870983) null

Using Groovy to call clear() on the queue fixed the problem.

How did the queue get into such a state? Still looking. The WorkerThread code seems to lock around the collection everywhere except for two peeks.

 Tarek Hammoud (Inactive) added a comment - 2017/07/26 9:24 AM - edited


We believe that this issue is caused by Undertow: https://issues.jboss.org/browse/UNDERTOW-659. The bug in Undertow causes the enormous number of tasks to be inserted into the queue.

Looking at ArrayDeque:

public void addLast(E e) {
    if (e == null)
        throw new NullPointerException();
    elements[tail] = e;
    if ( (tail = (tail + 1) & (elements.length - 1)) == head)
        doubleCapacity();
}

The tail member is mutated before doubleCapacity throws the exception (integer wrap-around). Undertow catches the exception but logs it at debug level, so we do not see the error in the logs.

To prove that head can end up pointing at a null slot, we copied our own version of ArrayDeque and hard-coded the maximum capacity to 16 elements. We were able to get the element at head to be null, which explains why processing on the thread stopped until we cleared the queue with Groovy.
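
A tiny stand-in that restates the relevant JDK 8 ArrayDeque logic with an artificial capacity cap, mirroring the experiment described above (CappedDeque and the 16-element limit are illustrative, not the real JDK class):

final class CappedDeque<E> {
    private Object[] elements = new Object[8];
    private int head, tail;

    void addLast(E e) {
        if (e == null) throw new NullPointerException();
        elements[tail] = e;
        // tail is advanced before any attempt to grow, exactly as in JDK 8:
        if ((tail = (tail + 1) & (elements.length - 1)) == head) {
            doubleCapacity(); // if this throws, head == tail is left behind,
                              // which the deque's own size logic reads as empty
        }
    }

    private void doubleCapacity() {
        if (elements.length >= 16) {
            // stand-in for the real overflow / "Sorry, deque too big" failure
            throw new IllegalStateException("capacity limit reached");
        }
        Object[] bigger = new Object[elements.length << 1];
        int right = elements.length - head; // slots from head to end of array
        System.arraycopy(elements, head, bigger, 0, right);
        System.arraycopy(elements, 0, bigger, right, head);
        elements = bigger;
        head = 0;
        tail = elements.length >> 1;
    }

    int size() {
        return (tail - head) & (elements.length - 1);
    }
}

// Filling past the cap shows the broken state:
//   CappedDeque<Integer> d = new CappedDeque<>();
//   try { for (int i = 0; i < 100; i++) d.addLast(i); }
//   catch (IllegalStateException swallowed) { /* the caller only logged at debug */ }
//   // d.size() == 0, even though the backing array is full of live entries.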

As to why the register blocked, it is now easy to explain. The SynchTask never ran (park was never called), because the head of the queue returned null. The worker thread went back into select() before channel.register was called, causing the lockup.
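
The underlying NIO behaviour can be shown in isolation; a hedged JDK 8 demo (matching the stack traces above; later JDKs changed this locking):

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class RegisterVsSelect {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));
        server.configureBlocking(false);

        Thread ioThread = new Thread(() -> {
            try {
                selector.select(); // blocks while holding the selector's key-set locks
            } catch (Exception ignored) {
            }
        }, "io-thread");
        ioThread.start();
        Thread.sleep(500); // let the I/O thread enter select()

        // Wake the selector after a delay so the register below eventually completes.
        new Thread(() -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException ignored) {
            }
            selector.wakeup();
        }, "waker").start();

        System.out.println("registering...");
        // On JDK 8 this call is BLOCKED on the key-set monitor, exactly as in the
        // "XNIO-2 Accept" stack trace, until the selector wakes up.
        SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);
        System.out.println("registered: " + key);
    }
}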

As a side note, peek should be synchronized to get a consistent state.
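
A hedged sketch of that suggestion (the lock object and field names are illustrative, not the actual WorkerThread fields):

import java.util.ArrayDeque;

final class GuardedQueue {
    private final Object lock = new Object(); // illustrative lock object
    private final ArrayDeque<Runnable> selectorWorkQueue = new ArrayDeque<>();

    // ArrayDeque is not thread-safe; guarding peek with the same lock used for
    // add/poll gives a consistent view of head and the backing array.
    Runnable peekTask() {
        synchronized (lock) {
            return selectorWorkQueue.peek();
        }
    }

    void addTask(Runnable task) {
        synchronized (lock) {
            selectorWorkQueue.add(task);
        }
    }
}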

 David Lloyd added a comment - 2017/07/28 3:39 PM

Does the problem still occur with this fix?

https://github.com/xnio/xnio/pull/126

 Tarek Hammoud (Inactive) added a comment - 2017/07/28 4:09 PM

David,

It will still occur, as the ArrayDeque is getting corrupted by the mutation of the tail field in addLast() before doubleCapacity() throws the exception (an integer roll-over in our case, as we had a billion entries and the ArrayDeque code was trying to double that).

Undertow is the real culprit, as it inserts hundreds of millions of tasks due to an SSL conduit bug. They did fix that issue. Once the queue is corrupted, poll() always returns null, which causes accept to hang.

I don't know how much you can do about this, as your code does not (and should not) expect the queue to be corrupted due to memory limits caused by a buggy client.

 Tarek Hammoud (Inactive) added a comment - 2017/07/28 4:10 PM

We were only able to consistently reproduce this by using a set of penetration tests. Pure luck. I will get you the name of the test suite if you are interested.

 Marius Tantareanu (Inactive) added a comment - 2017/09/20 3:18 AM

I encountered a similar problem in a WildFly 10.1.0 client application that uses the EJB Client API to invoke some EJBs remotely. At some point the client blocks, and the following XNIO thread causes high CPU usage:

"Remoting "config-based-ejb-client-endpoint" I/O-1" #6025 daemon prio=5 os_prio=0 tid=0x00007f2f709d3000 nid=0x30e0 runnable
[0x00007f2e7f9be000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

locked <0x00000005cc1844e8> (a sun.nio.ch.Util$3)


locked <0x00000005cc19f258> (a java.util.Collections$UnmodifiableSet)
locked <0x00000005cc184468> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.xnio.nio.WorkerThread.run(WorkerThread.java:515)

All the details, including how to consistently reproduce this are described under https://issues.jboss.org/browse/WFLY-9364.

Applying the fix from https://github.com/xnio/xnio/pull/126 does not fix the issue.

 David Lloyd added a comment - 2017/09/20 9:09 AM

mariustant it doesn't look like this is the same problem; I think this is likely an already-fixed bug XNIO-244 or a close cousin to it. You
would have to check the server logs to verify the XNIO version you are using.

 Marius Tantareanu (Inactive) added a comment - 2017/09/20 4:33 PM


I am using XNIO 3.4.0.Final, which comes with WildFly 10.1.0, on both the server side and the client side. This should have the fix from XNIO-244. The problem only occurs on the client side in my case; the server is fine. Also, after a comment on WFLY-9364, I tried using jboss-remoting 4.0.24.Final and XNIO 3.4.6.Final on the client side. The problem is still reproducible.

People

Assignee:
David Lloyd 

Reporter:
Karsten Sperling (Inactive) 

Votes:
0

Watchers:
5

Dates

Created:
2016/03/04 7:41 PM

Updated:
2019/10/02 7:54 PM

Resolved:
2019/10/02 7:54 PM
