[CASSANDRA-14855] Message Flusher scheduling fell off the event loop, resulting in out of memory - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 3.0.18
Component/s: Messaging/Client
Labels: pull-request-available
Severity: Normal

Description

We recently had a production issue in which about 10 nodes of a 96-node cluster ran out of heap.

From heap dump analysis, I believe there is enough evidence that the `queued` data member of the Flusher grew too large, resulting in the out-of-memory condition.
Below are the specifics of what we found in the heap dump (relevant screenshots attached):

- Multiple Flusher instances each had a non-empty "queued" data member with a retained heap of 0.5 GB.
- The "running" data member of the Flusher had the value "true".
- The size of scheduledTasks on the event loop was 0.

We suspect that something (possibly an exception) left the Flusher's running state set to true while preventing it from rescheduling itself on the event loop.
We could not find any ERROR entries in system.log; the only entries around the time of the incident were INFO logs such as the following.

INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
 at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
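
To make the suspected failure mode concrete, here is a minimal single-threaded sketch. It is not Cassandra's Flusher code: `ToyEventLoop`, `ToyFlusher`, and the `dropNextSchedule` switch are hypothetical names, and the dropped reschedule simply stands in for whatever actually went wrong (which this issue has not established). It only shows how a flusher guarded by a "running" flag can end up in the state seen in the heap dump: "running" is true, nothing is scheduled, and "queued" keeps growing.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class StuckFlusherSketch {

    /** Hypothetical single-threaded stand-in for an event loop. */
    static final class ToyEventLoop {
        final Queue<Runnable> scheduledTasks = new ArrayDeque<>();
        boolean dropNextSchedule = false; // assumption: simulates one lost reschedule

        void schedule(Runnable task) {
            if (dropNextSchedule) {
                dropNextSchedule = false;
                return; // the task silently falls off the loop
            }
            scheduledTasks.add(task);
        }

        void runOnce() {
            Runnable task = scheduledTasks.poll();
            if (task != null) task.run();
        }
    }

    /** Hypothetical flusher that only schedules itself when "running" flips to true. */
    static final class ToyFlusher implements Runnable {
        final ToyEventLoop loop;
        final Queue<String> queued = new ConcurrentLinkedQueue<>();
        final AtomicBoolean running = new AtomicBoolean(false);

        ToyFlusher(ToyEventLoop loop) { this.loop = loop; }

        void enqueue(String response) {
            queued.add(response);
            // Producers schedule the flusher only if it is not already marked running.
            if (running.compareAndSet(false, true)) loop.schedule(this);
        }

        @Override
        public void run() {
            boolean doneWork = false;
            while (queued.poll() != null) doneWork = true; // "flush" everything queued
            if (!doneWork) {
                running.set(false); // idle: allow a producer to re-arm us later
                return;
            }
            loop.schedule(this); // re-arm; if this is lost, running stays true
        }
    }

    /** Returns {queued size, scheduled task count, running ? 1 : 0}. */
    static int[] simulate(int lateResponses) {
        ToyEventLoop loop = new ToyEventLoop();
        ToyFlusher flusher = new ToyFlusher(loop);
        flusher.enqueue("first");     // arms the flusher
        loop.dropNextSchedule = true; // the next reschedule will be lost
        loop.runOnce();               // drains "first"; the re-arm is dropped
        for (int i = 0; i < lateResponses; i++) flusher.enqueue("r" + i);
        return new int[] {
            flusher.queued.size(), loop.scheduledTasks.size(), flusher.running.get() ? 1 : 0
        };
    }

    public static void main(String[] args) {
        int[] s = simulate(1000);
        System.out.println("queued=" + s[0] + " scheduledTasks=" + s[1] + " running=" + (s[2] == 1));
    }
}
```

In this toy model, once a single reschedule is lost, no producer can re-arm the flusher because the compare-and-set on "running" never succeeds again, so every later response is retained in "queued".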

I would like to pursue the following proposals to fix this issue:

- ImmediateFlusher: Backport trunk's ImmediateFlusher (CASSANDRA-13651, https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec) to 3.0.x, and possibly to other versions as well. ImmediateFlusher seems more robust than the existing Flusher because it does not depend on any running state or scheduling.
- Bound the "queued" data member of the Flusher so that it cannot grow without limit and cause an out-of-memory condition.
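
The second proposal does not say how the queue would be bounded. As one possible illustration only, the sketch below uses the JDK's `ArrayBlockingQueue` with the non-blocking `offer`, so that a full queue rejects new items instead of growing. The class name `BoundedQueueSketch`, the capacity, and the reject-and-count policy are assumptions made for this example, not part of any Cassandra patch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedQueueSketch {
    private final BlockingQueue<String> queued;
    private long rejected = 0;

    BoundedQueueSketch(int capacity) {
        // Fixed capacity: memory held by the queue cannot exceed `capacity` entries.
        this.queued = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false (and counts a rejection) when the queue is already full. */
    boolean enqueue(String response) {
        boolean accepted = queued.offer(response); // never blocks, never grows past capacity
        if (!accepted) rejected++;
        return accepted;
    }

    int size() { return queued.size(); }

    long rejected() { return rejected; }

    public static void main(String[] args) {
        BoundedQueueSketch q = new BoundedQueueSketch(3);
        for (int i = 0; i < 5; i++) q.enqueue("r" + i);
        System.out.println("size=" + q.size() + " rejected=" + q.rejected());
    }
}
```

What to do with a rejected response (apply backpressure, drop it, or close the connection) is a separate design decision that this sketch leaves open.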

Attachments


- read_latency.png, 29/Oct/18 05:48, 536 kB, Sumanth Pasupuleti
- heap.png, 29/Oct/18 05:48, 1.23 MB, Sumanth Pasupuleti
- heap_dump.png, 29/Oct/18 05:48, 1.10 MB, Sumanth Pasupuleti
- flusher running state.png, 29/Oct/18 05:48, 208 kB, Sumanth Pasupuleti
- eventloop_scheduledtasks.png, 29/Oct/18 05:48, 239 kB, Sumanth Pasupuleti
- cpu.png, 29/Oct/18 05:48, 1.06 MB, Sumanth Pasupuleti
- blocked_thread_pool.png, 29/Oct/18 05:48, 330 kB, Sumanth Pasupuleti

Issue Links

links to

GitHub Pull Request #293

People

Assignee: Sumanth Pasupuleti

Reporter: Sumanth Pasupuleti

Authors: Sumanth Pasupuleti

Reviewers: Benedict Elliott Smith

Votes: 0

Watchers: 7

Dates

Created: 29/Oct/18 05:49

Updated: 16/Mar/22 12:07

Resolved: 06/Dec/18 15:59

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h