Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: None
Environment: AWS EMR 6.1.0, Spark 3.0.0, Kinesis
Description
A Spark Streaming application runs with a 60-second batch interval.
The application runs fine, processing each batch in around 40 seconds. After roughly 8,600 batches (about 6 days), it suddenly hits a wall: processing time jumps to 2-2.4 minutes per batch, and the application eventually dies with exit code 137. This happens consistently every 6 days, regardless of the data being processed.
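For context, a minimal sketch of the job's general shape follows, assuming the standard Kinesis DStream setup; the stream name, region, endpoint, checkpoint app name, and per-batch work are placeholders, not the actual configuration.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

    // Hypothetical reconstruction of the job's shape; all names below are placeholders.
    val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-app"), Seconds(60)) // 60s batch interval

    val stream = KinesisInputDStream.builder
      .streamingContext(ssc)
      .streamName("example-stream")                           // placeholder stream name
      .endpointUrl("https://kinesis.us-east-1.amazonaws.com") // placeholder endpoint
      .regionName("us-east-1")                                // placeholder region
      .initialPosition(new KinesisInitialPositions.Latest())
      .checkpointAppName("kinesis-app")                       // DynamoDB lease table name
      .checkpointInterval(Seconds(60))
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      .build()

    stream.foreachRDD { rdd =>
      val count = rdd.count() // stand-in for the real per-batch work (~40s per batch)
      println(s"batch processed $count records")
    }

    ssc.start()
    ssc.awaitTermination()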
Looking at the application logs, it appears that when the issue begins, tasks are still being completed by the executors, but the driver takes a long time to acknowledge them. I have taken numerous memory dumps of the driver (before it hits the 6-day wall) using jcmd, and I can see that org.apache.spark.scheduler.AsyncEventQueue keeps growing even though the application is keeping up with its batches. I have yet to take a snapshot of the application in the broken state.
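For reference, the jcmd inspection described above is roughly of this shape; the driver PID, class-name filter, and dump path are placeholders, not the exact commands used:

    # histogram of live objects in the driver JVM, filtered to listener-bus related classes
    jcmd <driver-pid> GC.class_histogram | grep -iE 'AsyncEventQueue|SparkListener'

    # full heap dump for offline analysis (e.g. in Eclipse MAT or VisualVM)
    jcmd <driver-pid> GC.heap_dump /tmp/driver-heap.hprof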