Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: None
Environment: AWS EMR 6.1.0, Spark 3.0.0, Kinesis
Description
A Spark Streaming application runs with a 60-second batch interval.
The application runs fine, processing each batch in around 40 seconds. After roughly 8,600 batches (about 6 days), it suddenly hits a wall: processing time jumps to 2-2.4 minutes per batch, and the application eventually dies with exit code 137. This happens consistently every 6 days, regardless of the data being processed.
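For context, a minimal sketch of the job's general shape follows, assuming the standard Kinesis DStream setup; the stream name, region, endpoint, checkpoint app name, and per-batch work are placeholders, not the actual configuration.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

    // Hypothetical reconstruction of the job's shape; all names below are placeholders.
    val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-app"), Seconds(60)) // 60s batch interval

    val stream = KinesisInputDStream.builder
      .streamingContext(ssc)
      .streamName("example-stream")                           // placeholder stream name
      .endpointUrl("https://kinesis.us-east-1.amazonaws.com") // placeholder endpoint
      .regionName("us-east-1")                                // placeholder region
      .initialPosition(new KinesisInitialPositions.Latest())
      .checkpointAppName("kinesis-app")                       // DynamoDB lease table name
      .checkpointInterval(Seconds(60))
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      .build()

    stream.foreachRDD { rdd =>
      val count = rdd.count() // stand-in for the real per-batch work (~40s per batch)
      println(s"batch processed $count records")
    }

    ssc.start()
    ssc.awaitTermination()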
Looking at the application logs, it appears that when the issue begins, tasks are still being completed by the executors, but the driver takes a long time to acknowledge them. I have taken numerous memory dumps of the driver (before it hits the 6-day wall) using jcmd, and I can see that org.apache.spark.scheduler.AsyncEventQueue keeps growing even though the application is keeping up with its batches. I have yet to take a snapshot of the application in the broken state.
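For reference, the jcmd inspection described above is roughly of this shape; the driver PID, class-name filter, and dump path are placeholders, not the exact commands used:

    # histogram of live objects in the driver JVM, filtered to listener-bus related classes
    jcmd <driver-pid> GC.class_histogram | grep -iE 'AsyncEventQueue|SparkListener'

    # full heap dump for offline analysis (e.g. in Eclipse MAT or VisualVM)
    jcmd <driver-pid> GC.heap_dump /tmp/driver-heap.hprof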