[BAHIR-213] Faster S3 file Source for Structured Streaming with SQS - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Spark-2.4.0
Fix Version/s: Spark-2.4.0
Component/s: Spark Structured Streaming Connectors
Labels: None

External issue URL:
https://issues.apache.org/jira/browse/SPARK-28124

Description

Using FileStreamSource to read files from an S3 bucket has problems in terms of both latency and cost:

Latency: Listing all the files in an S3 bucket every microbatch can be both slow and resource-intensive.
Cost: Making List API requests to S3 every microbatch can be costly.
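
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that one S3 List call returns at most 1000 keys and that every microbatch lists the whole prefix once; the function names and the example numbers are hypothetical and not taken from this issue.

```python
import math

# Assumption: a single S3 List call returns at most 1000 keys.
MAX_KEYS_PER_LIST_CALL = 1000
SECONDS_PER_DAY = 86400


def list_requests_per_batch(objects_in_prefix):
    """List calls needed to enumerate the prefix once (at least one call)."""
    return max(1, math.ceil(objects_in_prefix / MAX_KEYS_PER_LIST_CALL))


def list_requests_per_day(objects_in_prefix, batch_interval_s):
    """List calls per day if every microbatch lists the whole prefix."""
    batches_per_day = int(SECONDS_PER_DAY // batch_interval_s)
    return batches_per_day * list_requests_per_batch(objects_in_prefix)


if __name__ == "__main__":
    # Hypothetical workload: 2,000,000 objects, one microbatch every 10 s.
    print(list_requests_per_batch(2_000_000))
    print(list_requests_per_day(2_000_000, 10))
```

Under these assumptions the request count grows with the total number of objects in the bucket, not with the number of new files, which is why both latency and cost get worse as the bucket fills up.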

The solution is to use Amazon Simple Queue Service (SQS), which lets you find new files written to an S3 bucket without listing all the files every microbatch.

S3 buckets can be configured to send a notification to an Amazon SQS queue on Object Create / Object Delete events. For details, see "Configuring S3 Event Notifications" in the AWS documentation.
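
As a minimal sketch of what such a configuration could look like, the Python snippet below builds the notification payload in the shape accepted by boto3's `put_bucket_notification_configuration`. The helper name, the configuration `Id`, the queue ARN, and the bucket name are hypothetical; the snippet only builds the dictionary, and the actual AWS call is left as a comment because it needs credentials and a queue policy that allows S3 to send messages.

```python
import json


def build_queue_notification(queue_arn, prefix=""):
    """Build an S3 notification configuration that targets an SQS queue.

    Only object-creation events are requested here; add "s3:ObjectRemoved:*"
    to the event list if delete notifications are needed as well.
    """
    queue_config = {
        "Id": "new-files-to-sqs",  # hypothetical identifier
        "QueueArn": queue_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if prefix:
        # Optional: restrict notifications to keys under a prefix.
        queue_config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return {"QueueConfigurations": [queue_config]}


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:sqs:us-east-1:123456789012:new-files-queue"
    config = build_queue_notification(arn, prefix="incoming/")
    print(json.dumps(config, indent=2))
    # Applying it would look roughly like this (requires boto3 and credentials):
    #   import boto3
    #   boto3.client("s3").put_bucket_notification_configuration(
    #       Bucket="my-bucket", NotificationConfiguration=config)
```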

Spark can leverage this to find new files written to the S3 bucket by reading notifications from the SQS queue instead of listing files every microbatch.
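
The core of that idea can be sketched in Python as follows: take the body of one SQS message, treat it as an S3 event notification, and extract the paths of newly created objects. This is an illustration, not the connector's actual code. The function name, the sample message, and the `s3://` path scheme are assumptions; the sketch relies on the documented notification layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) and on object keys being URL-encoded in the message.

```python
import json
from typing import List
from urllib.parse import unquote_plus


def new_file_paths(message_body: str) -> List[str]:
    """Return paths of objects created according to one S3 event notification.

    Messages without a "Records" list (for example the test event S3 sends
    when notifications are first enabled) yield an empty list.
    """
    event = json.loads(message_body)
    paths = []
    for record in event.get("Records", []):
        # Skip delete events and anything else that is not an object creation.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths


if __name__ == "__main__":
    # Hypothetical, trimmed-down notification body.
    sample = json.dumps({
        "Records": [
            {"eventName": "ObjectCreated:Put",
             "s3": {"bucket": {"name": "my-bucket"},
                    "object": {"key": "data/part+0001.json"}}},
            {"eventName": "ObjectRemoved:Delete",
             "s3": {"bucket": {"name": "my-bucket"},
                    "object": {"key": "data/old.json"}}},
        ]
    })
    print(new_file_paths(sample))
```

With this approach the work per microbatch is proportional to the number of new notifications rather than to the total number of objects in the bucket.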

I hope to contribute the changes proposed in this pull request to Apache Bahir, as suggested by gaborgsomogyi.

Issue Links

relates to: SPARK-28124 Faster S3 file source with SQS (Resolved)
links to: GitHub Pull Request #91

People

Assignee: Abhishek Dixit
Reporter: Abhishek Dixit
Votes: 0
Watchers: 4

Dates

Created: 24/Jul/19 13:02
Updated: 30/Dec/19 04:43
Resolved: 28/Dec/19 20:41