Details

Type: New Feature
Status: To Do
Priority: Major
Resolution: Unresolved
Description
By default, S2Graph publishes all edge/vertex requests to Kafka in WAL format.
At Kakao, S2Graph has been used as a master database that stores all users' activities.
I have been developing several ETL jobs that fit these use cases, and I would like to contribute them.
The use cases are as follows:
- save edges/vertices incoming through Kafka to other storages
  - Druid sink for slice and dice
  - ES sink for search
  - file sink for storage
- ingest edges/vertices from various storages into S2Graph
  - MySQL binlog
  - HDFS/Hive/HBase
- run ETL jobs on edge/vertex data
  - merge all user activities by userId
  - generate statistical information
  - apply ML libraries to the graph data format
Below are some basic requirements:
- support both streaming and static (batch) source data processing
- make the computation flow reusable and shareable between streaming and batch
- operate jobs through a simple job description (see the sketch after this list)
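For the last requirement, here is a minimal sketch of what such a job description could look like, expressed as plain Scala values. The TaskConf/JobDesc names, fields, and options are hypothetical illustrations for this proposal, not an existing S2Graph API.

    // Hypothetical job-description model; names and fields are illustrative only.
    case class TaskConf(
      name: String,                              // task identifier, e.g. "kafka_wal"
      `type`: String,                            // "source", "process", or "sink"
      inputs: Seq[String] = Nil,                 // upstream task names
      options: Map[String, String] = Map.empty)

    case class JobDesc(name: String, tasks: Seq[TaskConf])

    // Example: read WAL records from a Kafka topic and write them to a Druid sink.
    val job = JobDesc(
      name = "wal_to_druid",
      tasks = Seq(
        TaskConf("kafka_wal", "source",
          options = Map(
            "format" -> "kafka",
            "kafka.bootstrap.servers" -> "localhost:9092",
            "subscribe" -> "s2graph_wal")),
        TaskConf("druid_sink", "sink",
          inputs = Seq("kafka_wal"),
          options = Map("format" -> "druid", "datasource" -> "user_activity"))))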
Spark Structured Streaming provides a unified API for both streaming and batch via the DataFrame/Dataset API from Spark SQL.
It allows the same operations to be executed on bounded and unbounded data sources and guarantees exactly-once fault tolerance.
Structured Streaming ships with several built-in Sources and Sinks, and it also supports implementing custom Source/Sink interfaces.
Using this, we can easily develop ETL jobs that connect to various storages, as sketched below.
Reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
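As a rough illustration of how the unified API lets batch and streaming share one computation flow, here is a small Scala sketch. The WAL schema (a userId column), topic name, and paths are assumptions made for the example only, not part of this proposal.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object UnifiedEtlSketch {
      // One transformation reused by both the batch and the streaming pipeline.
      // Assumes WAL records expose a userId column; adjust to the real WAL schema.
      def countByUser(df: DataFrame): DataFrame =
        df.groupBy(col("userId")).count()

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("unified-etl-sketch").getOrCreate()
        import spark.implicits._

        // Batch: bounded source, e.g. a WAL dump on HDFS (hypothetical path).
        val staticDf = spark.read.json("hdfs:///tmp/s2graph/wal")
        countByUser(staticDf).write.mode("overwrite").parquet("hdfs:///tmp/out/batch")

        // Streaming: unbounded source, the Kafka WAL topic (hypothetical name).
        val streamDf = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "s2graph_wal")
          .load()
          .select(from_json($"value".cast("string"), staticDf.schema).as("wal"))
          .select("wal.*")

        // The same countByUser flow now runs continuously on the unbounded source.
        countByUser(streamDf).writeStream
          .outputMode("complete")
          .format("console")                 // could be replaced by a custom sink
          .option("checkpointLocation", "/tmp/s2graph/checkpoint")
          .start()
          .awaitTermination()
      }
    }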
Attachments
Issue Links
- links to
  1. support custom udf class (In Progress, Chul Kang)
  2. Reporting streaming job metrics (To Do, Chul Kang)
  3. Provide JdbcSource/Sink (To Do, Chul Kang)