Details

Type: New Feature
Status: To Do
Priority: Major
Resolution: Unresolved
Description
By default, S2Graph publishes all edge/vertex requests to Kafka in WAL format.
At Kakao, S2Graph has been used as a master database that stores all users' activities.
I have been developing several ETL jobs that fit these use cases, and I would like to contribute them.
The use cases are as follows:
- save edges/vertices incoming through Kafka to other storages
  - Druid sink for slice and dice
  - ES sink for search
  - file sink for storage
- ingest edges/vertices from various storages into S2Graph
  - MySQL binlog
  - HDFS/Hive/HBase
- run ETL jobs on edge/vertex data
  - merge all user activities by userId
  - generate statistical information
  - apply ML libraries to the graph data format
Below are some basic requirements:
- support both streaming and static (batch) source data processing
- make the computation flow reusable and shareable between streaming and batch
- operate jobs through a simple job description (see the sketch after this list)
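For the last requirement, here is a minimal sketch of what such a job description could look like, expressed as plain Scala values. The TaskConf/JobDesc names, fields, and options are hypothetical illustrations for this proposal, not an existing S2Graph API.

    // Hypothetical job-description model; names and fields are illustrative only.
    case class TaskConf(
      name: String,                              // task identifier, e.g. "kafka_wal"
      `type`: String,                            // "source", "process", or "sink"
      inputs: Seq[String] = Nil,                 // upstream task names
      options: Map[String, String] = Map.empty)

    case class JobDesc(name: String, tasks: Seq[TaskConf])

    // Example: read WAL records from a Kafka topic and write them to a Druid sink.
    val job = JobDesc(
      name = "wal_to_druid",
      tasks = Seq(
        TaskConf("kafka_wal", "source",
          options = Map(
            "format" -> "kafka",
            "kafka.bootstrap.servers" -> "localhost:9092",
            "subscribe" -> "s2graph_wal")),
        TaskConf("druid_sink", "sink",
          inputs = Seq("kafka_wal"),
          options = Map("format" -> "druid", "datasource" -> "user_activity"))))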
Spark Structured Streaming provides a unified API for both streaming and batch via the DataFrame/Dataset API from Spark SQL.
It allows the same operations to be executed on bounded and unbounded data sources and guarantees exactly-once fault tolerance.
Structured Streaming ships with several built-in Sources and Sinks, and it also supports implementing custom Source/Sink interfaces.
Using this, we can easily develop ETL jobs that connect to various storages, as sketched below.
Reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
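As a rough illustration of how the unified API lets batch and streaming share one computation flow, here is a small Scala sketch. The WAL schema (a userId column), topic name, and paths are assumptions made for the example only, not part of this proposal.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object UnifiedEtlSketch {
      // One transformation reused by both the batch and the streaming pipeline.
      // Assumes WAL records expose a userId column; adjust to the real WAL schema.
      def countByUser(df: DataFrame): DataFrame =
        df.groupBy(col("userId")).count()

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("unified-etl-sketch").getOrCreate()
        import spark.implicits._

        // Batch: bounded source, e.g. a WAL dump on HDFS (hypothetical path).
        val staticDf = spark.read.json("hdfs:///tmp/s2graph/wal")
        countByUser(staticDf).write.mode("overwrite").parquet("hdfs:///tmp/out/batch")

        // Streaming: unbounded source, the Kafka WAL topic (hypothetical name).
        val streamDf = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "s2graph_wal")
          .load()
          .select(from_json($"value".cast("string"), staticDf.schema).as("wal"))
          .select("wal.*")

        // The same countByUser flow now runs continuously on the unbounded source.
        countByUser(streamDf).writeStream
          .outputMode("complete")
          .format("console")                 // could be replaced by a custom sink
          .option("checkpointLocation", "/tmp/s2graph/checkpoint")
          .start()
          .awaitTermination()
      }
    }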
Attachments
Issue Links
- links to
  1. support custom udf class (In Progress, Chul Kang)
  2. Reporting streaming job metrics (To Do, Chul Kang)
  3. Provide JdbcSource/Sink (To Do, Chul Kang)