  S2Graph / S2GRAPH-185

Support Spark Structured Streaming to work with data in streaming and batch

Details

    • Type: New Feature
    • Status: To Do
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: s2jobs
    • Labels: None

    Description

      By default, S2Graph publishes all edge/vertex requests to Kafka in WAL format.
      At Kakao, S2Graph has been used as a master database that stores all user activities.
      I have been developing several ETL jobs suited to these use cases, and I would like to contribute them.

      The use cases are as follows:

      Save edges/vertices arriving through Kafka to other storage systems
      - Druid sink for slice-and-dice analysis
      - Elasticsearch (ES) sink for search
      - file sink for storing edges/vertices
      
      Ingest data from various storage systems into S2Graph
      - MySQL binlog
      - HDFS/Hive/HBase
      
      Run ETL jobs on edge/vertex data
      - merge all user activities by userId
      - generate statistical information
      - apply ML libraries to data in graph format
      

       

      Below are some simple requirements for this:

      • support processing of both streaming and static source data
      • make the computation flow reusable and shared between streaming and batch
      • run jobs from a simple job description
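
      To make the last requirement concrete, here is a minimal sketch of what a job description and a checker for it might look like. The JSON layout (`source`, `process`, `sink`, `inputs`, `options`), the topic name `s2graph-wal`, the output path, and the `plan` helper are all hypothetical and chosen for illustration; they are not an existing S2Graph format or API.

```python
import json

# Hypothetical job description: every key and value below is an assumption
# made for illustration, not a format defined by S2Graph.
JOB_DESCRIPTION = """
{
  "name": "wal_to_file",
  "source": [
    {"name": "wal", "type": "kafka", "options": {"subscribe": "s2graph-wal"}}
  ],
  "process": [
    {"name": "inserts", "inputs": ["wal"], "type": "sql",
     "options": {"sql": "SELECT * FROM wal WHERE operation = 'insert'"}}
  ],
  "sink": [
    {"name": "out", "inputs": ["inserts"], "type": "file",
     "options": {"path": "/tmp/edges"}}
  ]
}
"""


def plan(job):
    """Return step names in execution order.

    Steps are visited section by section (source, process, sink) in the
    order they are listed, so each step must appear after the steps it
    names in "inputs". An unknown input raises ValueError.
    """
    defined = []
    for section in ("source", "process", "sink"):
        for step in job.get(section, []):
            for dep in step.get("inputs", []):
                if dep not in defined:
                    raise ValueError(f"{step['name']}: unknown input {dep!r}")
            defined.append(step["name"])
    return defined


job = json.loads(JOB_DESCRIPTION)
print(plan(job))  # → ['wal', 'inserts', 'out']
```

      Because the description is plain data, the same file could in principle drive either a batch or a streaming run; only the way the source is opened would differ.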

       

      Spark Structured Streaming provides a unified API for both streaming and batch processing, built on the DataFrame/Dataset API from Spark SQL.
      It allows the same operations to be executed on bounded and unbounded data sources and, with replayable sources and idempotent sinks, provides end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs.
      Structured Streaming ships with several built-in sources and sinks, and it also supports custom implementations of the Source/Sink interfaces.

      Using this, we can easily develop ETL jobs that connect to various storage systems.
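
      As a rough illustration of sharing one computation flow between batch and streaming, the PySpark sketch below applies the same function to a static DataFrame and to a streaming one. It assumes JSON records with hypothetical `label` and `operation` columns (not S2Graph's actual WAL format) and uses the file source instead of Kafka to stay short; the function names are likewise made up for this example.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; running the jobs needs pyspark
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical record layout, NOT S2Graph's actual WAL format.
EDGE_SCHEMA = "label STRING, operation STRING"


def count_inserts_by_label(edges: DataFrame) -> DataFrame:
    """Shared computation: accepts a bounded or an unbounded DataFrame."""
    return edges.filter("operation = 'insert'").groupBy("label").count()


def run_batch(spark: SparkSession, path: str) -> None:
    """Batch: read the JSON files under `path` once and print the counts."""
    edges = spark.read.schema(EDGE_SCHEMA).json(path)
    count_inserts_by_label(edges).show()


def run_streaming(spark: SparkSession, path: str) -> None:
    """Streaming: treat files arriving under `path` as an unbounded source."""
    edges = spark.readStream.schema(EDGE_SCHEMA).json(path)
    query = (
        count_inserts_by_label(edges)
        .writeStream.outputMode("complete")  # emit the full aggregate each trigger
        .format("console")
        .start()
    )
    query.awaitTermination()
```

      Given a `SparkSession` named `spark`, calling `run_batch(spark, path)` or `run_streaming(spark, path)` runs the same `count_inserts_by_label` logic; only the reader and writer differ.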

       

      Reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

       

            People

              Assignee: Chul Kang
              Reporter: Chul Kang
              Votes: 3
              Watchers: 3

                Time Tracking

                  Original Estimate: 168h
                  Remaining Estimate: 168h
                  Time Spent: Not Specified