Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8360

Structured Streaming (aka Streaming DataFrames)

Details

    • Umbrella
    • Status: Resolved
    • Major
    • Resolution: Implemented
    • None
    • 2.1.0
    • Structured Streaming
    • None

    Description

      Umbrella ticket to track what's needed to make streaming DataFrame a reality.

      Attachments

        Issue Links

          1.
          API design: convergence of batch and streaming DataFrame Sub-task Resolved Reynold Xin
          2.
          Initial infrastructure Sub-task Resolved Michael Armbrust
          3.
          API design: external state management Sub-task Closed Unassigned
          4.
          API for managing streaming dataframes Sub-task Resolved Tathagata Das
          5.
          Add FileStreamSource Sub-task Resolved Shixiong Zhu
          6.
          Remove DataStreamReader/Writer Sub-task Resolved Reynold Xin
          7.
          Rename DataFrameWriter.stream DataFrameWriter.startStream Sub-task Resolved Reynold Xin
          8.
          State Store: A new framework for state management for computing Streaming Aggregates Sub-task Resolved Tathagata Das
          9.
          Old streaming DataFrame proposal by Cheng Hao (Intel) Sub-task Closed Cheng Hao
          10.
          WAL for determistic batches with IDs Sub-task Resolved Michael Armbrust
          11.
          Simple FileSink for Parquet Sub-task Resolved Michael Armbrust
          12.
          Windowing for structured streaming Sub-task Resolved Burak Yavuz
          13.
          Add processing time trigger Sub-task Resolved Shixiong Zhu
          14.
          Streaming Aggregation Sub-task Resolved Michael Armbrust
          15.
          Method to determine if Dataset is bounded or not Sub-task Resolved Burak Yavuz
          16.
          Memory Sink Sub-task Resolved Michael Armbrust
          17.
          Define analysis rules for operations not supported in streaming Sub-task Resolved Tathagata Das
          18.
          Python API for methods introduced for Structured Streaming Sub-task Resolved Burak Yavuz
          19.
          Add partitioned parquet support file stream sink Sub-task Resolved Tathagata Das
          20.
          Refactor DataSource to ensure schema is inferred only once when creating a file stream Sub-task Resolved Tathagata Das
          21.
          Refactor StreamTests to test for source fault-tolerance correctly. Sub-task Resolved Tathagata Das
          22.
          Add support in file stream source for reading new files added to subdirs Sub-task Resolved Tathagata Das
          23.
          Add support for batch jobs correctly inferring partitions from data written with file stream sink Sub-task Resolved Tathagata Das
          24.
          Disable support for multiple streaming aggregations Sub-task Resolved Tathagata Das
          25.
          Disable schema inference for streaming datasets on file streams Sub-task Resolved Tathagata Das
          26.
          Add support for complete output mode Sub-task Resolved Tathagata Das
          27.
          Make continuous Parquet writes consistent with non-continuous Parquet writes Sub-task Closed Unassigned
          28.
          Allow sorting on aggregated streaming dataframe when the output mode is Complete Sub-task Resolved Tathagata Das
          29.
          Add support for socket stream. Sub-task Closed Prashant Sharma
          30.
          Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery Sub-task Resolved Shixiong Zhu
          31.
          Add a unique id to ContinuousQuery Sub-task Resolved Tathagata Das
          32.
          Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer Sub-task Resolved Tathagata Das
          33.
          Renamed ContinuousQuery to StreamingQuery for simplicity Sub-task Resolved Tathagata Das
          34.
          Fix bug in python DataStreamReader Sub-task Resolved Tathagata Das
          35.
          Properly explain the streaming queries Sub-task Resolved Shixiong Zhu
          36.
          Fix complete mode aggregation with console sink Sub-task Resolved Shixiong Zhu
          37.
          Sleep when no new data arrives to avoid 100% CPU usage Sub-task Resolved Shixiong Zhu
          38.
          Enable test for sql/streaming.py and fix these tests Sub-task Resolved Shixiong Zhu
          39.
          HDFSMetadataLog.get leaks the input stream Sub-task Resolved Shixiong Zhu
          40.
          Add ContinuousQueryInfo to make ContinuousQueryListener events serializable Sub-task Resolved Shixiong Zhu
          41.
          Add network word count example Sub-task Resolved James Thomas
          42.
          StreamExecution.awaitOffset may take too long because of thread starvation Sub-task Resolved Shixiong Zhu
          43.
          Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering" Sub-task Resolved Shixiong Zhu
          44.
          Add a file sink log to support versioning and compaction Sub-task Resolved Shixiong Zhu
          45.
          Fix a race condition in StreamExecution.processAllAvailable Sub-task Resolved Shixiong Zhu
          46.
          Fix the race conditions in MemoryStream and MemorySink Sub-task Resolved Shixiong Zhu
          47.
          Move FileSource offset log into checkpointLocation Sub-task Resolved Shixiong Zhu
          48.
          Add a note to warn that onQueryProgress is asynchronous Sub-task Resolved Shixiong Zhu
          49.
          QueryProgress should be post after committedOffsets is updated Sub-task Resolved Shixiong Zhu
          50.
          StateStoreCoordinator should extend ThreadSafeRpcEndpoint Sub-task Resolved Shixiong Zhu
          51.
          Allow multiple continuous queries to be started from the same DataFrame Sub-task Resolved Shixiong Zhu
          52.
          Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite Sub-task Resolved Shixiong Zhu
          53.
          Add MetadataLog and HDFSMetadataLog Sub-task Resolved Shixiong Zhu
          54.
          ContinuousQueryManagerSuite floods the logs with garbage Sub-task Resolved Shixiong Zhu
          55.
          Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering Sub-task Resolved Shixiong Zhu
          56.
          Add ConsoleSink for structure streaming to display the dataframe on the fly Sub-task Resolved Saisai Shao
          57.
          Flaky Test: Complete aggregation with Console sink Sub-task Resolved Shixiong Zhu
          58.
          ConsoleSink should not require checkpointLocation Sub-task Resolved Shixiong Zhu
          59.
          Add Structured Streaming Programming Guide Sub-task Resolved Tathagata Das
          60.
          Move python DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming package Sub-task Resolved Tathagata Das
          61.
          Add an option in file stream source to read 1 file at a time Sub-task Resolved Tathagata Das
          62.
          Fix StreamingQueryListener to return message and stacktrace of actual exception Sub-task Resolved Tathagata Das
          63.
          Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError Sub-task Resolved Tathagata Das
          64.
          Metrics for Structured Streaming Sub-task Resolved Tathagata Das
          65.
          Add methods to convert StreamingQueryStatus to json Sub-task Resolved Tathagata Das
          66.
          History Server is broken because of the refactoring work in Structured Streaming Sub-task Resolved Shixiong Zhu
          67.
          ForeachSink should fail the Spark job if `process` throws exception Sub-task Resolved Shixiong Zhu
          68.
          State Store leaks temporary files Sub-task Resolved Tathagata Das
          69.
          Fix FileStreamSink with aggregation + watermark + append mode Sub-task Resolved Tathagata Das
          70.
          Rename triggerId to batchId in StreamingQueryStatus.triggerDetails Sub-task Resolved Tathagata Das
          71.
          Include triggerDetails in StreamingQueryStatus.json Sub-task Resolved Tathagata Das
          72.
          Improve docs on StreamingQueryListener and StreamingQuery.status Sub-task Resolved Tathagata Das
          73.
          Add StreamingQuery.status in python Sub-task Closed Tathagata Das
          74.
          Enable interrupts for HDFS in HDFSMetadataLog Sub-task Resolved Shixiong Zhu

          Activity

            jd Joseph Batchik added a comment -

            Would streaming DataFrames replace streaming RDDs or coincide with it?

            jd Joseph Batchik added a comment - Would streaming DataFrames replace streaming RDDs or coincide with it?
            adrian-wang Adrian Wang added a comment -

            https://github.com/intel-bigdata/spark-streamingsql
            Our streaming sql project is highly related to this jira ticket.

            adrian-wang Adrian Wang added a comment - https://github.com/intel-bigdata/spark-streamingsql Our streaming sql project is highly related to this jira ticket.
            smilegator Xiao Li added a comment -

            Streaming Dataframe will be built on Dataset APIs https://issues.apache.org/jira/browse/SPARK-9999?

            smilegator Xiao Li added a comment - Streaming Dataframe will be built on Dataset APIs https://issues.apache.org/jira/browse/SPARK-9999?
            chenghao Cheng Hao added a comment - - edited

            Remove the google docs link, as I cannot make it access for anyone when using the corp account. In the meantime, I put an pdf doc, hopefully helpful.

            chenghao Cheng Hao added a comment - - edited Remove the google docs link, as I cannot make it access for anyone when using the corp account. In the meantime, I put an pdf doc, hopefully helpful.
            smilegator Xiao Li added a comment -

            "You need permission to access this published document." I got this message when accessing it. Could you make it publicly available?

            Thank you!

            smilegator Xiao Li added a comment - "You need permission to access this published document." I got this message when accessing it. Could you make it publicly available? Thank you!
            chenghao Cheng Hao added a comment -

            This is a proposal for streaming dataframes that we were trying to work, hopefully helpful for the new design.

            chenghao Cheng Hao added a comment - This is a proposal for streaming dataframes that we were trying to work, hopefully helpful for the new design.
            ourui521314 Ou Rui added a comment -

            Which version will release this feature?I'd like to help test this feature in product

            ourui521314 Ou Rui added a comment - Which version will release this feature?I'd like to help test this feature in product
            linbojin Linbo added a comment -

            Spark 2.0 on April/May

            linbojin Linbo added a comment - Spark 2.0 on April/May

            Hi tdas,marmbrus,rxin

            Any docs on how to consume the new APIs (if they are available)? Or any pointers in the current code which I can go through to play around with the new feature.

            I see some test cases related to using the streams on read() but don't find any pointers to which class can be used in .format(). The test suite is working out of the DefaultSource class defined within the DataFrameReaderWriterSuite but I suppose there is would be something consumable in the source.

            Thanks

            Praveen

            praveend Praveen Devarao added a comment - Hi tdas , marmbrus , rxin Any docs on how to consume the new APIs (if they are available)? Or any pointers in the current code which I can go through to play around with the new feature. I see some test cases related to using the streams on read() but don't find any pointers to which class can be used in .format(). The test suite is working out of the DefaultSource class defined within the DataFrameReaderWriterSuite but I suppose there is would be something consumable in the source. Thanks Praveen
            tdas Tathagata Das added a comment -

            This is still highly WIP, and not ready for even experimental consumption. So please sit tight until we have something ready by Spark 2.0 release.

            tdas Tathagata Das added a comment - This is still highly WIP, and not ready for even experimental consumption. So please sit tight until we have something ready by Spark 2.0 release.

            Hi tdas

            Thanks for the update. I would like to join hands with you guys contributing to the dev efforts of structured streaming...could you let me know how can I be of help?

            We (@ IBM) are looking at contributing Complex Event Processing (CEP) feature to spark streaming and have done some initial work on supporting the same in Spark 1.6....now that we learn of structured streaming we want to ensure we get the CEP feature enabled on Spark 2.0 (as it makes more sense). For this we would like be part of the structured streaming efforts so that we get to understand it better and contributions will be inline with design.

            Let me know if you will need more information. We would be happy to have a call too for discussion on CEP (if you think that is better to start with).

            Thanks

            Praveen

            praveend Praveen Devarao added a comment - Hi tdas Thanks for the update. I would like to join hands with you guys contributing to the dev efforts of structured streaming...could you let me know how can I be of help? We (@ IBM) are looking at contributing Complex Event Processing (CEP) feature to spark streaming and have done some initial work on supporting the same in Spark 1.6....now that we learn of structured streaming we want to ensure we get the CEP feature enabled on Spark 2.0 (as it makes more sense). For this we would like be part of the structured streaming efforts so that we get to understand it better and contributions will be inline with design. Let me know if you will need more information. We would be happy to have a call too for discussion on CEP (if you think that is better to start with). Thanks Praveen
            rxin Reynold Xin added a comment -

            design doc - draft 1 - PDF version

            rxin Reynold Xin added a comment - design doc - draft 1 - PDF version
            rxin Reynold Xin added a comment -

            I've uploaded the first major design doc for this task – covering api and semantics. This is not set in stone, and we'd love to get some feedback and iterate on the model as well as the explanation of it. The best way is to comment on it directly in Google Docs.

            rxin Reynold Xin added a comment - I've uploaded the first major design doc for this task – covering api and semantics. This is not set in stone, and we'd love to get some feedback and iterate on the model as well as the explanation of it. The best way is to comment on it directly in Google Docs.
            rxin Reynold Xin added a comment -

            Note that there was an old streaming dataframe doc that was fairly incomplete. I moved it to SPARK-13875.

            rxin Reynold Xin added a comment - Note that there was an old streaming dataframe doc that was fairly incomplete. I moved it to SPARK-13875 .
            arnaud.bailly Arnaud Bailly added a comment -

            I have a question regarding the semantics of the "complete" output mode but I am not sure this is the right place to ask.
            Given some aggregation query I would expect a "complete" streaming request to result total aggregation over all values in the stream, past and new, but running a simple experiment with latest code at HEAD shows this is not the case : The streaming query returns result of running the query on new data only. My query looks something like:

            select key, sum(value) from table1 t1, stream2 t2 where t1.pk = t2.pk w group by key;
            

            with table1 a non-streaming DataFrame and stream2 a streaming DataFrame.

            Am I missing/misunderstanding something?

            arnaud.bailly Arnaud Bailly added a comment - I have a question regarding the semantics of the "complete" output mode but I am not sure this is the right place to ask. Given some aggregation query I would expect a "complete" streaming request to result total aggregation over all values in the stream, past and new, but running a simple experiment with latest code at HEAD shows this is not the case : The streaming query returns result of running the query on new data only. My query looks something like: select key, sum(value) from table1 t1, stream2 t2 where t1.pk = t2.pk w group by key; with table1 a non-streaming DataFrame and stream2 a streaming DataFrame. Am I missing/misunderstanding something?
            arnaud.bailly Arnaud Bailly added a comment -

            Of course, this is not a big deal when computing sum but fails when computing something like avg(value).

            arnaud.bailly Arnaud Bailly added a comment - Of course, this is not a big deal when computing sum but fails when computing something like avg(value) .

            This kind of question would be better asked on the spark-user list, or if you are sure its a bug in a separate JIRA. If you do go there, please include more of the code you are running.

            Really quickly though, your expectations about complete mode are correct. It should act as though you ran the query in batch on all the data that has been seen (although internally it should be computing this incrementally).

            marmbrus Michael Armbrust added a comment - This kind of question would be better asked on the spark-user list, or if you are sure its a bug in a separate JIRA. If you do go there, please include more of the code you are running. Really quickly though, your expectations about complete mode are correct. It should act as though you ran the query in batch on all the data that has been seen (although internally it should be computing this incrementally).
            arnaud.bailly Arnaud Bailly added a comment -

            Thanks a lot, and my apologies for the noise. I will try to define a smaller test case and file a bug.

            arnaud.bailly Arnaud Bailly added a comment - Thanks a lot, and my apologies for the noise. I will try to define a smaller test case and file a bug.

            We've got something in Spark 2.1 that works for streaming ETL from files or kafaka as well as basic evenTime windowed aggregations. To track further progress on the project checkout the Structured Streaming Component

            marmbrus Michael Armbrust added a comment - We've got something in Spark 2.1 that works for streaming ETL from files or kafaka as well as basic evenTime windowed aggregations. To track further progress on the project checkout the Structured Streaming Component

            People

              marmbrus Michael Armbrust
              rxin Reynold Xin
              Votes:
              30 Vote for this issue
              Watchers:
              92 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: