[SPARK-8360] Structured Streaming (aka Streaming DataFrames) - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 2.1.0
Component/s: Structured Streaming
Labels: None

Description

Umbrella ticket to track what's needed to make streaming DataFrame a reality.

Attachments

StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf
14/Mar/16 22:00
404 kB
Reynold Xin

Issue Links

incorporates

SPARK-16350 Complete output mode does not output updated aggregated value in Structured Streaming

Resolved

is duplicated by

SPARK-1363 Add streaming support for Spark SQL module

Resolved

relates to

SPARK-9999 Dataset API on top of Catalyst/DataFrame

Resolved

links to

Structured Streaming Programming Abstraction, Semantics, and APIs - Google Docs version

Sub-Tasks

1.	API design: convergence of batch and streaming DataFrame	Resolved	Reynold Xin
2.	Initial infrastructure	Resolved	Michael Armbrust
3.	API design: external state management	Closed	Unassigned
4.	API for managing streaming dataframes	Resolved	Tathagata Das
5.	Add FileStreamSource	Resolved	Shixiong Zhu
6.	Remove DataStreamReader/Writer	Resolved	Reynold Xin
7.	Rename DataFrameWriter.stream to DataFrameWriter.startStream	Resolved	Reynold Xin
8.	State Store: A new framework for state management for computing Streaming Aggregates	Resolved	Tathagata Das
9.	Old streaming DataFrame proposal by Cheng Hao (Intel)	Closed	Cheng Hao
10.	WAL for deterministic batches with IDs	Resolved	Michael Armbrust
11.	Simple FileSink for Parquet	Resolved	Michael Armbrust
12.	Windowing for structured streaming	Resolved	Burak Yavuz
13.	Add processing time trigger	Resolved	Shixiong Zhu
14.	Streaming Aggregation	Resolved	Michael Armbrust
15.	Method to determine if Dataset is bounded or not	Resolved	Burak Yavuz
16.	Memory Sink	Resolved	Michael Armbrust
17.	Define analysis rules for operations not supported in streaming	Resolved	Tathagata Das
18.	Python API for methods introduced for Structured Streaming	Resolved	Burak Yavuz
19.	Add partitioned parquet support file stream sink	Resolved	Tathagata Das
20.	Refactor DataSource to ensure schema is inferred only once when creating a file stream	Resolved	Tathagata Das
21.	Refactor StreamTests to test for source fault-tolerance correctly.	Resolved	Tathagata Das
22.	Add support in file stream source for reading new files added to subdirs	Resolved	Tathagata Das
23.	Add support for batch jobs correctly inferring partitions from data written with file stream sink	Resolved	Tathagata Das
24.	Disable support for multiple streaming aggregations	Resolved	Tathagata Das
25.	Disable schema inference for streaming datasets on file streams	Resolved	Tathagata Das
26.	Add support for complete output mode	Resolved	Tathagata Das
27.	Make continuous Parquet writes consistent with non-continuous Parquet writes	Closed	Unassigned
28.	Allow sorting on aggregated streaming dataframe when the output mode is Complete	Resolved	Tathagata Das
29.	Add support for socket stream.	Closed	Prashant Sharma
30.	Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery	Resolved	Shixiong Zhu
31.	Add a unique id to ContinuousQuery	Resolved	Tathagata Das
32.	Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer	Resolved	Tathagata Das
33.	Renamed ContinuousQuery to StreamingQuery for simplicity	Resolved	Tathagata Das
34.	Fix bug in python DataStreamReader	Resolved	Tathagata Das
35.	Properly explain the streaming queries	Resolved	Shixiong Zhu
36.	Fix complete mode aggregation with console sink	Resolved	Shixiong Zhu
37.	Sleep when no new data arrives to avoid 100% CPU usage	Resolved	Shixiong Zhu
38.	Enable test for sql/streaming.py and fix these tests	Resolved	Shixiong Zhu
39.	HDFSMetadataLog.get leaks the input stream	Resolved	Shixiong Zhu
40.	Add ContinuousQueryInfo to make ContinuousQueryListener events serializable	Resolved	Shixiong Zhu
41.	Add network word count example	Resolved	James Thomas
42.	StreamExecution.awaitOffset may take too long because of thread starvation	Resolved	Shixiong Zhu
43.	Fix flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite "event ordering"	Resolved	Shixiong Zhu
44.	Add a file sink log to support versioning and compaction	Resolved	Shixiong Zhu
45.	Fix a race condition in StreamExecution.processAllAvailable	Resolved	Shixiong Zhu
46.	Fix the race conditions in MemoryStream and MemorySink	Resolved	Shixiong Zhu
47.	Move FileSource offset log into checkpointLocation	Resolved	Shixiong Zhu
48.	Add a note to warn that onQueryProgress is asynchronous	Resolved	Shixiong Zhu
49.	QueryProgress should be posted after committedOffsets is updated	Resolved	Shixiong Zhu
50.	StateStoreCoordinator should extend ThreadSafeRpcEndpoint	Resolved	Shixiong Zhu
51.	Allow multiple continuous queries to be started from the same DataFrame	Resolved	Shixiong Zhu
52.	Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite	Resolved	Shixiong Zhu
53.	Add MetadataLog and HDFSMetadataLog	Resolved	Shixiong Zhu
54.	ContinuousQueryManagerSuite floods the logs with garbage	Resolved	Shixiong Zhu
55.	Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering	Resolved	Shixiong Zhu
56.	Add ConsoleSink for structured streaming to display the dataframe on the fly	Resolved	Saisai Shao
57.	Flaky Test: Complete aggregation with Console sink	Resolved	Shixiong Zhu
58.	ConsoleSink should not require checkpointLocation	Resolved	Shixiong Zhu
59.	Add Structured Streaming Programming Guide	Resolved	Tathagata Das
60.	Move python DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming package	Resolved	Tathagata Das
61.	Add an option in file stream source to read 1 file at a time	Resolved	Tathagata Das
62.	Fix StreamingQueryListener to return message and stacktrace of actual exception	Resolved	Tathagata Das
63.	Running a file stream on a directory with partitioned subdirs throws NotSerializableException/StackOverflowError	Resolved	Tathagata Das
64.	Metrics for Structured Streaming	Resolved	Tathagata Das
65.	Add methods to convert StreamingQueryStatus to json	Resolved	Tathagata Das
66.	History Server is broken because of the refactoring work in Structured Streaming	Resolved	Shixiong Zhu
67.	ForeachSink should fail the Spark job if `process` throws exception	Resolved	Shixiong Zhu
68.	State Store leaks temporary files	Resolved	Tathagata Das
69.	Fix FileStreamSink with aggregation + watermark + append mode	Resolved	Tathagata Das
70.	Rename triggerId to batchId in StreamingQueryStatus.triggerDetails	Resolved	Tathagata Das
71.	Include triggerDetails in StreamingQueryStatus.json	Resolved	Tathagata Das
72.	Improve docs on StreamingQueryListener and StreamingQuery.status	Resolved	Tathagata Das
73.	Add StreamingQuery.status in python	Closed	Tathagata Das
74.	Enable interrupts for HDFS in HDFSMetadataLog	Resolved	Shixiong Zhu

Activity

Joseph Batchik added a comment - 31/Jul/15 19:50

Would streaming DataFrames replace streaming RDDs or coexist with them?

Adrian Wang added a comment - 26/Aug/15 02:29

https://github.com/intel-bigdata/spark-streamingsql
Our streaming sql project is highly related to this jira ticket.

Xiao Li added a comment - 20/Nov/15 06:11

Will streaming DataFrames be built on the Dataset API (https://issues.apache.org/jira/browse/SPARK-9999)?

Cheng Hao added a comment - 02/Dec/15 06:05 - edited

Removed the Google Docs link, as I cannot make it accessible to everyone when using the corp account. In the meantime, I have attached a PDF doc; hopefully it is helpful.

Xiao Li added a comment - 02/Dec/15 06:50

"You need permission to access this published document." I got this message when accessing it. Could you make it publicly available?

Thank you!

Cheng Hao added a comment - 02/Dec/15 12:13

This is a proposal for streaming DataFrames that we were working on; hopefully it is helpful for the new design.

Ou Rui added a comment - 08/Jan/16 10:35

Which version will this feature be released in? I'd like to help test this feature in production.

Linbo added a comment - 23/Feb/16 06:27

Spark 2.0, in April/May.

Praveen Devarao added a comment - 08/Mar/16 16:27

Hi tdas, marmbrus, rxin

Any docs on how to consume the new APIs (if they are available)? Or any pointers in the current code which I can go through to play around with the new feature.

I see some test cases related to using the streams on read() but don't find any pointers to which class can be used in .format(). The test suite is working out of the DefaultSource class defined within the DataFrameReaderWriterSuite, but I suppose there would be something consumable in the source.

Thanks

Praveen

Tathagata Das added a comment - 11/Mar/16 03:22

This is still highly WIP, and not ready for even experimental consumption. So please sit tight until we have something ready by Spark 2.0 release.

Praveen Devarao added a comment - 14/Mar/16 16:37

Hi tdas

Thanks for the update. I would like to join hands with you guys in contributing to the dev efforts of structured streaming. Could you let me know how I can be of help?

We (@ IBM) are looking at contributing a Complex Event Processing (CEP) feature to Spark Streaming and have done some initial work on supporting the same in Spark 1.6. Now that we have learned of structured streaming, we want to ensure we get the CEP feature enabled on Spark 2.0 (as it makes more sense). For this we would like to be part of the structured streaming efforts so that we get to understand it better and our contributions will be in line with the design.

Let me know if you will need more information. We would be happy to have a call too for discussion on CEP (if you think that is better to start with).

Thanks

Praveen

Reynold Xin added a comment - 14/Mar/16 22:00

design doc - draft 1 - PDF version

Reynold Xin added a comment - 14/Mar/16 22:10

I've uploaded the first major design doc for this task, covering API and semantics. This is not set in stone, and we'd love to get some feedback and iterate on the model as well as the explanation of it. The best way is to comment on it directly in Google Docs.

Reynold Xin added a comment - 14/Mar/16 22:28

Note that there was an old streaming dataframe doc that was fairly incomplete. I moved it to SPARK-13875.

Arnaud Bailly added a comment - 01/Jul/16 13:20

I have a question regarding the semantics of the "complete" output mode but I am not sure this is the right place to ask.
Given some aggregation query, I would expect a "complete" streaming request to result in a total aggregation over all values in the stream, past and new, but running a simple experiment with the latest code at HEAD shows this is not the case: the streaming query returns the result of running the query on new data only. My query looks something like:

select key, sum(value) from table1 t1, stream2 t2 where t1.pk = t2.pk group by key;

with table1 a non-streaming DataFrame and stream2 a streaming DataFrame.

Am I missing/misunderstanding something?

Arnaud Bailly added a comment - 01/Jul/16 14:26

Of course, this is not a big deal when computing sum but fails when computing something like avg(value).
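
To make the sum-versus-avg point concrete, here is a minimal sketch in plain Python. It is not Spark code, and the batch contents are made-up illustration data. It shows that per-batch sums can be added up afterwards to recover the total, whereas per-batch averages cannot be merged without also knowing the row counts.

```python
# Hypothetical micro-batches of values for a single key (illustration data only).
batches = [[1.0, 2.0, 3.0], [10.0]]

all_values = [v for batch in batches for v in batch]

# Sums are decomposable: adding per-batch sums gives the sum over all data.
per_batch_sums = [sum(b) for b in batches]
total_sum = sum(all_values)

# Averages are not: averaging the per-batch averages weights each batch
# equally, regardless of how many rows it contained.
per_batch_avgs = [sum(b) / len(b) for b in batches]
avg_of_avgs = sum(per_batch_avgs) / len(per_batch_avgs)
true_avg = total_sum / len(all_values)

print(sum(per_batch_sums) == total_sum)  # sums agree
print(avg_of_avgs, true_avg)             # averages differ
```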

Michael Armbrust added a comment - 01/Jul/16 19:32

This kind of question would be better asked on the spark-user list or, if you are sure it's a bug, in a separate JIRA. If you do go there, please include more of the code you are running.

Really quickly though, your expectations about complete mode are correct. It should act as though you ran the query in batch on all the data that has been seen (although internally it should be computing this incrementally).
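
The expectation described above can be sketched in plain Python. This is only an illustration of the semantics, not how Spark implements it: the `CompleteModeAvg` class and `batch_avg` function are hypothetical names, and the rows are made-up data. The aggregator keeps a running sum and count per key between micro-batches, so each call can emit the full result table without rescanning old data, and that table should match a batch run over everything seen so far.

```python
from collections import defaultdict


class CompleteModeAvg:
    """Hypothetical helper: running avg(value) per key across micro-batches."""

    def __init__(self):
        # State kept between batches: key -> [running sum, running count].
        self._state = defaultdict(lambda: [0.0, 0])

    def process_batch(self, rows):
        """Fold one batch of (key, value) rows into the state and return
        the full result table, as a complete-mode query would emit it."""
        for key, value in rows:
            entry = self._state[key]
            entry[0] += value
            entry[1] += 1
        return {key: s / n for key, (s, n) in self._state.items()}


def batch_avg(rows):
    """Reference: the same aggregation run as one batch over all rows."""
    sums = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        sums[key][0] += value
        sums[key][1] += 1
    return {key: s / n for key, (s, n) in sums.items()}


# Made-up input: two micro-batches arriving one after the other.
batch1 = [("a", 1.0), ("b", 4.0)]
batch2 = [("a", 3.0)]

agg = CompleteModeAvg()
agg.process_batch(batch1)
result = agg.process_batch(batch2)

# The incremental result equals the batch result over all data seen.
print(result == batch_avg(batch1 + batch2))
```

Keeping the sum and the count as separate pieces of state, rather than the average itself, is what lets the result be updated one batch at a time.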

Arnaud Bailly added a comment - 01/Jul/16 19:44

Thanks a lot, and my apologies for the noise. I will try to define a smaller test case and file a bug.

Michael Armbrust added a comment - 01/Nov/16 23:44

We've got something in Spark 2.1 that works for streaming ETL from files or Kafka as well as basic event-time windowed aggregations. To track further progress on the project, check out the Structured Streaming component.
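
As a rough illustration of what an event-time windowed aggregation computes, here is a minimal plain-Python sketch. It does not use the Spark API and ignores concerns such as late-data limits and state cleanup; the `tumbling_window_counts` function and the sample events are hypothetical. Each event is assigned to a fixed-size window based on the timestamp it carries, not on when it arrives.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key), using each event's own timestamp.

    `events` is an iterable of (event_time_seconds, key) pairs. Arrival
    order does not matter because the window is derived from event time.
    """
    counts = Counter()
    for event_time, key in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Made-up events; the one stamped 5 arrives after an event stamped 12,
# but it is still counted in the window starting at 0.
events = [(1, "x"), (12, "x"), (5, "x"), (14, "y")]
counts = tumbling_window_counts(events, window_seconds=10)
print(counts)
```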

People

Assignee: Michael Armbrust

Reporter: Reynold Xin

Votes: 30

Watchers: 92

Dates

Created: 14/Jun/15 07:26

Updated: 01/Nov/16 23:44

Resolved: 01/Nov/16 23:44