[SPARK-14832] Refactor DataSource to ensure schema is inferred only once when creating a file stream - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: Structured Streaming
Labels: None
Target Version/s: 2.0.0

Description

When creating a file stream using sqlContext.read.stream(), existing files are scanned twice to find the schema:

1. Once, when creating a DataSource + StreamingRelation in DataFrameReader.stream().
2. Again, when creating the streaming Source from the DataSource, in DataSource.createSource().

Instead, the schema should be generated only once, at the time the DataFrame is created. When the streaming source is created later, it should reuse that schema rather than scanning the files again.
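
The idea of inferring once and reusing the result can be illustrated with a small, self-contained sketch. This is not Spark code and not necessarily how the linked pull request implements the change: the names FileDataSource, FileStreamSource, source_schema, create_source, and scan_existing_files are hypothetical stand-ins, and the sketch only assumes that schema inference is an expensive scan that can be cached on the data source object.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for a real schema type: a tuple of (name, type) pairs.
Schema = Tuple[Tuple[str, str], ...]


@dataclass
class FileStreamSource:
    """Toy streaming source that is handed a schema instead of inferring one."""
    schema: Schema


@dataclass
class FileDataSource:
    """Toy data source backing a file stream (not Spark's DataSource)."""
    infer_schema: Callable[[], Schema]  # expensive: scans existing files
    _schema: Optional[Schema] = field(default=None, init=False)

    def source_schema(self) -> Schema:
        # Run inference on first use only, then return the cached result.
        if self._schema is None:
            self._schema = self.infer_schema()
        return self._schema

    def create_source(self) -> FileStreamSource:
        # Reuse the schema resolved earlier rather than rescanning files.
        return FileStreamSource(schema=self.source_schema())


scan_count = 0


def scan_existing_files() -> Schema:
    """Pretend file scan that counts how often it is invoked."""
    global scan_count
    scan_count += 1
    return (("value", "string"),)


data_source = FileDataSource(infer_schema=scan_existing_files)
relation_schema = data_source.source_schema()  # at DataFrame creation time
source = data_source.create_source()           # later, when the stream starts
```

In this sketch the file scan runs when the schema is first requested, and the later call to create_source() receives the cached value, which mirrors the behavior the description asks for.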

Issue Links

links to

[Github] Pull Request #12591 (tdas)

People

Assignee: Tathagata Das
Reporter: Tathagata Das
Votes: 0
Watchers: 2

Dates

Created: 22/Apr/16 00:46
Updated: 01/Nov/16 22:15
Resolved: 23/Apr/16 00:18