  1. Spark
  2. SPARK-8360 Structured Streaming (aka Streaming DataFrames)
  3. SPARK-14832

Refactor DataSource to ensure schema is inferred only once when creating a file stream

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      When creating a file stream using sqlContext.read.stream(), existing files are scanned twice to infer the schema:

      • Once, when DataFrameReader.stream() creates the DataSource and the StreamingRelation
      • Again, when DataSource.createSource() creates the streaming Source from the DataSource

      Instead, the schema should be inferred only once, when the DataFrame is created. When the streaming source is created later, it should simply reuse that schema.
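
      The proposed behavior can be illustrated with a minimal, standalone sketch. This is not Spark's code: FileCatalog, DataSource, StreamingRelation, FileStreamSource and read_stream below are hypothetical Python stand-ins that only borrow names from this issue, and the "schema" is simply a sorted tuple of column names. The sketch assumes that caching the inferred schema on the DataSource is enough for both the relation and the source to share it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Schema = Tuple[str, ...]


class FileCatalog:
    """Hypothetical file listing; counts how often the files are scanned."""

    def __init__(self, files: Sequence[Sequence[str]]):
        self.files = files      # each "file" is just a list of column names
        self.scan_count = 0     # number of (expensive) schema inference passes

    def infer_schema(self) -> Schema:
        self.scan_count += 1
        return tuple(sorted({col for f in self.files for col in f}))


@dataclass
class FileStreamSource:
    """Hypothetical streaming source that receives an already known schema."""
    catalog: FileCatalog
    schema: Schema


class DataSource:
    """Hypothetical data source that infers its schema at most once."""

    def __init__(self, catalog: FileCatalog, user_schema: Optional[Schema] = None):
        self.catalog = catalog
        self.user_schema = user_schema
        self._source_schema: Optional[Schema] = None  # cache

    def source_schema(self) -> Schema:
        # Infer only on first use; later callers reuse the cached value.
        if self._source_schema is None:
            if self.user_schema is not None:
                self._source_schema = self.user_schema
            else:
                self._source_schema = self.catalog.infer_schema()
        return self._source_schema

    def create_source(self) -> FileStreamSource:
        # Reuses the cached schema instead of scanning the files again.
        return FileStreamSource(self.catalog, self.source_schema())


@dataclass
class StreamingRelation:
    """Hypothetical logical relation backing the streaming DataFrame."""
    data_source: DataSource
    schema: Schema


def read_stream(catalog: FileCatalog) -> StreamingRelation:
    """Stand-in for the reader step: the schema is fixed here, once."""
    data_source = DataSource(catalog)
    return StreamingRelation(data_source, data_source.source_schema())


if __name__ == "__main__":
    catalog = FileCatalog([["id", "value"], ["id", "ts"]])
    relation = read_stream(catalog)                  # first and only scan
    source = relation.data_source.create_source()    # no second scan
    print(catalog.scan_count, source.schema)
```

      In this sketch the scan counter stays at one because create_source() goes through the same cached accessor that the reader step used. A user-supplied schema skips the scan entirely.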

          People

            Assignee: tdas Tathagata Das
            Reporter: tdas Tathagata Das