Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-17325

AQE should use available column statistics from completed query stages

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Invalid
    • None
    • None
    • Rust - Ballista
    • None

    Description

      In QueryStageExec.computeStats we copy partial statistics from materlized query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls ShuffleExchangeLike#runtimeStatistics or BroadcastExchangeLike#runtimeStatistics.

      Only dataSize and numOutputRows are copied into the new Statistics object:

       

        def computeStats(): Option[Statistics] = if (isMaterialized) {
          val runtimeStats = getRuntimeStatistics
          val dataSize = runtimeStats.sizeInBytes.max(0)
          val numOutputRows = runtimeStats.rowCount.map(_.max(0))
          Some(Statistics(dataSize, numOutputRows, isRuntime = true))
        } else {
          None
        }
      

      I would like to also copy over the column statistics stored in Statistics.attributeMap so that they can be fed back into the logical plan optimization phase. This is a small change as shown below:

        def computeStats(): Option[Statistics] = if (isMaterialized) {
          val runtimeStats = getRuntimeStatistics
          val dataSize = runtimeStats.sizeInBytes.max(0)
          val numOutputRows = runtimeStats.rowCount.map(_.max(0))
          val attributeStats = runtimeStats.attributeStats
          Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
        } else {
          None
        }
      

      The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do not currently provide such column statistics, but other custom implementations can.

      Attachments

        Activity

          People

            Unassigned Unassigned
            andygrove Andy Grove
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: