  1. Spark
  2. SPARK-3621

Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.0.0, 1.1.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      In some cases, such as Hive's map-side join, it would be beneficial to allow a client program to broadcast RDDs rather than just variables built from those RDDs. Broadcasting a variable built from RDDs requires that all RDD data be collected to the driver and that the variable be shipped to the cluster after it is built. It would be more efficient if the driver just broadcast the RDDs and jobs used the corresponding data directly (for example, by building hash maps at the executors).
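
      The collect-then-broadcast pattern described above can be sketched in plain Python, without Spark itself. This is only an illustration under stated assumptions: nested lists stand in for RDD partitions, and the helper names `build_hash_map` and `map_side_join` are hypothetical, not Spark or Hive APIs.

```python
from collections import defaultdict


def build_hash_map(small_rows):
    """Build a key -> list-of-values lookup from the small side of the join."""
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)
    return dict(table)


def map_side_join(partition, lookup):
    """Join one partition of the large side against the lookup, with no shuffle."""
    for key, left_value in partition:
        for right_value in lookup.get(key, []):
            yield key, left_value, right_value


# Hypothetical data: each inner list stands in for one RDD partition.
small_partitions = [[(1, "a")], [(2, "b"), (2, "c")]]
large_partitions = [[(1, "x"), (3, "y")], [(2, "z")]]

# Current approach: gather every small-side partition in one place (the driver),
# build the lookup there, then ship the finished lookup to every task.
collected = [row for part in small_partitions for row in part]
lookup = build_hash_map(collected)

# Each large-side partition probes the shipped lookup locally.
joined = [row for part in large_partitions for row in map_side_join(part, lookup)]
print(joined)
```

      In Spark terms, the gather step corresponds to collecting the RDD on the driver and the shipping step to `SparkContext.broadcast`. The request in this issue is to skip the gather step, so that executors receive the small RDD's data and build the lookup themselves.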

      Tez has a broadcast edge that can ship data from a previous stage to the next stage without requiring driver-side processing.

            People

              Assignee: Unassigned
              Reporter: Xuefu Zhang (xuefuz)
              Votes: 0
              Watchers: 5
