Crunch (Retired) / CRUNCH-485

groupByKey on Spark incorrect if key is Avro record with defined sort order


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.12.0
    • Component/s: Core
    • Labels: None

    Description

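      groupByKey on Spark produces incorrect results if the key type is an Avro record with a defined sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).

      Rather than comparing keys according to the sort order declared in the schema, the Spark implementation serializes the entire Avro record to a binary blob (byte array) and groups records whose blobs are identical. This is wrong: two keys that Avro's ordering treats as equal (for example, records that differ only in a field declared with "order": "ignore") serialize to different blobs and therefore end up in separate groups. By contrast, groupByKey on MapReduce takes Avro's sort order into account and works as expected.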
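
      The difference between the two grouping strategies can be illustrated with a small, self-contained sketch. It does not use Avro or Crunch: the `Key` class, its `note` field (standing in for a field declared with "order": "ignore"), and the `serialize` helper are hypothetical stand-ins chosen only to model the behaviour described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupingSketch {

    // Hypothetical key type. "note" models a field that the sort order ignores.
    static final class Key {
        final String id;
        final String note;

        Key(String id, String note) {
            this.id = id;
            this.note = note;
        }
    }

    // Stand-in for full serialization: every field ends up in the blob.
    static ByteBuffer serialize(Key k) {
        byte[] bytes = (k.id + "\u0000" + k.note).getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.wrap(bytes); // ByteBuffer compares by content
    }

    // Groups values whose keys have identical serialized blobs.
    static Map<ByteBuffer, List<Integer>> groupByBlob(List<Key> keys, List<Integer> values) {
        Map<ByteBuffer, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(serialize(keys.get(i)), b -> new ArrayList<>())
                  .add(values.get(i));
        }
        return groups;
    }

    // Groups values whose keys compare as equal under the declared ordering
    // (here: only "id" takes part in the comparison).
    static Map<Key, List<Integer>> groupBySortOrder(List<Key> keys, List<Integer> values) {
        Map<Key, List<Integer>> groups = new TreeMap<>(Comparator.comparing((Key k) -> k.id));
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(keys.get(i), k -> new ArrayList<>())
                  .add(values.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(new Key("a", "x"), new Key("a", "y"), new Key("b", "x"));
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println("groups by blob:       " + groupByBlob(keys, values).size());
        System.out.println("groups by sort order: " + groupBySortOrder(keys, values).size());
    }
}
```

      In this model the two "a" keys differ only in the ignored field, so the blob-based grouping keeps them apart while the order-aware grouping merges them.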

      The culprit is probably the following code from org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal:

      groupedRDD = parentRDD.map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
                .mapToPair(new MapOutputFunction(keySerde, valueSerde))
                .groupByKey(numPartitions);
      

      where MapOutputFunction simply converts the entire key object to a binary blob, without taking sort order into account.
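
      One possible remedy, sketched here under stated assumptions and not necessarily what the attached patches do, is to keep the serialized key but wrap it in a type whose equals and hashCode only consider the bytes that take part in the sort order. The byte layout (a fixed-length significant prefix followed by an ignored field) and the `BlobKey` class are hypothetical; with real Avro data the comparison would have to be driven by the key schema, for instance via Avro's BinaryData utilities.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class OrderAwareKeySketch {

    // Hypothetical layout: only the first SIGNIFICANT bytes take part in the
    // ordering; everything after them models an ignored field.
    static final int SIGNIFICANT = 4;

    static byte[] significantPart(byte[] blob) {
        return Arrays.copyOf(blob, Math.min(blob.length, SIGNIFICANT));
    }

    // Wrapper around the serialized key with order-aware equality and hashing.
    static final class BlobKey {
        final byte[] bytes;

        BlobKey(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof BlobKey)) {
                return false;
            }
            return Arrays.equals(significantPart(bytes), significantPart(((BlobKey) o).bytes));
        }

        @Override
        public int hashCode() {
            // Must agree with equals, so hash only the significant bytes.
            return Arrays.hashCode(significantPart(bytes));
        }
    }

    // Counts how many distinct groups a hash-based grouping would produce.
    static int countGroups(byte[][] blobs) {
        Map<BlobKey, Integer> counts = new HashMap<>();
        for (byte[] blob : blobs) {
            counts.merge(new BlobKey(blob), 1, Integer::sum);
        }
        return counts.size();
    }

    public static void main(String[] args) {
        byte[][] blobs = {
            {1, 2, 3, 4, 9}, // same significant prefix as the next blob
            {1, 2, 3, 4, 7},
            {5, 6, 7, 8, 9}
        };
        System.out.println("groups: " + countGroups(blobs));
    }
}
```

      The design point is that hash-based grouping relies on equals and hashCode agreeing with each other, so both must be derived from the same order-relevant portion of the key.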

      Attachments

        1. CRUNCH-485.patch
          11 kB
          Josh Wills
        2. CRUNCH-485b.patch
          14 kB
          Josh Wills

          People

            Assignee: Josh Wills (jwills)
            Reporter: Tycho Lamerigts (tychol)
            Votes: 0
            Watchers: 2
