[CRUNCH-485] groupByKey on Spark incorrect if key is Avro record with defined sort order - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.11.0
Fix Version/s: 0.12.0
Component/s: Core
Labels: None

Description

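groupByKey on Spark produces incorrect groups if the key type is an Avro record with a defined sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).

Instead of comparing keys according to that sort order, the Spark implementation serializes the entire Avro record to a binary blob (byte array) and groups identical blobs. This is wrong: two keys that compare as equal under the schema's sort order, for example because they differ only in a field marked "order": "ignore", serialize to different bytes and end up in separate groups. By contrast, groupByKey on MapReduce works as expected, so it evidently takes Avro's sort order into account.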

The culprit is probably the following code from org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal:

groupedRDD = parentRDD.map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
          .mapToPair(new MapOutputFunction(keySerde, valueSerde))
          .groupByKey(numPartitions);

where MapOutputFunction simply converts the entire key object to a binary blob, without taking sort order into account.
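
The difference between the two grouping strategies can be illustrated with a small standalone sketch. It uses neither Crunch nor Avro: the Key class, the toBlob serialization, and both grouping helpers are hypothetical stand-ins, with `note` playing the role of a field that the sort order ignores. It is meant only to show why grouping by identical bytes and grouping by sort order can disagree, not how the fix is implemented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingSketch {

    /** Hypothetical key: `id` is ordered, `note` stands in for a field with "order": "ignore". */
    static final class Key {
        final String id;
        final String note;

        Key(String id, String note) {
            this.id = id;
            this.note = note;
        }
    }

    /** Stand-in for full binary serialization: every field, ignored or not, ends up in the bytes. */
    static ByteBuffer toBlob(Key k) {
        return ByteBuffer.wrap((k.id + "\u0000" + k.note).getBytes(StandardCharsets.UTF_8));
    }

    /** Groups values whose keys serialize to identical bytes (the behaviour described for Spark). */
    static Map<ByteBuffer, List<Integer>> groupByBlob(List<Key> keys, List<Integer> values) {
        Map<ByteBuffer, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(toBlob(keys.get(i)), b -> new ArrayList<>()).add(values.get(i));
        }
        return groups;
    }

    /** Groups values whose keys compare as equal under the sort order (only `id` is compared). */
    static Map<Key, List<Integer>> groupBySortOrder(List<Key> keys, List<Integer> values) {
        Comparator<Key> order = Comparator.comparing((Key k) -> k.id);
        Map<Key, List<Integer>> groups = new TreeMap<>(order);
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(keys.get(i), k -> new ArrayList<>()).add(values.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(new Key("a", "x"), new Key("a", "y"), new Key("b", "x"));
        List<Integer> values = Arrays.asList(1, 2, 3);

        // The two "a" keys differ only in the ignored field, so they split by blob
        // but share a group under the sort order.
        System.out.println(groupByBlob(keys, values).size());      // prints 3
        System.out.println(groupBySortOrder(keys, values).size()); // prints 2
    }
}
```

In this sketch the blob-based grouping yields three groups and the order-based grouping yields two, which mirrors the Spark versus MapReduce discrepancy described above.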

Attachments

CRUNCH-485.patch, 08/Jan/15 02:54, 11 kB, Josh Wills
CRUNCH-485b.patch, 08/Jan/15 20:53, 14 kB, Josh Wills

People

Assignee: Josh Wills
Reporter: Tycho Lamerigts
Votes: 0
Watchers: 2

Dates

Created: 06/Jan/15 14:19
Updated: 18/May/15 19:04
Resolved: 09/Jan/15 21:23