Description
groupByKey on Spark groups records incorrectly if the key type is an Avro record with a defined sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).
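Instead of honoring that sort order, the Spark implementation serializes the entire Avro record to a binary blob (byte array) and groups identical blobs. Two keys that are equal under the schema's sort order, for example because they differ only in a field marked "order": "ignore", therefore end up in separate groups. By contrast, groupByKey on MapReduce works as expected, so it evidently does take Avro's sort order into account.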
The culprit is probably the following code from org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal:
    groupedRDD = parentRDD
        .map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
        .mapToPair(new MapOutputFunction(keySerde, valueSerde))
        .groupByKey(numPartitions);
where MapOutputFunction simply converts the entire key object to a binary blob, without taking sort order into account.
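To make the difference concrete, here is a minimal sketch in plain Java with no Crunch, Spark, or Avro dependency. Everything in it is a hypothetical stand-in: `Key` plays the role of an Avro record whose `note` field is marked "order": "ignore", `serialize` mimics converting the whole key to a byte array, and `SORT_ORDER` mimics the comparison Avro would derive from the schema. It is not the Crunch code path, only an illustration of why grouping on blobs and grouping on sort order can disagree.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class GroupByBlobDemo {

    // Hypothetical stand-in for an Avro record key.
    // "id" takes part in ordering; "note" is assumed to be marked "order": "ignore".
    public static final class Key {
        public final String id;
        public final String note;

        public Key(String id, String note) {
            this.id = id;
            this.note = note;
        }
    }

    // Assumed serialization: every field ends up in the blob, ignored or not.
    public static byte[] serialize(Key k) {
        return (k.id + "\u0000" + k.note).getBytes(StandardCharsets.UTF_8);
    }

    // Assumed sort order: only "id" is compared, "note" is ignored.
    public static final Comparator<Key> SORT_ORDER =
        Comparator.comparing((Key k) -> k.id);

    // Grouping the way the report describes the Spark path: identical blobs form a group.
    public static int countGroupsByBlob(List<Key> keys) {
        Set<ByteBuffer> blobs = new HashSet<>();
        for (Key k : keys) {
            // ByteBuffer gives content-based equals/hashCode for the byte array.
            blobs.add(ByteBuffer.wrap(serialize(k)));
        }
        return blobs.size();
    }

    // Grouping by sort order: keys that compare equal form a group.
    public static int countGroupsBySortOrder(List<Key> keys) {
        Set<Key> distinct = new TreeSet<>(SORT_ORDER);
        distinct.addAll(keys);
        return distinct.size();
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key("a", "x"),
            new Key("a", "y"),   // equal to the first key under SORT_ORDER
            new Key("b", "x"));

        System.out.println(countGroupsByBlob(keys));       // prints 3
        System.out.println(countGroupsBySortOrder(keys));  // prints 2
    }
}
```

In this sketch the first two keys differ only in the ignored field, so blob-based grouping yields three groups where the sort order implies two.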