Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 2.3.0
- None
Description
Right now, the Dataset API only offers two possibilities for explicitly repartitioning a dataset:
- round robin partitioning, via def repartition(numPartitions: Int)
- hash partitioning, via def repartition(numPartitions: Int, partitionExprs: Column*)
It would be useful to also expose range partitioning. Because range partitioning places rows with nearby key values in the same partition, it can, for example, improve compression when writing data out to disk, and it may enable new use cases.
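To make the difference between the three strategies concrete, here is a minimal sketch in plain Scala, with no Spark dependency. It is not Spark's implementation: the object PartitioningSketch, its methods, and the fixed bounds argument are hypothetical names introduced only for illustration, and the range boundaries are assumed to be given up front rather than derived from the data.

```scala
// Illustrative sketch only; all names here are hypothetical, not Spark APIs.
object PartitioningSketch {

  // Round robin: the i-th record goes to partition i mod n, ignoring its value.
  def roundRobin(keys: Seq[Int], numPartitions: Int): Map[Int, Seq[Int]] =
    keys.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .map { case (p, pairs) => p -> pairs.map(_._1) }

  // Hash: the partition is derived from the key's hash, so equal keys
  // always land together, but neighbouring keys usually do not.
  def hash(keys: Seq[Int], numPartitions: Int): Map[Int, Seq[Int]] =
    keys.groupBy(k => Math.floorMod(k.hashCode, numPartitions))

  // Range: the partition is the first sorted upper bound the key fits under;
  // keys above every bound go to one extra, final partition.
  // Assumption: `bounds` is sorted ascending and supplied by the caller.
  def range(keys: Seq[Int], bounds: Seq[Int]): Map[Int, Seq[Int]] =
    keys.groupBy { k =>
      val idx = bounds.indexWhere(k <= _)
      if (idx == -1) bounds.length else idx
    }
}
```

With keys Seq(5, 1, 9, 3, 7, 2) and bounds Seq(3, 6), the range variant groups 1, 3, 2 together, puts 5 alone, and groups 9, 7 together, whereas the round robin and hash variants mix small and large keys in every partition. That clustering of similar values is what makes the compression benefit plausible.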
Issue Links
- is related to SPARK-22624 Expose range partitioning shuffle introduced by SPARK-22614 (Resolved)