Description
Users can call RDD methods on a DataFrame, but the result loses the schema, which then has to be reapplied. For RDD methods that preserve the schema (such as randomSplit), DataFrame should provide versions that keep the schema automatically.
Here are a few I'd prioritize (for my use cases!):
- randomSplit
- sampleByKey + sampleByKeyExact
  - Q: Should "key" be a single column, or should we support using a set of columns as a key?
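To make the "key" question concrete, here is a minimal sketch in plain Python. It is not Spark code and not how Spark implements these methods. Rows are modeled as dicts so the column names (the "schema") travel with every row, and the helper names `sample_by_key` and `random_split` are hypothetical stand-ins for the requested methods. The sketch assumes the key is a tuple of column values, which makes a single-column key just the one-element case.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]  # a row keeps its column names, standing in for a schema


def sample_by_key(rows: List[Row], key_cols: Sequence[str],
                  fractions: Dict[Tuple, float], seed: int) -> List[Row]:
    """Approximate stratified sample: keep each row with its stratum's probability.

    The stratum key is the tuple of values in `key_cols`, so one column or
    several columns are handled the same way.
    """
    rng = random.Random(seed)
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # Assumption: strata missing from `fractions` are dropped (fraction 0.0).
        if rng.random() < fractions.get(key, 0.0):
            kept.append(row)
    return kept


def random_split(rows: List[Row], weights: Sequence[float],
                 seed: int) -> List[List[Row]]:
    """Assign each row to exactly one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound for each split
    splits: List[List[Row]] = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point rounding
    return splits
```

With this shape, `sample_by_key(rows, ["country"], {("US",): 0.5}, seed=1)` and `sample_by_key(rows, ["country", "tier"], {("US", "a"): 0.5}, seed=1)` go through the same code path, which suggests that supporting a set of columns need not complicate the single-column case. An exact variant in the spirit of sampleByKeyExact would need extra work to hit the per-stratum sizes precisely and is not sketched here.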
Issue Links
- is related to
  - SPARK-7156 Add randomSplit method to DataFrame (Resolved)
  - SPARK-7157 Add approximate stratified sampling to DataFrame (Resolved)