[SPARK-6292] Add RDD methods to DataFrame to preserve schema - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.3.0
Fix Version/s: None
Component/s: SQL
Labels: None
Target Version/s: 1.5.0
Sprint: Spark 1.5 doc/QA sprint

Description

Users can call RDD methods on a DataFrame, but doing so loses the schema, which then has to be reapplied. For RDD methods that leave the schema unchanged (such as randomSplit), DataFrame should provide versions that preserve the schema automatically.
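To make the request concrete, here is a minimal sketch in plain Python of what a schema-preserving split could look like. It is a toy model and not Spark code: `Frame` and `random_split` are hypothetical names invented for illustration, and the row assignment (one uniform draw per row against cumulative normalized weights) is an assumption about the intended behaviour, not the actual Spark implementation.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class Frame(NamedTuple):
    """Toy stand-in for a DataFrame: column names plus rows of values."""
    schema: Tuple[str, ...]
    rows: List[tuple]


def random_split(frame: Frame, weights: Sequence[float], seed: int) -> List[Frame]:
    """Split rows at random by normalized weights; every part keeps frame.schema."""
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    total = float(sum(weights))
    # Cumulative upper bounds of each bucket after normalizing the weights.
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    parts: List[List[tuple]] = [[] for _ in weights]
    for row in frame.rows:
        u = rng.random()
        # First bucket whose upper bound exceeds u; fall back to the last
        # bucket to guard against floating-point rounding in the final bound.
        target = next((i for i, b in enumerate(bounds) if u < b), len(bounds) - 1)
        parts[target].append(row)
    # The caller never has to reapply the schema: each part carries it along.
    return [Frame(frame.schema, part) for part in parts]


people = Frame(("name", "age"), [("ann", 31), ("bob", 45), ("cy", 27), ("dee", 52)])
train, test = random_split(people, [0.7, 0.3], seed=42)
print(train.schema, test.schema)  # both parts carry the original column names
```

The point of the sketch is the return type: the split hands back the same kind of object it received, so downstream code can keep addressing columns by name.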

Here are a few I'd prioritize (for my use cases!):

- randomSplit
- sampleByKey + sampleByKeyExact
  - Q: Should "key" be a single column, or should we support using a set of columns as a key?
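One way to think about the single-column versus multi-column question is that a single column is just the one-element case of a column set. The sketch below illustrates that in plain Python; it is not Spark code, `sample_by_key` and its parameters are hypothetical names, and it assumes approximate (per-row Bernoulli) sampling without replacement, so it does not cover the exact variant.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Tuple[object, ...]
Key = Tuple[object, ...]


def sample_by_key(
    schema: Sequence[str],
    rows: Sequence[Row],
    key_columns: Sequence[str],
    fractions: Dict[Key, float],
    seed: int,
) -> List[Row]:
    """Keep each row with the probability assigned to its key.

    The key is always a tuple of column values, so a single-column key is
    simply a 1-tuple. Keys missing from `fractions` are sampled at rate 0.
    """
    key_idx = [schema.index(c) for c in key_columns]
    rng = random.Random(seed)
    kept: List[Row] = []
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        if rng.random() < fractions.get(key, 0.0):
            kept.append(row)
    return kept


schema = ("country", "device", "clicks")
rows = [
    ("us", "mobile", 3),
    ("us", "desktop", 5),
    ("de", "mobile", 2),
    ("de", "desktop", 7),
]

# Single-column key: strata are countries.
by_country = sample_by_key(schema, rows, ["country"], {("us",): 1.0, ("de",): 0.0}, seed=1)

# Multi-column key: strata are (country, device) pairs.
by_pair = sample_by_key(schema, rows, ["country", "device"], {("us", "mobile"): 1.0}, seed=1)
```

Under this framing the API only needs one code path; whether a single-column shorthand is worth offering on top is a usability decision rather than a semantic one.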

Issue Links

is related to

SPARK-7156 Add randomSplit method to DataFrame (Resolved)
SPARK-7157 Add approximate stratified sampling to DataFrame (Resolved)

People

Assignee: Joseph K. Bradley
Reporter: Joseph K. Bradley
Votes: 2
Watchers: 4

Dates

Created: 11/Mar/15 23:53
Updated: 27/Apr/15 01:13
Resolved: 27/Apr/15 01:13