Details
Description
I ran into a problem with the limit method of the DataFrame API.
I try to get the first 999 records from an Avro source which contains about 3.5K records.
// Load the Avro data set and keep only the first 999 rows.
DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
df = df.limit(999);
Then, after the save operation, the rows come out in a different order than in the input data set. Sometimes the order is correct, but usually it is not.
// Write the limited DataFrame back out as Avro.
df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
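For reference, a minimal self-contained sketch of the job (the class name, SparkConf setup and output path are placeholders added here, not from the original job; the input path is the one shown in the plan below):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class LimitOrderRepro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("limit-order-repro");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Input path taken from the plan below; output path is a placeholder.
        String inputSource = "hdfs://<server_name>:8020/user/hdfs/dataset.avro";
        String filepathToSave = "hdfs://<server_name>:8020/user/hdfs/dataset_limit999.avro";

        // Load ~3.5K Avro records, keep the first 999, write them back as Avro.
        DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
        df = df.limit(999);
        df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);

        sc.stop();
    }
}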
Here is the Spark plan (maybe it can help to figure out the cause of the issue):
== Parsed Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Analyzed Logical Plan ==
mobileNumber: bigint, tariff: string, debit: float
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Optimized Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Physical Plan ==
Limit 999
 Scan AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)[mobileNumber#0L,tariff#1,debit#2]

Code Generation: true
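A hypothetical way to observe the mismatch, continuing inside the main method of the sketch above (this check is not part of the original job; it assumes collect() returns rows in scan order for a plain file read, and compares only the mobileNumber column):

// Read both data sets back and compare the first 999 rows positionally.
// (needs import org.apache.spark.sql.Row)
Row[] inputRows = sqlContext.load(inputSource, "com.databricks.spark.avro").collect();
Row[] savedRows = sqlContext.load(filepathToSave, "com.databricks.spark.avro").collect();

for (int i = 0; i < savedRows.length; i++) {
    // mobileNumber is a bigint, so it is read with getLong(0).
    if (savedRows[i].getLong(0) != inputRows[i].getLong(0)) {
        System.out.println("First out-of-order row at position " + i);
        break;
    }
}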