Details
Description
I ran into a problem with the limit method of the DataFrame API.
I try to get the first 999 records from an Avro source which contains about 3.5K records.
// Load the Avro data set and keep only the first 999 rows.
DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
df = df.limit(999);
Then, after the save operation, the rows come out in a different order than in the input data set. Sometimes the order is correct, but usually it is not.
// Write the limited DataFrame back out as Avro.
df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
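For reference, a minimal self-contained sketch of the job (the class name, SparkConf setup and output path are placeholders added here, not from the original job; the input path is the one shown in the plan below):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class LimitOrderRepro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("limit-order-repro");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Input path taken from the plan below; output path is a placeholder.
        String inputSource = "hdfs://<server_name>:8020/user/hdfs/dataset.avro";
        String filepathToSave = "hdfs://<server_name>:8020/user/hdfs/dataset_limit999.avro";

        // Load ~3.5K Avro records, keep the first 999, write them back as Avro.
        DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
        df = df.limit(999);
        df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);

        sc.stop();
    }
}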
Here is the Spark plan (maybe it can help to figure out the cause of the issue):
== Parsed Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Analyzed Logical Plan ==
mobileNumber: bigint, tariff: string, debit: float
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Optimized Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Physical Plan ==
Limit 999
 Scan AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)[mobileNumber#0L,tariff#1,debit#2]

Code Generation: true
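A hypothetical way to observe the mismatch, continuing inside the main method of the sketch above (this check is not part of the original job; it assumes collect() returns rows in scan order for a plain file read, and compares only the mobileNumber column):

// Read both data sets back and compare the first 999 rows positionally.
// (needs import org.apache.spark.sql.Row)
Row[] inputRows = sqlContext.load(inputSource, "com.databricks.spark.avro").collect();
Row[] savedRows = sqlContext.load(filepathToSave, "com.databricks.spark.avro").collect();

for (int i = 0; i < savedRows.length; i++) {
    // mobileNumber is a bigint, so it is read with getLong(0).
    if (savedRows[i].getLong(0) != inputRows[i].getLong(0)) {
        System.out.println("First out-of-order row at position " + i);
        break;
    }
}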