Spark / SPARK-11657

Bad DataFrame data read from parquet


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.1, 1.5.2
    • Fix Version/s: 1.5.3, 1.6.0
    • Component/s: Spark Core, SQL
    • Labels: None
    • Environment: EMR (yarn)

    Description

      I get strange behaviour when reading parquet data:

      scala> val data = sqlContext.read.parquet("hdfs:///sample")
      data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, clusterData: array<string>, dpid: int]
      scala> data.take(1)    /// this returns garbage
      res0: Array[org.apache.spark.sql.Row] = Array([1,56169A947F000101????????,WrappedArray(164594606101815510825479776971????????),813]) 
      scala> data.collect()    /// this works
      res1: Array[org.apache.spark.sql.Row] = Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
      

      I've attached the "hdfs:///sample" directory to this bug report.
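
      The symptom (take() returning mangled bytes while collect() is clean) is the classic signature of a reader recycling a mutable row buffer while downstream code still holds references into it. Whether that is the actual mechanism here is an assumption, but the pitfall itself can be sketched in plain Scala; the object and its toy rows function below are hypothetical illustrations, not Spark internals:

```scala
// Sketch of the buffer-reuse pitfall (hypothetical, not Spark code):
// a reader hands out the SAME mutable array for every "row", so callers
// that keep references see only the last row's bytes, while callers that
// copy each row before advancing see correct data.
object BufferReusePitfall {
  // Produces n rows ("row0", "row1", ...), each written into one shared
  // char buffer, the way a columnar reader might recycle row storage.
  private def rows(n: Int): Iterator[Array[Char]] = {
    val buf = new Array[Char](4) // holds "rowN" for single-digit N
    Iterator.tabulate(n) { i =>
      val s = s"row$i"
      s.getChars(0, s.length, buf, 0)
      buf // same array every time; later rows overwrite earlier ones
    }
  }

  // Unsafe: keep references to the shared buffer, decode afterwards.
  def aliased(n: Int): List[String] =
    rows(n).toArray.map(new String(_)).toList

  // Safe: copy each row out before the next one overwrites the buffer,
  // which is what a correct materialization (collect-style) must do.
  def copied(n: Int): List[String] =
    rows(n).map(b => new String(b.clone())).toList

  def main(args: Array[String]): Unit = {
    println(aliased(3).mkString(",")) // corrupted view: last row repeated
    println(copied(3).mkString(","))  // correct: row0,row1,row2
  }
}
```

      With n = 3, aliased returns the last row three times while copied returns the three distinct rows; if take() shares such a buffer down a path that collect() copies, it would produce exactly the kind of divergence shown in the transcript above.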

      Attachments

        1. sample.tgz
          1.0 kB
          Virgil Palanciuc


            People

              Assignee: Davies Liu (davies)
              Reporter: Virgil Palanciuc (virgilp)
              Votes: 0
              Watchers: 4
