Description
I have the following schema in a dataset -
root
 |-- userId: string (nullable = true)
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- startTime: long (nullable = true)
 |    |    |-- endTime: long (nullable = true)
 |-- offset: long (nullable = true)
And I have the following classes (plus setters and getters, which I omitted for simplicity) -
public class MyClass {
    private String userId;
    private Map<String, MyDTO> data;
    private Long offset;
}

public class MyDTO {
    private long startTime;
    private long endTime;
}
I collect the result the following way -
Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
Dataset<MyClass> results = raw_df.as(myClassEncoder);
List<MyClass> lst = results.collectAsList();
I do several calculations to get the result I want, and the result is correct at every step before I collect it.
This is the result for -
results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
data[2017-07-01].startTime | data[2017-07-01].endTime
---------------------------+-------------------------
1498854000                 | 1498870800
This is the result after collecting the results for -
MyClass userData = results.collectAsList().get(0);
MyDTO userDTO = userData.getData().get("2017-07-01");
System.out.println("userDTO startTime: " + userDTO.getStartTime());
System.out.println("userDTO endTime: " + userDTO.getEndTime());
data startTime: 1498870800
data endTime: 1498854000
I tend to believe it is a Spark issue. I would love any suggestions on how to bypass it.
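One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the struct values inside the map are bound to the bean's properties by position while the bean encoder lists its properties alphabetically (endTime before startTime). The plain-Java sketch below is not Spark code. The class FieldBindingSketch and its methods bindByPosition and bindByName are hypothetical names used only to show how positional binding would produce exactly the swap reported above, and how binding by name would avoid it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java illustration only; none of this is Spark's implementation.
class FieldBindingSketch {

    // Binds values to bean properties by position. The bean properties are
    // taken in alphabetical order, which is the ASSUMED encoder behaviour.
    static Map<String, Long> bindByPosition(List<String> structFields, List<Long> values) {
        List<String> beanProperties = new ArrayList<>(new TreeSet<>(structFields));
        Map<String, Long> bean = new LinkedHashMap<>();
        for (int i = 0; i < beanProperties.size(); i++) {
            bean.put(beanProperties.get(i), values.get(i));
        }
        return bean;
    }

    // Binds each value to the property that has the same name as its struct field.
    static Map<String, Long> bindByName(List<String> structFields, List<Long> values) {
        Map<String, Long> bean = new LinkedHashMap<>();
        for (int i = 0; i < structFields.size(); i++) {
            bean.put(structFields.get(i), values.get(i));
        }
        return bean;
    }

    public static void main(String[] args) {
        // Field order and values as shown in the schema and the show() output.
        List<String> structFields = Arrays.asList("startTime", "endTime");
        List<Long> values = Arrays.asList(1498854000L, 1498870800L);
        System.out.println("by position: " + bindByPosition(structFields, values));
        System.out.println("by name:     " + bindByName(structFields, values));
    }
}
```

If this assumption holds, a bypass in Spark might be to collect plain Row objects and read each struct field by name, or to make the struct's field order match the order the bean encoder expects. Neither option has been verified against this report.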
Issue Links
- is cloned by: SPARK-25772 Java encoders - switch fields on collectAsList (Resolved)
- is duplicated by: SPARK-21747 Java encoders - switch fields on collectAsList (Resolved)