Details
Type: Bug
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.8.0
Fix Version/s: None
Description
When I try to read an ORC file using flink-orc, a NullPointerException is thrown.
I think this issue could be related to the closed issue https://issues.apache.org/jira/browse/FLINK-8230
This happens when trying to read the string fields in a nested struct. This is my schema:
"struct<" +
  "operation:int," +
  "originalTransaction:bigInt," +
  "bucket:int," +
  "rowId:bigInt," +
  "currentTransaction:bigInt," +
  "row:struct<" +
    "id:int," +
    "headline:string," +
    "user_id:int," +
    "company_id:int," +
    "created_at:timestamp," +
    "updated_at:timestamp," +
    "link:string," +
    "is_html:tinyint," +
    "source:string," +
    "company_feed_id:int," +
    "editable:tinyint," +
    "body_clean:string," +
    "activitystream_activity_id:bigint," +
    "uniqueness_checksum:string," +
    "rating:string," +
    "review_id:int," +
    "soft_deleted:tinyint," +
    "type:string," +
    "metadata:string," +
    "url:string," +
    "imagecache_uuid:string," +
    "video_id:int" +
  ">>",
[error] Caused by: java.lang.NullPointerException
[error]   at java.lang.String.checkBounds(String.java:384)
[error]   at java.lang.String.<init>(String.java:462)
[error]   at org.apache.flink.orc.OrcBatchReader.readString(OrcBatchReader.java:1216)
[error]   at org.apache.flink.orc.OrcBatchReader.readNonNullBytesColumnAsString(OrcBatchReader.java:328)
[error]   at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:215)
[error]   at org.apache.flink.orc.OrcBatchReader.readNonNullStructColumn(OrcBatchReader.java:453)
[error]   at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:250)
[error]   at org.apache.flink.orc.OrcBatchReader.fillRows(OrcBatchReader.java:143)
[error]   at org.apache.flink.orc.OrcRowInputFormat.ensureBatch(OrcRowInputFormat.java:333)
[error]   at org.apache.flink.orc.OrcRowInputFormat.reachedEnd(OrcRowInputFormat.java:313)
[error]   at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:190)
[error]   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
[error]   at java.lang.Thread.run(Thread.java:748)
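The top two frames of the trace above (String.checkBounds called from a String constructor) are consistent with a null byte array being handed to new String(...). The sketch below is not Flink code and does not claim to show what OrcBatchReader does internally; it only reproduces that JDK behavior in isolation. The class name NullBytesToString and the helpers decode and throwsNpe are hypothetical names made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class NullBytesToString {

    // Hypothetical helper: decodes a slice of a byte buffer as UTF-8.
    // Passing a null buffer makes the String constructor throw a
    // NullPointerException while it validates the bounds.
    static String decode(byte[] buffer, int start, int length) {
        return new String(buffer, start, length, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: reports whether decode(...) fails with an NPE.
    static boolean throwsNpe(byte[] buffer, int start, int length) {
        try {
            decode(buffer, start, length);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] present = "headline".getBytes(StandardCharsets.UTF_8);
        // A populated buffer decodes normally.
        System.out.println(decode(present, 0, present.length));
        // A missing (null) buffer fails inside the String constructor.
        System.out.println("NPE on null buffer: " + throwsNpe(null, 0, 0));
    }
}
```

If this is what happens here, the interesting question is why the byte buffer for a string field inside the nested row struct would be null at the point where readString is called.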
Instead of using the Table API, I am reading the ORC files in batch mode as follows:
env
  .readFile(
    new OrcRowInputFormat(
      "",
      "SCHEMA_GIVEN_BEFORE",
      new HadoopConfiguration()
    ),
    "PATH_TO_FOLDER"
  )
  .writeAsText("file:///tmp/test/fromOrc")
Thanks for your support