[FLINK-33611] Support Large Protobuf Schemas - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.18.0
Fix Version/s: 1.20.0
Component/s: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Labels:
- pull-request-available

Description

Background

Flink serializes and deserializes Protobuf data by calling the decode or encode method of GeneratedProtoToRow_XXX.java, a class produced by codegen, which parses byte[] data into Protobuf Java objects. FLINK-32650 introduced the ability to split the generated code to improve performance for large Protobuf schemas. However, this is still not sufficient for some larger Protobuf schemas: the generated code exceeds the Java constant pool size limit, and compiling it fails with errors like "Too many constants".

Solution

Since the code-splitting functionality is already in place, the main proposal is to reuse variable names across the scopes of different split methods, which greatly reduces the constant pool size. A second optimization is to split the last code segment only when its size exceeds the split threshold. Currently, the last segment of the generated code is always split, which can produce too many split methods and thus exceed the constant pool size limit.
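
The two optimizations can be illustrated with a small, self-contained sketch. This is not the Flink codegen: the class `SplitCodegenSketch`, the `generate` method, the `fieldValueN` naming scheme, and the emitted `message.get(...)` / `row.setField(...)` snippets are all hypothetical, and the count of distinct variable names is used only as a rough stand-in for constant pool pressure (what actually lands in the constant pool depends on the compiler and its settings).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitCodegenSketch {

    /** Outcome of one generation run: emitted split methods, main body, and variable names used. */
    static final class Result {
        final List<String> splitMethods = new ArrayList<>();
        final StringBuilder mainBody = new StringBuilder();
        final Set<String> distinctNames = new HashSet<>();
    }

    /**
     * Emits one hypothetical conversion snippet per field and moves the accumulated
     * segment into a split method whenever it grows past thresholdChars.
     *
     * @param reuseNames      restart variable numbering in every new method scope
     * @param alwaysSplitTail model the old behavior of always splitting the last segment
     */
    static Result generate(int fieldCount, int thresholdChars,
                           boolean reuseNames, boolean alwaysSplitTail) {
        Result result = new Result();
        StringBuilder segment = new StringBuilder();
        int nameCounter = 0;
        for (int i = 0; i < fieldCount; i++) {
            String name = "fieldValue" + nameCounter++;
            result.distinctNames.add(name);
            segment.append("Object ").append(name).append(" = message.get(").append(i).append("); ")
                   .append("row.setField(").append(i).append(", ").append(name).append(");\n");
            if (segment.length() > thresholdChars) {
                flushToSplitMethod(result, segment);
                if (reuseNames) {
                    nameCounter = 0; // a new method scope begins, so earlier names are free again
                }
            }
        }
        if (segment.length() > 0) {
            if (alwaysSplitTail) {
                flushToSplitMethod(result, segment);
            } else {
                // The loop already split every segment that exceeded the threshold,
                // so the remaining tail is within it and is inlined into the main body.
                result.mainBody.append(segment);
            }
        }
        return result;
    }

    /** Wraps the current segment in a new split method and calls it from the main body. */
    private static void flushToSplitMethod(Result result, StringBuilder segment) {
        String methodName = "split" + result.splitMethods.size();
        result.splitMethods.add(
                "private void " + methodName + "(Message message, Row row) {\n" + segment + "}\n");
        result.mainBody.append(methodName).append("(message, row);\n");
        segment.setLength(0);
    }

    public static void main(String[] args) {
        Result before = generate(1000, 4000, false, true);
        Result after = generate(1000, 4000, true, false);
        System.out.println("distinct names: " + before.distinctNames.size()
                + " -> " + after.distinctNames.size());
        System.out.println("split methods: " + before.splitMethods.size()
                + " -> " + after.splitMethods.size());
    }
}
```

In this sketch, restarting the numbering is safe because each segment ends up in exactly one method body, so a name never collides within its own scope. The number of distinct names is then bounded by the number of fields that fit in one segment rather than by the total field count.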

Issue Links

causes: FLINK-34408 VeryBigPbProtoToRowTest#testSimple fails with OOM (Closed)

is related to: FLINK-34403 VeryBigPbProtoToRowTest#testSimple cannot pass due to OOM (Resolved)

links to: GitHub Pull Request #23937

People

Assignee: Sai Sharath Dandi

Reporter: Sai Sharath Dandi

Votes: 0

Watchers: 2

Dates

Created: 21/Nov/23 17:26

Updated: 07/Feb/24 12:43

Resolved: 07/Feb/24 05:05