Description
Background
Apache Doris is a modern data warehouse for real-time analytics.
It delivers lightning-fast analytics on real-time data at scale.
Objectives
Dictionary encoding optimization
To save storage space, Doris applies dictionary encoding to string-type data in the storage layer when the cardinality is relatively low. Dictionary encoding maps each distinct string value to an integer code through a dictionary. The column data is then stored as integers, and the dictionary is stored separately. When the data is read, the integer codes are converted back to their string values by looking them up in the dictionary.
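The mapping described above can be sketched in a few lines of C++. This is only an illustration of the general technique, not the Doris implementation: the type DictEncodedColumn and the functions dict_encode and dict_decode are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration type, not a Doris class.
struct DictEncodedColumn {
    std::vector<std::string> dict;  // code -> string value, stored separately
    std::vector<uint32_t> codes;    // one integer code per row
};

// Assign each distinct string the next free code, in order of first appearance.
DictEncodedColumn dict_encode(const std::vector<std::string>& values) {
    DictEncodedColumn out;
    std::unordered_map<std::string, uint32_t> index;  // string value -> code
    for (const auto& v : values) {
        auto it = index.find(v);
        if (it == index.end()) {
            it = index.emplace(v, static_cast<uint32_t>(out.dict.size())).first;
            out.dict.push_back(v);
        }
        out.codes.push_back(it->second);
    }
    return out;
}

// Convert the integer codes back to strings by dictionary lookup.
std::vector<std::string> dict_decode(const DictEncodedColumn& col) {
    std::vector<std::string> values;
    values.reserve(col.codes.size());
    for (uint32_t code : col.codes) {
        assert(code < col.dict.size());  // every code must refer to a dictionary entry
        values.push_back(col.dict[code]);
    }
    return values;
}
```

With a low-cardinality input such as {"cn", "us", "cn", "cn", "de", "us"}, the dictionary holds only three entries and each row costs one small integer. When nearly every value is distinct, the dictionary approaches the size of the raw data and the encoding stops paying off.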
The storage layer does not know whether a column has low or high cardinality when the data comes in. The current implementation therefore encodes the first page with dictionary encoding; if the dictionary becomes too large, the column is treated as high cardinality and subsequent pages do not use dictionary encoding. However, even for high-cardinality columns, the dictionary page is still retained, which saves no storage space and adds memory overhead during reading as well as extra CPU overhead during decoding.
Optimizations can be made to reduce the memory and CPU overhead caused by dictionary encoding.
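The fallback behaviour described above can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not the actual Doris writer: the class ColumnWriterSketch, the max_dict_bytes limit, and the rule that the size check happens once per page are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class PageEncoding { kDict, kPlain };

// Hypothetical model of a column writer that starts with dictionary encoding
// and falls back to plain encoding once the dictionary grows too large.
class ColumnWriterSketch {
public:
    // max_dict_bytes is an assumed size limit, not a real Doris setting.
    explicit ColumnWriterSketch(size_t max_dict_bytes)
            : _max_dict_bytes(max_dict_bytes) {}

    // Encodes one page and returns the encoding that was used for it.
    PageEncoding add_page(const std::vector<std::string>& page) {
        assert(!page.empty());
        if (_fallback) {
            return PageEncoding::kPlain;  // dictionary is no longer extended
        }
        for (const auto& v : page) {
            // emplace only inserts when v is not in the dictionary yet
            if (_index.emplace(v, static_cast<uint32_t>(_index.size())).second) {
                _dict_bytes += v.size();
            }
        }
        if (_dict_bytes > _max_dict_bytes) {
            _fallback = true;  // high cardinality detected: later pages use plain
        }
        return PageEncoding::kDict;
    }

    bool fell_back() const { return _fallback; }
    size_t dict_entries() const { return _index.size(); }
    // Bytes still held by the dictionary, even after the fallback.
    size_t dict_bytes() const { return _dict_bytes; }

private:
    size_t _max_dict_bytes;
    size_t _dict_bytes = 0;
    bool _fallback = false;
    std::unordered_map<std::string, uint32_t> _index;  // string value -> code
};
```

In this model the dictionary built before the fallback is never released, which mirrors the retained dictionary page: readers still have to load and decode it even though most pages of the column are plain encoded.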
Recommended Skills
Familiar with C++ programming
Familiar with the storage layer of Doris
Mentor
Mentor: Xin Liao, Apache Doris Committer, liaoxinbit@gmail.com
Mentor: YongQiang Yang, Apache Doris PMC Member, dataroaring@gmail.com
Mailing List: dev@doris.apache.org
Website: https://doris.apache.org
Source Code: https://github.com/apache/doris