Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Implemented
Description
When using TestDFSIO to compare the random read performance of HDFS and Ozone, Ozone is much slower than HDFS. Here is the data from tests in a YCloud cluster.
Test Suite: TestDFSIO
Number of files: 64
File Size: 1024MB
Random read (execution time) | Round 1 (s) | Round 2 (s)
---|---|---
HDFS | 47.06 | 49.5
Ozone | 147.31 | 149.47
And for Ozone itself, sequential read is much faster than random read:
Ozone | Round 1 (s) | Round 2 (s) | Round 3 (s)
---|---|---|---
read execution time | 66.62 | 58.78 | 68.98
random read execution time | 147.31 | 149.47 | 147.09
While for HDFS, there is not much of a gap between its sequential read and random read execution times:
HDFS | Round 1 (s) | Round 2 (s)
---|---|---
read execution time | 51.53 | 44.88
random read execution time | 47.06 | 49.5
After some investigation, it was found that the total bytes read from the DNs in the TestDFSIO random read test are almost double the data size. Here the total data to read is 64 * 1024MB = 64GB, while the aggregated DN bytesReadChunk metric increases by 128GB after one test run.
The root cause is that when the client reads data, it aligns the requested range with "ozone.client.bytes.per.checksum", which is currently 1MB. For example, when reading 1 byte, the client sends a request to the DN to fetch 1MB of data. When reading 2 bytes whose offsets cross a 1MB boundary, the client fetches the first 1MB for the first byte and the second 1MB for the second byte. In random read mode, TestDFSIO uses a read buffer of 1,000,000 bytes (976.5KB), which is why the total bytes read from the DNs is roughly double the data size.
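To illustrate the read amplification, the sketch below (not actual Ozone client code; the helper method and sample offsets are made up for illustration) computes how many bytes the DN ends up serving once a read range is rounded out to bytes-per-checksum boundaries:
```java
// Minimal sketch: a read of [offset, offset + length) is assumed to be expanded
// to checksum-block boundaries before being sent to the DN.
public class ChecksumAlignmentSketch {

  /** Bytes fetched from the DN after rounding the range out to boundaries. */
  static long alignedBytes(long offset, long length, long bytesPerChecksum) {
    long start = (offset / bytesPerChecksum) * bytesPerChecksum;            // round down
    long end = ((offset + length + bytesPerChecksum - 1) / bytesPerChecksum)
        * bytesPerChecksum;                                                 // round up
    return end - start;
  }

  public static void main(String[] args) {
    long oneMB = 1L << 20;       // 1,048,576 bytes
    long buffer = 1_000_000L;    // TestDFSIO random-read buffer size

    // A buffer-sized read that straddles a 1MB boundary pulls two full 1MB blocks.
    System.out.println(alignedBytes(524_288, buffer, oneMB));      // 2097152 (2MB)

    // With 16KB checksum blocks, the same read pulls only ~1,015,808 bytes.
    System.out.println(alignedBytes(524_288, buffer, 16 * 1024));  // 1015808
  }
}
```
Since a 976.5KB buffer almost always straddles a 1MB boundary, nearly every random read fetches 2MB from the DN, matching the observed 2x bytesReadChunk growth.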
For comparison, HDFS uses the property "file.bytes-per-checksum", which is 512 bytes by default.
To improve Ozone random read performance, a straightforward idea is to use a smaller default value for "ozone.client.bytes.per.checksum". Here we tested 1MB, 16KB and 8KB, collecting the data with TestDFSIO (64 files, each 512MB):
ozone.client.bytes.per.checksum | write1 (s) | write2 (s) | write3 (s) | read1 (s) | read2 (s) | read3 (s) | read average | random read1 (s) | random read2 (s) | random read3 (s) | random average
---|---|---|---|---|---|---|---|---|---|---|---
1MB | 163.01 | 163.34 | 141.9 | 47.25 | 51.86 | 52.02 | 50.28 | 114.42 | 90.38 | 97.83 | 100.88
16KB | 160.6 | 144.43 | 165.08 | 63.36 | 67.68 | 69.94 | 66.89 | 55.94 | 72.14 | 55.43 | 61.17
8KB | 149.97 | 161.01 | 161.57 | 66.46 | 61.61 | 63.17 | 63.75 | 62.06 | 71.93 | 58.56 | 64.18
From the above data, we can see that for the same amount of data:
- write: the execution times show no obvious differences across all three cases
- sequential read: 1MB bytes.per.checksum gives the best execution time; 16KB and 8KB have similar execution times
- random read: 1MB gives the worst execution time; 16KB and 8KB have similar execution times
- for both 16KB and 8KB bytes.per.checksum, sequential read and random read execution times are close, similar to HDFS behavior
Changing bytes.per.checksum from 1MB to 16KB drops sequential read performance slightly, but the gain in random read performance is much larger. Applications that rely heavily on random reads, such as HBase, Impala and Iceberg (Parquet), will all benefit from this.
So this task proposes to change the ozone.client.bytes.per.checksum default value from the current 1MB to 16KB, and to lower the property's minimum limit from 16KB to 8KB, to improve overall read performance.
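As a side note, a client can already override the property without waiting for the default change; the following is a minimal sketch assuming a standard Hadoop Configuration-based client setup:
```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: override ozone.client.bytes.per.checksum for one client.
// The property name is the one discussed above; the value format follows the
// usual Hadoop/Ozone storage-size convention.
public class ClientChecksumConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Smaller checksum blocks reduce read amplification for random reads.
    conf.set("ozone.client.bytes.per.checksum", "16KB");
    // Pass 'conf' to the FileSystem / Ozone client factory used by the application.
  }
}
```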
Attachments
Issue Links
- is related to
  - HDDS-10043 java.lang.OutOfMemoryError on FSDataOutputStream write ops (Resolved)
  - HDDS-7593 Supporting HSync and lease recovery (Resolved)
- links to