[HDFS-14617] Improve fsimage load time by writing sub-sections to the fsimage index - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.10.0, 3.3.0
Component/s: namenode
Labels: None
Target Version/s: 3.3.0
Release Note:

This change allows the inode and inode directory sections of the fsimage to be loaded in parallel. Tests on large images have shown this change to reduce the image load time to about 50% of the pre-change run time.

It works by writing sub-section entries to the image index, effectively splitting each image section into many sub-sections which can be processed in parallel. By default 12 sub-sections per image section are created when the image is saved, and 4 threads are used to load the image at startup.

The feature is disabled by default and can be enabled by setting dfs.image.parallel.load to true. Once enabled, it applies only to images with more than 1M inodes (dfs.image.parallel.inode.threshold). After the feature is enabled, the next HDFS checkpoint will write the image sub-sections, and subsequent namenode restarts can load the image in parallel.

An image with the parallel sections can be read even if the feature is disabled, but HDFS versions without this Jira cannot load an image with parallel sections. OIV can process a parallel-enabled image without issues.

Key configuration parameters are:

dfs.image.parallel.load = false - Enables or disables the feature.

dfs.image.parallel.target.sections = 12 - The target number of subsections. Aim for 2 to 3 times the number of dfs.image.parallel.threads.

dfs.image.parallel.inode.threshold = 1000000 - Only save and load in parallel if the image has more than this number of inodes.

dfs.image.parallel.threads = 4 - The number of threads used to load the image. Testing has shown 4 to be optimal, but this may depend on the environment.
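
The way these four settings interact can be sketched as below. This is an illustrative model only, not the namenode's code: the class `ParallelImageConfig` and the functions `use_parallel` and `sections_in_recommended_range` are hypothetical names, and only the property names and default values come from the release note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelImageConfig:
    """Hypothetical holder for the four keys, with the defaults stated above."""
    parallel_load: bool = False        # dfs.image.parallel.load
    target_sections: int = 12          # dfs.image.parallel.target.sections
    inode_threshold: int = 1_000_000   # dfs.image.parallel.inode.threshold
    threads: int = 4                   # dfs.image.parallel.threads


def use_parallel(conf: ParallelImageConfig, inode_count: int) -> bool:
    """Assumed gate: the feature is on AND the image exceeds the inode threshold."""
    return conf.parallel_load and inode_count > conf.inode_threshold


def sections_in_recommended_range(conf: ParallelImageConfig) -> bool:
    """Checks the '2 to 3 times the thread count' guidance for target sections."""
    return 2 * conf.threads <= conf.target_sections <= 3 * conf.threads


defaults = ParallelImageConfig()
enabled = ParallelImageConfig(parallel_load=True)

print(use_parallel(defaults, 316_000_000))  # feature switched off
print(use_parallel(enabled, 316_000_000))   # switched on, large image
print(use_parallel(enabled, 500_000))       # switched on, image below threshold
print(sections_in_recommended_range(defaults))
```

With the defaults, 12 target sections is exactly 3 times the 4 loader threads, which sits at the upper end of the recommended range.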


Description

Loading an fsimage is essentially a single-threaded process. The current fsimage is written out in sections, e.g. iNode, iNode_Directory, Snapshots, Snapshot_Diff etc. Then, at the end of the file, an index is written that contains the offset and length of each section. The image loader code uses this index to initialize an input stream to read and process each section. It is important that one section is fully loaded before another is started, as the next section depends on the results of the previous one.
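
As a rough illustration of the index-driven serial load described above, here is a toy model in Python. It is a sketch under simplifying assumptions, not the real loader: the actual fsimage is a binary format read by Java code, and the names `IndexEntry`, `load_serial` and the handler map are hypothetical.

```python
import io
from typing import Callable, Dict, List, NamedTuple


class IndexEntry(NamedTuple):
    """One index record: which section, where it starts, how long it is."""
    name: str
    offset: int
    length: int


def load_serial(image: bytes, index: List[IndexEntry],
                handlers: Dict[str, Callable[[bytes], None]]) -> None:
    """Process sections strictly in index order, one at a time."""
    stream = io.BytesIO(image)
    for entry in index:
        stream.seek(entry.offset)
        payload = stream.read(entry.length)
        if len(payload) != entry.length:
            raise ValueError(f"section {entry.name} is truncated")
        # The next section is only started once this handler has returned.
        handlers[entry.name](payload)


# Toy image: two sections laid out back to back.
image = b"AAAA" + b"BBBBBB"
index = [IndexEntry("inode", 0, 4), IndexEntry("inode_dir", 4, 6)]

seen: List[str] = []
handlers = {
    "inode": lambda data: seen.append(f"inode:{len(data)}"),
    "inode_dir": lambda data: seen.append(f"inode_dir:{len(data)}"),
}
load_serial(image, index, handlers)
print(seen)
```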

What I would like to propose is the following:

1. When writing the image, we can optionally output sub-sections to the index. That way, a given section would effectively be split into several sections, e.g.:

   inode_section offset 10 length 1000
     inode_sub_section offset 10 length 500
     inode_sub_section offset 510 length 500
     
   inode_dir_section offset 1010 length 1000
     inode_dir_sub_section offset 1010 length 500
     inode_dir_sub_section offset 1510 length 500

Here you can see we still have the original section index, but then we also have sub-section entries that cover the entire section. Then a processor can either read the full section in serial, or read each sub-section in parallel.

2. In the Image Writer code, we should set a target number of sub-sections, and then based on the total inodes in memory, it will create that many sub-sections per major image section. I think the only sections worth doing this for are inode, inode_reference, inode_dir and snapshot_diff. All others tend to be fairly small in practice.

3. If the image has fewer inodes than some threshold (e.g. 10M), then don't bother with the sub-sections, as a serial load only takes a few seconds at that scale.

4. The image loading code can then have a switch to enable 'parallel loading' and a 'number of threads' where it uses the sub-sections, or if not enabled falls back to the existing logic to read the entire section in serial.
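
The four steps above can be sketched in a few lines of Python. This is a simplified model and not the patch itself: it splits a section evenly by byte length, whereas the proposal sizes sub-sections from the inode count, and the names `Entry`, `split_section` and `load_section` are hypothetical. The per-sub-section "work" is just a byte sum standing in for real parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    offset: int
    length: int


def split_section(section: Entry, target: int) -> List[Entry]:
    """Split a section into contiguous sub-sections that cover it exactly."""
    count = max(1, min(target, section.length))
    base, extra = divmod(section.length, count)
    subs, offset = [], section.offset
    for i in range(count):
        length = base + (1 if i < extra else 0)  # spread any remainder
        subs.append(Entry(f"{section.name}_sub", offset, length))
        offset += length
    return subs


def load_section(image: bytes, section: Entry, subs: List[Entry],
                 parallel: bool, threads: int) -> int:
    """Sum the section's bytes, either in one pass or per sub-section."""
    def process(e: Entry) -> int:
        return sum(image[e.offset:e.offset + e.length])

    if not parallel or not subs:
        return process(section)  # fallback: read the entire section in serial
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(process, subs))


image = bytes(range(256)) * 8                 # 2048 bytes of toy data
inode_section = Entry("inode_section", 10, 1000)
subs = split_section(inode_section, 2)

serial = load_section(image, inode_section, subs, parallel=False, threads=1)
parallel = load_section(image, inode_section, subs, parallel=True, threads=4)
print(subs)
print(serial == parallel)
```

Keeping the original section entry alongside its sub-sections is what lets a reader choose either path: the serial branch ignores `subs` entirely, while the parallel branch relies on the sub-sections covering the section with no gaps or overlaps.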

Working with a large image of 316M inodes and 35GB on disk, I have a proof of concept of this change working, allowing just inode and inode_dir to be loaded in parallel, but I believe inode_reference and snapshot_diff can be made parallel with the same technique.

Some benchmarks I have are as follows:

Threads   1     2     3     4 
--------------------------------
inodes    448   290   226   189 
inode_dir 326   211   170   161 
Total     927   651   535   488 (MD5 calculation about 100 seconds)

The above table shows the time in seconds to load the inode section and the inode_directory section, and then the total load time of the image.

With 4 threads using the above technique, we are able to more than halve the load time of the two sections. The patch in HDFS-13694 would take a further 100 seconds off the run time, going from 927 seconds to 388, which is a significant improvement. Adding more threads beyond 4 has diminishing returns, as there are some synchronized points in the loading code to protect the in-memory structures.
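
The arithmetic behind these figures can be checked with a short script. The timings are copied from the table above; the helper names `speedups` and `fraction_of_serial` are illustrative only, and the 100 second MD5 figure is the approximate value quoted next to the table.

```python
# Load times in seconds, copied from the benchmark table (columns = 1..4 threads).
timings = {
    "inodes":    [448, 290, 226, 189],
    "inode_dir": [326, 211, 170, 161],
    "Total":     [927, 651, 535, 488],
}
MD5_SECONDS = 100  # approximate MD5 calculation cost quoted with the table


def speedups(row):
    """Speedup of each thread count relative to the single-threaded time."""
    return [row[0] / t for t in row]


def fraction_of_serial(row, threads):
    """Load time at `threads` as a fraction of the single-threaded time."""
    return row[threads - 1] / row[0]


for name, row in timings.items():
    formatted = ", ".join(f"{s:.2f}x" for s in speedups(row))
    print(f"{name:<10} {formatted}")

# Projected total with 4 threads once the MD5 cost is removed.
projected_total = timings["Total"][3] - MD5_SECONDS
print("projected total:", projected_total, "seconds")
```

Both per-section fractions at 4 threads come out below 0.5, which is the "more than halve" claim, while the per-thread gain shrinks with each added thread.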

Attachments

Name                     Date             Size   Author
---------------------------------------------------------------
flamegraph.serial.svg    04/Aug/19 07:50  59 kB  Xiaoqiao He
flamegraph.parallel.svg  04/Aug/19 07:50  69 kB  Xiaoqiao He
SerialLoading.svg        19/Jul/19 15:17  88 kB  Xiaoqiao He
ParallelLoading.svg      19/Jul/19 15:17  88 kB  Xiaoqiao He
dirs-single.svg          12/Jul/19 08:05  51 kB  Stephen O'Donnell
inodes.svg               12/Jul/19 08:05  32 kB  Stephen O'Donnell
HDFS-14617.001.patch     28/Jun/19 15:19  31 kB  Stephen O'Donnell

Issue Links

is fixed by

HDFS-16147 load fsimage with parallelization and compression

Patch Available

is related to

HDFS-15792 ClasscastException while loading FSImage

Resolved

HDFS-15493 Update block map and name cache in parallel while loading fsimage.

Resolved

relates to

HDFS-15985 Incorrect sorting will cause failure to load an FsImage file

In Progress

HDFS-14821 Make HDFS-14617 (fsimage sub-sections) off by default

Resolved

HDFS-15830 Support to make dfs.image.parallel.load reconfigurable

Resolved

HDFS-17573 Allow turn on both FSImage parallelization and compression

Resolved

HDFS-7784 load fsimage in parallel

Resolved

supercedes

HDFS-7784 load fsimage in parallel

Resolved

links to

GitHub Pull Request #1028

Sub-Tasks

1.

Upgrade OfflineImageViewer to be compatible with new FSImage format

Open

Unassigned

2.

Incorrect sorting will cause failure to load an FsImage file

In Progress

JiangHua Zhu

3.

Improve oiv tool to parse fsimage file in parallel with XML format

Open

Hongbing Wang
