[HDFS-15987] Improve oiv tool to parse fsimage file in parallel with delimited format - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: tools
Labels:
- pull-request-available

Target Version/s: 3.4.0
Hadoop Flags: Reviewed

Description

The purpose of this Jira is to improve the oiv tool so that it parses an fsimage file with sub-sections (see HDFS-14617) in parallel when using the Delimited format.

1. Serial parsing is time-consuming

When a large fsimage is parsed serially with the Delimited format (e.g. `hdfs oiv -p Delimited -t <tmp> ...`), the time is distributed across the stages as follows:

1) Loading string table:                 -> Not time consuming.
2) Loading inode references:             -> Not time consuming
3) Loading directories in INode section: -> Slightly time consuming (3%)
4) Loading INode directory section:      -> A bit time consuming (11%)
5) Output:                               -> Very time consuming (86%)

Therefore, output is the stage that benefits most from parallelization.

2. How to output in parallel

The sub-sections are grouped in order. Each thread processes one group and writes it to its own output file, and the per-thread output files are finally merged into a single file.
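
The scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class `ParallelOutputSketch` and the methods `groupInOrder` and `writeAll` are hypothetical names, and each sub-section is modeled simply as a list of already-formatted text lines.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: not the actual oiv implementation.
class ParallelOutputSketch {

  /** Split indices 0..n-1 into at most `threads` contiguous [start, end) ranges, in order. */
  static List<int[]> groupInOrder(int n, int threads) {
    List<int[]> groups = new ArrayList<>();
    int groupCount = Math.min(n, threads);
    int start = 0;
    for (int g = 0; g < groupCount; g++) {
      // Spread the remainder over the first groups so sizes differ by at most one.
      int size = n / groupCount + (g < n % groupCount ? 1 : 0);
      groups.add(new int[] {start, start + size});
      start += size;
    }
    return groups;
  }

  /** Each worker writes its group to its own temp file; the files are then merged in group order. */
  static void writeAll(List<List<String>> subSections, int threads, Path out) throws Exception {
    List<int[]> groups = groupInOrder(subSections.size(), threads);
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, groups.size()));
    try {
      List<Future<Path>> parts = new ArrayList<>();
      for (int[] range : groups) {
        parts.add(pool.submit(() -> {
          Path tmp = Files.createTempFile("oiv-part-", ".txt");
          List<String> lines = new ArrayList<>();
          for (int i = range[0]; i < range[1]; i++) {
            lines.addAll(subSections.get(i));
          }
          Files.write(tmp, lines, StandardCharsets.UTF_8);
          return tmp;
        }));
      }
      // Merge step: append the per-thread files in the original group order.
      try (OutputStream merged = Files.newOutputStream(out)) {
        for (Future<Path> part : parts) {
          Path tmp = part.get();
          Files.copy(tmp, merged);
          Files.delete(tmp);
        }
      }
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<List<String>> subSections = List.of(
        List.of("/a", "/a/f1"), List.of("/b"), List.of("/c", "/c/f2"));
    Path out = Files.createTempFile("oiv-merged-", ".txt");
    writeAll(subSections, 2, out);
    Files.readAllLines(out).forEach(System.out::println);
  }
}
```

Because the groups are contiguous and the merge appends the part files in group order, the merged file keeps the same line order as a serial run would produce in this sketch.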

3. The result of a test

 Input fsimage file info:
 3.4G, 12 sub-sections, 55976500 INodes
 -----------------------------------------
 Threads  TotalTime  OutputTime  MergeTime
 1        18m37s     16m18s      –
 4        8m7s       4m49s       41s
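
To make the figures in the table easier to compare, the speedup can be computed from the reported durations. The snippet below is a small sketch with hypothetical helper names (`SpeedupSketch`, `toSeconds`, `speedup`); it assumes durations are always written in the `<minutes>m<seconds>s` form used in the table.

```java
// Hypothetical helper for reading the table; not part of the oiv tool.
class SpeedupSketch {

  /** Convert a duration such as "18m37s" to seconds (assumes the "<min>m<sec>s" form). */
  static int toSeconds(String t) {
    int m = t.indexOf('m');
    int minutes = Integer.parseInt(t.substring(0, m));
    int seconds = Integer.parseInt(t.substring(m + 1, t.length() - 1));
    return minutes * 60 + seconds;
  }

  /** Ratio of the serial duration to the parallel duration. */
  static double speedup(String serial, String parallel) {
    return (double) toSeconds(serial) / toSeconds(parallel);
  }

  public static void main(String[] args) {
    System.out.println("Total speedup:  " + speedup("18m37s", "8m7s"));
    System.out.println("Output speedup: " + speedup("16m18s", "4m49s"));
  }
}
```

With the numbers above, 4 threads give roughly a 2.3x speedup in total time and roughly a 3.4x speedup in the output stage alone.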

Attachments

Improve_oiv_tool_001.pdf
27/Aug/21 08:36
76 kB
Hongbing Wang

Issue Links

links to

GitHub Pull Request #2918

People

Assignee: Hongbing Wang

Reporter: Hongbing Wang

Votes: 1

Watchers: 9

Dates

Created: 16/Apr/21 08:55

Updated: 12/Feb/24 06:39

Resolved: 22/Mar/22 14:34

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

6h 40m