[HDFS-550] DataNode restarts may introduce corrupt/duplicated/lost replicas when handling detached replicas - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.21.0
Fix Version/s: Append Branch
Component/s: datanode
Labels: None
Hadoop Flags: Reviewed

Description

Current trunk first calls detach to unlink a finalized replica before appending to its block. The unlink temporarily copies the block file from the "current" subtree to a directory called "detach" under the volume's data directory, then copies it back when the unlink succeeds. On restart, a datanode recovers from a failed unlink by copying the replicas under "detach" to "current".
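The copy-then-copy-back flow described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual DataNode code: the class `DetachSketch` and the methods `unlink` and `recover` are hypothetical names, and it assumes the point of the unlink is to give the block file its own copy so that an append does not touch data shared through a hard link.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the unlink/recover flow; names are not from HDFS.
public class DetachSketch {

    // Copy the block file into detachDir, then move the copy back over the
    // original path so the block file no longer shares storage with any link.
    static void unlink(Path blockFile, Path detachDir) throws IOException {
        Files.createDirectories(detachDir);
        Path tmp = detachDir.resolve(blockFile.getFileName());
        Files.copy(blockFile, tmp, StandardCopyOption.REPLACE_EXISTING);
        // A crash between these two steps leaves a file behind in detachDir.
        Files.move(tmp, blockFile, StandardCopyOption.REPLACE_EXISTING);
    }

    // Restart-time recovery as the description states it: whatever is left
    // under detachDir is moved into currentDir. Returns the number moved.
    static int recover(Path detachDir, Path currentDir) throws IOException {
        if (!Files.isDirectory(detachDir)) {
            return 0;
        }
        List<Path> leftovers = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(detachDir)) {
            for (Path p : stream) {
                leftovers.add(p);
            }
        }
        for (Path p : leftovers) {
            Files.move(p, currentDir.resolve(p.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return leftovers.size();
    }
}
```

Note that `recover` in this sketch has no record of where a leftover file originally lived and performs no integrity check before moving it.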

There are two bugs in this implementation:
1. The "detach" directory is not included in a snapshot, so a rollback causes the "detaching" replicas to be lost.
2. After a replica is copied to the "detach" directory, the information about its original location is lost. The current implementation erroneously assumes that the replica to be unlinked is under "current". This can leave two replicas with the same block id coexisting in a datanode. Also, if a replica under "detach" is corrupt, it is moved to "current" without being detected, polluting the datanode's data.
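To make the duplicate-replica symptom in the second bug concrete, here is a small sketch of one plausible way it can arise. It assumes the original block file lived in a subdirectory of "current" (the `subdir0` name in the usage note is an assumption for illustration), and the class `DuplicateReplicaSketch` with its methods `naiveRecover` and `findReplicas` is hypothetical, not part of HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names are not from HDFS.
public class DuplicateReplicaSketch {

    // Recovery that assumes the replica belongs directly under currentDir,
    // because the original location was not recorded anywhere.
    static void naiveRecover(Path detachedFile, Path currentDir) throws IOException {
        Files.move(detachedFile, currentDir.resolve(detachedFile.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    // Walk the "current" subtree and list every file with the given block
    // file name; more than one result means duplicate replicas.
    static List<Path> findReplicas(Path currentDir, String blockFileName) throws IOException {
        try (Stream<Path> files = Files.walk(currentDir)) {
            return files.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().equals(blockFileName))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

If a file such as `current/subdir0/blk_42` still exists when `detach/blk_42` is recovered with `naiveRecover`, `findReplicas` reports two paths for the same block file name. The sketch also moves the detached file without any checksum verification, which mirrors the undetected-corruption concern.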

Attachments

- detach.patch, 01/Sep/09 18:55, 21 kB, Hairong Kuang
- detach1.patch, 28/Sep/09 19:17, 22 kB, Hairong Kuang
- detach2.patch, 28/Sep/09 22:05, 23 kB, Hairong Kuang

People

Assignee: Hairong Kuang

Reporter: Hairong Kuang

Votes: 0

Watchers: 4

Dates

Created: 17/Aug/09 23:16

Updated: 28/Sep/09 22:05

Resolved: 28/Sep/09 22:05