[HDFS-13132] Ozone: Handle datanode failures in Storage Container Manager - ASF JIRA

Details

Type: Sub-task
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: HDFS-7240
Fix Version/s: HDFS-7240
Component/s: ozone
Labels: None
Target Version/s: HDFS-7240

Description

Currently, SCM receives heartbeats, along with container reports, from the datanodes in the cluster. In addition, the Ratis leader receives heartbeats from the nodes in a Raft ring. Ratis heartbeats are sent at a much shorter interval (500 ms) than SCM heartbeats (30 s), so it is considered safe to assume that a datanode is really lost when SCM misses heartbeats from that node.
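To make the timing argument concrete, here is a minimal sketch of the arithmetic, using the 500 ms and 30 s heartbeat intervals above and the 1.5 minute stale interval quoted in step 1 below. The class, constants, and method names are hypothetical and not taken from the patch, and treating a node as stale exactly when the interval has elapsed is an assumption.

```java
import java.time.Duration;

/** Hypothetical helper; names are illustrative and not taken from the patch. */
final class HeartbeatStalenessSketch {
    // Intervals quoted in the issue description.
    static final Duration RATIS_HEARTBEAT = Duration.ofMillis(500);
    static final Duration SCM_HEARTBEAT = Duration.ofSeconds(30);
    static final Duration STALE_INTERVAL = Duration.ofSeconds(90); // 1.5 minutes

    /** Number of whole heartbeats of the given period that fit into the stale interval. */
    static long heartbeatsPerStaleInterval(Duration period) {
        return STALE_INTERVAL.toMillis() / period.toMillis();
    }

    /**
     * Assumption: a node counts as stale once no heartbeat has been
     * seen for at least the stale interval.
     */
    static boolean isStale(long lastHeartbeatMillis, long nowMillis) {
        return nowMillis - lastHeartbeatMillis >= STALE_INTERVAL.toMillis();
    }
}
```

Under these numbers, a node that is declared stale has missed 3 consecutive SCM heartbeats, and in the same window the Ratis leader would have missed 180 heartbeats from it, which is why a missed SCM heartbeat is taken as strong evidence that the node is gone.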

Pipeline recovery will proceed in the following steps:

1) As noted earlier, SCM will identify a dead DN via the heartbeats. The current stale interval is 1.5 minutes. Once a stale node has been identified, SCM will find the list of containers for the pipelines that the datanode was part of.

2) SCM sends a close container command to the datanodes. Note that at this point the Ratis ring still has 2 nodes, so Ratis can still guarantee consistency.

3) If another node dies before the close container command has succeeded, Ratis can no longer guarantee consistency of the data being written or of the close container operation. In this case the pipeline will be marked as being in an inconsistent state.

4) Closed containers will be replicated via the close container replication protocol.
If the dead datanode comes back, SCM will ask the datanode to format all of its open containers as part of the re-register command.

5) The healthy nodes are returned to the free node pool for the next pipeline allocation.

6) Read operations on closed containers will succeed. However, read operations on an open container on a single-node cluster will be disallowed; they will only be allowed under a special flag, the ReadInconsistentData flag.
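The steps above can be sketched roughly as follows. This is a simplified in-memory model, not the code in the attached patches: every class, field, and method name is hypothetical, the close command is represented by a recorded string rather than a real command, and a second failure is modeled as two nodes being dead at the same time instead of a node dying while the close is in flight.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the recovery steps; none of these names come from the patch. */
final class PipelineRecoverySketch {
    enum PipelineState { OPEN, CLOSED, INCONSISTENT }

    static final class Pipeline {
        final Set<String> nodes;            // datanode IDs in the Raft ring
        final List<String> openContainers;  // containers still open on this pipeline
        PipelineState state = PipelineState.OPEN;

        Pipeline(Set<String> nodes, List<String> openContainers) {
            this.nodes = new HashSet<>(nodes);
            this.openContainers = new ArrayList<>(openContainers);
        }
    }

    final Set<String> freeNodePool = new HashSet<>();
    final List<String> closeCommandsSent = new ArrayList<>();

    /** Applies steps 2, 3 and 5 to one pipeline, given the datanodes SCM considers dead. */
    void handleDeadNodes(Pipeline pipeline, Set<String> deadNodes) {
        Set<String> healthy = new HashSet<>(pipeline.nodes);
        healthy.removeAll(deadNodes);
        if (healthy.size() == pipeline.nodes.size()) {
            return; // no member of this pipeline is dead
        }
        // Assumption: the ring needs a majority of its members (2 of 3) to stay consistent.
        int majority = pipeline.nodes.size() / 2 + 1;
        if (healthy.size() < majority) {
            pipeline.state = PipelineState.INCONSISTENT; // step 3
            return;
        }
        // Step 2: ask every surviving node to close each open container.
        for (String container : pipeline.openContainers) {
            for (String node : healthy) {
                closeCommandsSent.add(node + ":" + container);
            }
        }
        pipeline.openContainers.clear();
        pipeline.state = PipelineState.CLOSED;
        // Step 5: the survivors become available for the next pipeline allocation.
        freeNodePool.addAll(healthy);
    }
}
```

For a three-node pipeline with two open containers, calling `handleDeadNodes` with one dead node records four close commands, marks the pipeline closed, and returns the two survivors to the pool. With two dead nodes, the pipeline is only marked inconsistent and nothing else happens. Replication of closed containers (step 4) and the read path (step 6) are deliberately left out of the sketch.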

This JIRA will introduce the mechanism to identify and handle datanode failure.
However, the following will be handled as part of separate issues: a) two nodes failing simultaneously, b) returning the nodes to a healthy state, c) allowing inconsistent data reads, and d) purging of open containers on a zombie node.

Attachments

HDFS-13132-HDFS-7240.001.patch (55 kB), 11/Feb/18 14:57, Mukul Kumar Singh
HDFS-13132-HDFS-7240.002.patch (56 kB), 12/Feb/18 07:24, Mukul Kumar Singh

Issue Links

is a parent of: HDFS-13134 Ozone: Format open containers on datanode restart (Patch Available)

People

Assignee: Shashikant Banerjee

Reporter: Mukul Kumar Singh

Votes: 0

Watchers: 4

Dates

Created: 11/Feb/18 14:55

Updated: 28/May/18 07:30