Description
Recently came across a k8s environment where some datanode pods randomly fail to stay connected to all namenode pods (e.g. the last heartbeat time sometimes stays above 2 hours). When a standby namenode becomes active, any datanode that has not been heartbeating to it for quite some time is unable to send further block reports, leading to missing replicas immediately after the namenode failover, which can only be resolved by restarting the datanode pod.
While the issue seems environment specific, BPServiceActor's offerService could use some logging improvements. It would also be useful to expose the namenode status in BPServiceActorInfo, to identify any lag on the datanode side in recognizing the updated Active namenode status via heartbeats.
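To make the suggestion concrete, below is a minimal sketch of the kind of change intended, assuming a per-actor field that remembers the HA state reported with each namenode heartbeat, a log line on every state transition, and that value added to the per-actor info map. The class, field, and key names (e.g. NamenodeHaState) are illustrative assumptions and not the actual BPServiceActor code; the real implementation would use Hadoop's HAServiceProtocol.HAServiceState and the existing actor info map.
{code:java}
// Sketch only: class, field, method and key names are illustrative assumptions,
// not the actual BPServiceActor internals.
import java.util.LinkedHashMap;
import java.util.Map;

public class BpServiceActorHaStateSketch {

  // The real code would use Hadoop's HAServiceProtocol.HAServiceState;
  // a local enum keeps this sketch self-contained.
  enum HaState { INITIALIZING, ACTIVE, STANDBY, OBSERVER }

  private volatile HaState lastKnownHaState = HaState.INITIALIZING;
  private volatile long lastHeartbeatTimeMs;

  /** Called from the heartbeat path with the state the namenode reported. */
  void recordHeartbeatResponse(HaState reportedState, long nowMs) {
    if (reportedState != lastKnownHaState) {
      // The extra logging suggested for offerService: make HA state
      // transitions visible so a datanode lagging behind a failover is
      // easy to spot from its own log.
      System.out.printf("Namenode HA state changed from %s to %s%n",
          lastKnownHaState, reportedState);
      lastKnownHaState = reportedState;
    }
    lastHeartbeatTimeMs = nowMs;
  }

  /** Per-actor info in the spirit of BPServiceActorInfo, with HA state added. */
  Map<String, String> getActorInfoMap(long nowMs) {
    Map<String, String> info = new LinkedHashMap<>();
    info.put("NamenodeHaState", lastKnownHaState.toString());
    info.put("LastHeartbeat", (nowMs - lastHeartbeatTimeMs) / 1000 + "s");
    return info;
  }

  public static void main(String[] args) throws InterruptedException {
    BpServiceActorHaStateSketch actor = new BpServiceActorHaStateSketch();
    actor.recordHeartbeatResponse(HaState.STANDBY, System.currentTimeMillis());
    Thread.sleep(1000);
    // Simulate the namenode this actor talks to becoming active.
    actor.recordHeartbeatResponse(HaState.ACTIVE, System.currentTimeMillis());
    System.out.println(actor.getActorInfoMap(System.currentTimeMillis()));
  }
}
{code}
With something like this, a datanode that keeps reporting a stale NamenodeHaState (or a large LastHeartbeat) after a failover would point directly at the heartbeat lag described above.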