[HBASE-3604] Two region servers think that they own the same region: data loss - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.90.0
Fix Version/s: None
Component/s: regionserver
Labels: None

Description

I observed this on a 100 node cluster that is constantly doing about 500K ops/second.

The region server on machine A was servicing IOs for a particular region. Then the machine went into a bad state where it was ping-able but not ssh-able. The master detected that there was a problem with machine A and reassigned the region to machine B. The regionserver on machine B opened the region and all the required HFiles for this region. After two hours, the NameNode received a delete request for one of the HFiles from machine A and happily moved the file to the HDFS trash. After another 3 hours or so, the regionserver on machine B tried to read contents from that HFile but failed because the file had been moved earlier. The region server on B is now stuck, and data loss is possible.

The problem stems from the fact that although the master and ZK reassigned the region, the old regionserver was possibly not dead.
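One general way to guard against this kind of stale owner is a fencing token (ownership epoch) that the storage layer checks before any destructive operation. The sketch below is illustrative only: `FencedFileStore`, `assign`, and `delete` are hypothetical names invented here, not HDFS or HBase APIs, and this is not necessarily how the issue was resolved upstream. It shows machine A's late delete being rejected because the region's epoch advanced when the region was reassigned to machine B.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the file system; not an HDFS or HBase API. */
class FencedFileStore {
    // Highest ownership epoch granted so far for each region.
    private final Map<String, Long> currentEpoch = new HashMap<>();
    // Files that currently exist, grouped by region.
    private final Map<String, Set<String>> files = new HashMap<>();

    /** Called on every (re)assignment; the new epoch invalidates all older ones. */
    long assign(String region) {
        long next = currentEpoch.getOrDefault(region, 0L) + 1;
        currentEpoch.put(region, next);
        return next;
    }

    void addFile(String region, String file) {
        files.computeIfAbsent(region, r -> new HashSet<>()).add(file);
    }

    boolean exists(String region, String file) {
        return files.getOrDefault(region, new HashSet<>()).contains(file);
    }

    /** Deletes the file only if the caller holds the latest epoch for the region. */
    boolean delete(String region, String file, long callerEpoch) {
        long latest = currentEpoch.getOrDefault(region, 0L);
        if (callerEpoch != latest) {
            return false; // stale owner: the request is fenced off
        }
        Set<String> regionFiles = files.get(region);
        return regionFiles != null && regionFiles.remove(file);
    }
}

class FencingDemo {
    public static void main(String[] args) {
        FencedFileStore store = new FencedFileStore();
        long epochA = store.assign("region-1");   // machine A owns the region
        store.addFile("region-1", "hfile-1");
        long epochB = store.assign("region-1");   // master reassigns to machine B

        // Machine A wakes up hours later and tries to delete with its old epoch.
        System.out.println("stale delete accepted: "
                + store.delete("region-1", "hfile-1", epochA));
        System.out.println("file still readable by B: "
                + store.exists("region-1", "hfile-1"));
        // Machine B, the current owner, is still allowed to delete.
        System.out.println("current owner delete accepted: "
                + store.delete("region-1", "hfile-1", epochB));
    }
}
```

The check has to live in the component that performs the delete, since the stale regionserver cannot be trusted to notice on its own that it has lost ownership.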

Issue Links

is duplicated by

HBASE-2231 Compaction events should be written to HLog (Closed)

People

Assignee: Unassigned

Reporter: Dhruba Borthakur

Votes: 0

Watchers: 3

Dates

Created: 04/Mar/11 11:30

Updated: 12/Jun/22 00:53

Resolved: 04/Mar/11 19:02