Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.1.7
-
None
-
Reviewed
Description
In AssignmentManager#processRegionsInTransition, when handling a region in the M_ZK_REGION_OFFLINE state, we use a handler to reassign the region. However, when calling assign, we pass false for setOfflineInZK, so the ZK node is not set again:
    case M_ZK_REGION_OFFLINE:
      // Insert in RIT and resend to the regionserver
      regionStates.updateRegionState(rt, State.PENDING_OPEN);
      final RegionState rsOffline = regionStates.getRegionState(regionInfo);
      this.executorService.submit(
        new EventHandler(server, EventType.M_MASTER_RECOVERY) {
          @Override
          public void process() throws IOException {
            ReentrantLock lock = locker.acquireLock(regionInfo.getEncodedName());
            try {
              RegionPlan plan = new RegionPlan(regionInfo, null, sn);
              addPlan(encodedName, plan);
              assign(rsOffline, false, false); // we decide to not to setOfflineInZK
            } finally {
              lock.unlock();
            }
          }
        });
      break;
But when setOfflineInZK is false, we pass a ZK node version of -1 to the regionserver, meaning that the ZK node does not exist. In fact, the offline ZK node does exist, with a different version. The RegionServer reports a failure to open the region because of this mismatch.
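The version mismatch described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: the class `VersionedTransitionSketch` and its methods `create`, `versionOf`, and `transition` are hypothetical names invented for illustration, not HBase or ZooKeeper APIs, and the second call is not meant to represent the actual patch. It only shows why a compare-and-set style transition fails when the caller expects version -1 while the node exists at version 3.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of versioned nodes; hypothetical, not the HBase or ZooKeeper API. */
class VersionedTransitionSketch {
    /** Sentinel meaning "the caller believes the node does not exist". */
    static final int NO_NODE = -1;

    private static final class Node {
        String state;
        int version;
        Node(String state, int version) {
            this.state = state;
            this.version = version;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    /** Creates (or overwrites) a node with the given state and version. */
    void create(String region, String state, int version) {
        nodes.put(region, new Node(state, version));
    }

    /** Returns the current version, or NO_NODE if the node is absent. */
    int versionOf(String region) {
        Node n = nodes.get(region);
        return n == null ? NO_NODE : n.version;
    }

    /**
     * Moves the node from one state to another only if both the current
     * state and the expected version match; bumps the version on success.
     */
    boolean transition(String region, String from, String to, int expectedVersion) {
        Node n = nodes.get(region);
        if (n == null || !n.state.equals(from) || n.version != expectedVersion) {
            return false;
        }
        n.state = to;
        n.version++;
        return true;
    }

    public static void main(String[] args) {
        VersionedTransitionSketch zk = new VersionedTransitionSketch();
        String region = "57513956a7b671f4e8da1598c2e2970e";

        // The offline node already exists at version 3, as in the RS log.
        zk.create(region, "M_ZK_REGION_OFFLINE", 3);

        // Expected version -1 (node assumed absent): the transition is rejected.
        boolean stale = zk.transition(
            region, "M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING", NO_NODE);

        // Expected version read from the existing node: the transition succeeds.
        boolean fresh = zk.transition(
            region, "M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING", zk.versionOf(region));

        System.out.println("stale=" + stale + ", fresh=" + fresh);
    }
}
```

In this model the first attempt corresponds to the failed OFFLINE to OPENING transition seen on the RegionServer, while the second shows that the same transition goes through once the expected version matches the node that actually exists.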
This situation actually occurred in our test environment. The master receives the FAILED_OPEN ZK event and would normally retry later, but because of another bug (HBASE-17265), the region remains in the closed state forever.
The master assigns the region in RIT:
2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager] master.AssignmentManager: Processing 57513956a7b671f4e8da1598c2e2970e in state: M_ZK_REGION_OFFLINE
2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager] master.RegionStates: Transition {57513956a7b671f4e8da1598c2e2970e state=OFFLINE, ts=1479892306738, server=example.org,30003,1475893095003} to {57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, ts=1479892306842, server=example.org,30003,1479780976834}
2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager] master.AssignmentManager: Processed region 57513956a7b671f4e8da1598c2e2970e in state M_ZK_REGION_OFFLINE, on server: example.org,30003,1479780976834
2016-11-23 17:11:46,843 INFO [MASTER_SERVER_OPERATIONS-example.org:30001-0] master.AssignmentManager: Assigning test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e. to example.org,30003,1479780976834
The RegionServer received the open region request and created a new OpenRegionHandler to open the region, only to find that the RIT node's version was not what it expected. In the end, the RS transitioned the RIT ZK node to FAILED_OPEN:
2016-11-23 17:11:46,860 WARN [RS_OPEN_REGION-example.org:30003-1] coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to OPENING for region=57513956a7b671f4e8da1598c2e2970e
2016-11-23 17:11:46,861 WARN [RS_OPEN_REGION-example.org:30003-1] handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=57513956a7b671f4e8da1598c2e2970e
2016-11-23 17:11:46,860 WARN [RS_OPEN_REGION-example.org:30003-1] zookeeper.ZKAssign: regionserver:30003-0x15810b5f633015f, quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181, baseZNode=/test-hbase11-func2 Attempt to transition the unassigned node for 57513956a7b671f4e8da1598c2e2970e from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was version 3 not the expected version -1
The master received this ZK event and began handling RS_ZK_REGION_FAILED_OPEN:
2016-11-23 17:11:46,944 DEBUG [AM.ZK.Worker-pool2-t1] master.AssignmentManager: Handling RS_ZK_REGION_FAILED_OPEN, server=example.org,30003,1479780976834, region=57513956a7b671f4e8da1598c2e2970e, current_state={57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, ts=1479892306843, server=example.org,30003,1479780976834}
Attachments
Issue Links
- is related to
  HBASE-17275 Assign timeout may cause region to be unassigned forever (Closed)
- relates to
  HBASE-17265 Region left unassigned in master failover when region failed to open (Closed)