Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.0-incubating
-
None
Description
If Trafodion is stopped abruptly while a region server has outstanding recovery requests posted in ZooKeeper, the new TMs may be unable to start. This happens because the TM recovery thread reads the ZooKeeper entries and attempts to send the recovery resolution to the region server that posted each entry. The attempt fails with a connection error because that region server no longer exists.
The partial solution is to remove the ZooKeeper entries as part of startup so that the TM can start up without error.
This is safe to do because any region server that still needs recovery will repost its request to ZooKeeper, and the TM will have no trouble connecting to that region server.
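The startup cleanup described above might look roughly like the following. This is only a sketch: `RecoveryZNodeStore` is a hypothetical in-memory stand-in for the recovery entries in ZooKeeper, and the class names, method names, and paths are assumptions made for illustration, not the actual Trafodion or ZooKeeper API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the recovery entries kept in ZooKeeper.
final class RecoveryZNodeStore {
    // Entry path -> region server that posted the recovery request.
    private final Map<String, String> entries = new TreeMap<>();

    void post(String path, String regionServer) {
        entries.put(path, regionServer);
    }

    // Returns a snapshot of the entry paths so callers can delete while iterating.
    List<String> children() {
        return new ArrayList<>(entries.keySet());
    }

    void delete(String path) {
        entries.remove(path);
    }
}

// Hypothetical startup hook: removes every recovery entry left over from
// before the restart and reports how many were removed.
final class TmStartupCleanup {
    static int clearStaleRecoveryEntries(RecoveryZNodeStore store) {
        int removed = 0;
        for (String path : store.children()) {
            store.delete(path);
            removed++;
        }
        return removed;
    }
}
```

In this sketch a region server that still needs recovery would simply call `post` again after the cleanup, which mirrors the reposting behavior that makes the removal safe.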
An additional fix will be made to the TM so that it handles exceptions raised while communicating with region servers during recovery.
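One possible shape for that exception handling is sketched below. It assumes a hypothetical `RegionServerClient` interface and `RecoveryPass` class, neither of which is taken from the Trafodion code base. The idea is that a connection failure for one region server is recorded and skipped rather than aborting the whole recovery pass.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over the call that delivers a recovery resolution.
interface RegionServerClient {
    void sendRecoveryResolution(String regionServer) throws IOException;
}

final class RecoveryPass {
    // Tries every region server in turn. A failure to reach one server is
    // recorded and the loop continues, so the caller can decide whether to
    // retry later or discard the entry.
    static List<String> resolveAll(List<String> regionServers, RegionServerClient client) {
        List<String> unreachable = new ArrayList<>();
        for (String regionServer : regionServers) {
            try {
                client.sendRecoveryResolution(regionServer);
            } catch (IOException e) {
                // Assumption: a server that cannot be reached is treated as
                // gone, not as a fatal error for the TM.
                unreachable.add(regionServer);
            }
        }
        return unreachable;
    }
}
```

Returning the list of unreachable servers leaves the policy decision (retry, log, or delete the entry) to the caller, which keeps the sketch neutral about how the real fix behaves.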