Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
The test is simple so it is easy to find out the problem.
2018-02-21 04:53:32,230 INFO [Time-limited test] hbase.ResourceChecker(148): before: master.locking.TestLockProcedure#testTimeout Thread=218, OpenFileDescriptor=853, MaxFileDescriptor=60000, SystemLoadAverage=5075, ProcessCount=312, AvailableMemoryMB=5373 2018-02-21 04:53:32,234 WARN [Time-limited test] procedure2.ProcedureTestingUtility(146): Set Kill before store update to: false 2018-02-21 04:53:32,278 DEBUG [Time-limited test] procedure2.ProcedureExecutor(866): Stored pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:32,285 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:32,286 DEBUG [PEWorker-1] locking.LockProcedure(312): LOCKED pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:32,303 DEBUG [Time-limited test] locking.TestLockProcedure(204): Proc id 14 acquired lock. 2018-02-21 04:53:32,298 INFO [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188814298 2018-02-21 04:53:33,303 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:33,304 DEBUG [Time-limited test] locking.TestLockProcedure(225): Proc id 14 : LOCKED. 2018-02-21 04:53:34,299 DEBUG [ProcExecTimeout] locking.LockProcedure(207): Timeout failure ProcedureEvent for pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE] 2018-02-21 04:53:34,299 DEBUG [ProcExecTimeout] locking.LockProcedure(210): Calling wake on ProcedureEvent for pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE] 2018-02-21 04:53:34,299 INFO [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188816299 2018-02-21 04:53:34,306 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:34,306 DEBUG [Time-limited test] locking.TestLockProcedure(225): Proc id 14 : LOCKED. 2018-02-21 04:53:34,766 WARN [HBase-Metrics2-1] impl.MetricsConfig(125): Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties 2018-02-21 04:53:36,299 DEBUG [ProcExecTimeout] locking.LockProcedure(207): Timeout failure ProcedureEvent for pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE] 2018-02-21 04:53:36,299 DEBUG [ProcExecTimeout] locking.LockProcedure(210): Calling wake on ProcedureEvent for pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE] 2018-02-21 04:53:36,299 INFO [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188818299 2018-02-21 04:53:38,299 DEBUG [ProcExecTimeout] locking.LockProcedure(207): Timeout failure ProcedureEvent for pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE] 2018-02-21 04:53:38,299 DEBUG [ProcExecTimeout] locking.LockProcedure(210): Calling wake on ProcedureEvent for pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE] 2018-02-21 04:53:38,310 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:38,310 INFO [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188820310 2018-02-21 04:53:38,311 WARN [Time-limited test] procedure2.ProcedureTestingUtility(146): Set Kill before store update to: false 2018-02-21 04:53:38,342 DEBUG [PEWorker-1] locking.LockProcedure(242): UNLOCKED pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE 2018-02-21 04:53:38,366 INFO [PEWorker-1] procedure2.ProcedureExecutor(1247): Finished pid=14, state=SUCCESS; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE in 6.0920sec 2018-02-21 04:53:38,628 INFO [Time-limited test] hbase.ResourceChecker(172): after: master.locking.TestLockProcedure#testTimeout Thread=209 (was 218), OpenFileDescriptor=851 (was 853), MaxFileDescriptor=60000 (was 60000), SystemLoadAverage=5045 (was 5075), ProcessCount=314 (was 312) - ProcessCount LEAK? -, AvailableMemoryMB=5071 (was 5373)
The problem is that, sleep 2 * HEARTBEAT_TIMEOUT is not safe. It is possible that after the first TIMEOUT the procedure has not been timeout yet and then the check for unlocked and the actual timeout processing will arrive at almost the same time. You can see the above log, we are going to mark the procedure as timeout at 04:53:38,299 and the heartbeat for checking unlock arrives at 04:53:38,310.
I think increase the wait time can solve the problem.
And the failure info
https://builds.apache.org/job/HBASE-Flaky-Tests-branch2.0/2430/testReport/junit/org.apache.hadoop.hbase.master.locking/TestLockProcedure/testTimeout/