HBase / HBASE-20978

[amv2] Worker terminating UNNATURALLY during MoveRegionProcedure

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 3.0.0-alpha-1, 2.2.0, 2.1.1, 2.0.2
    • Component/s: amv2
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Testing tip of branch-2.0, ran into this:

      2018-07-29 01:45:33,002 INFO  [master/ve0524:16000] master.HMaster: Master has completed initialization 13.854sec
      2018-07-29 01:45:33,003 INFO  [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure hri=533fb79ba23b27e9e0715b51daeb30c1, source=ve0538.halxg.cloudera.com,16020,1532847421672, destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on 533fb79ba23b27e9e0715b51daeb30c1
      2018-07-29 01:45:33,003 WARN  [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating UNNATURALLY null
      java.lang.IllegalArgumentException: pid=1820, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure hri=533fb79ba23b27e9e0715b51daeb30c1, source=ve0538.halxg.cloudera.com,16020,1532847421672, destination=ve0540.halxg.cloudera.com,16020,1532853151031
        at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
      

      It then shows as the below in the UI:

      
      Id: 1820
      Parent: (none)
      State: WAITING
      Owner: stack
      Type: MoveRegionProcedure hri=533fb79ba23b27e9e0715b51daeb30c1, source=ve0538.halxg.cloudera.com,16020,1532847421672, destination=ve0540.halxg.cloudera.com,16020,1532853151031
      Start Time: Sun Jul 29 01:33:37 PDT 2018
      Last Update: Sun Jul 29 01:33:38 PDT 2018
      Errors: (none)
      Parameters: [ { state => [ '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace => 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==', offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { hostName => 've0538.halxg.cloudera.com', port => '16020', startCode => '1532847421672' }, destinationServer => { hostName => 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' } } ]
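
      The namespace and qualifier values in the procedure parameters look like Base64-encoded strings. A minimal sketch for decoding them, assuming they are Base64 of UTF-8 text (ProcedureParamDecoder is a hypothetical helper written for this note, not an HBase class):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of HBase: decodes Base64 strings copied from the procedures UI.
final class ProcedureParamDecoder {

    // Assumes the value is Base64 of UTF-8 text; binary row keys will not decode to readable text.
    static String decode(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // namespace value from the parameters dump
        System.out.println(decode("ZGVmYXVsdA=="));
        // qualifier value from the parameters dump
        System.out.println(decode("SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA=="));
    }
}
```

      Under that assumption the two values decode to default and IntegrationTestBigLinkedList, so the stuck region would belong to table default:IntegrationTestBigLinkedList. The startKey and endKey values are probably raw row-key bytes and are not expected to decode to readable text.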
      

      This is what we'd just read from hbase:meta:

      2018-07-29 01:45:32,802 INFO  [master/ve0524:16000] assignment.RegionStateStore: Load hbase:meta entry region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED, lastHost=ve0538.halxg.cloudera.com,16020,1532847421672, regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672, openSeqNum=1544600
      

      Before this, we'd just logged this:

      2018-07-29 01:33:39,786 INFO [PEWorker-14] assignment.RegionStateStore: pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED

      Going back in history, we do the above each time the Master gets restarted so the region is offlined and never brought back online.

      It is failing here:

        private void execProcedure(final RootProcedureState procStack,
            final Procedure<TEnvironment> procedure) {
          Preconditions.checkArgument(procedure.getState() == ProcedureState.RUNNABLE,
              procedure.toString());
      

      It's the parent MoveRegionProcedure that is trying to run and failing. Is it not RUNNABLE because the subprocedure was 'done' but not fully?
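
      To make the failure mode concrete, here is a simplified model of the guard quoted above. It is a sketch only: the State enum, the checkArgument stand-in and the tryExec helper are hypothetical and are not the HBase or Guava implementations. It shows why the exception message in the log is just the procedure's toString(): the guard passes that string as the error message when the state is anything other than RUNNABLE.

```java
// Simplified model, NOT the HBase implementation: every name here is a hypothetical stand-in.
final class PreconditionSketch {

    enum State { RUNNABLE, WAITING }

    // Stand-in for a checkArgument(boolean, Object)-style precondition helper.
    static void checkArgument(boolean ok, Object message) {
        if (!ok) {
            throw new IllegalArgumentException(String.valueOf(message));
        }
    }

    // Same shape as the quoted guard: refuses to execute anything that is not RUNNABLE.
    static void execProcedure(String description, State state) {
        checkArgument(state == State.RUNNABLE, description);
        // A real executor would run the procedure step here.
    }

    // Returns the exception message if the guard fires, or null if it passes.
    static String tryExec(String description, State state) {
        try {
            execProcedure(description, state);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A parent still marked WAITING trips the guard; the message is its description.
        System.out.println(tryExec("pid=1820, state=WAITING", State.WAITING));
        // A RUNNABLE procedure passes the guard, so no message is returned.
        System.out.println(tryExec("pid=1820, state=RUNNABLE", State.RUNNABLE));
    }
}
```

      In this model nothing catches the exception inside the worker loop, which would be consistent with the "Worker terminating UNNATURALLY" line above; whether the real executor behaves the same way is an assumption here.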

      Attachments

        1. HBASE-20978.branch-2.0.001.patch
          1 kB
          Allan Yang
        2. HBASE-20978.branch-2.0.002.patch
          10 kB
          Michael Stack

            People

              Assignee: Allan Yang (allan163)
              Reporter: Michael Stack (stack)