Apache Trafodion (Retired) / TRAFODION-2070

Trafodion cannot adjust its working status in time when the network is broken.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0-incubating, 2.1-incubating
    • Fix Version/s: 2.3
    • Component/s: dtm
    • Labels: None

    Description

      Issue Title: Trafodion cannot adjust its working status in time when the network is broken.

      Test Steps (Part 1 and Part 2):
      Precondition: the test environment is healthy, including HDFS, HBase and EsgynDB.
      Part 1: The network is broken for a long time, limited here to 15 minutes.
      Step 0. Log in to a TRAFCI session and run a SQL statement ‘STMT_A’ (insert/delete/update/select) that has already been running for several minutes.
      Step 1. Use the command ‘iptables -I INPUT -s $NODE_HOST -j DROP’ to make the nap104 node’s network unreachable for 15 minutes.
      Step 2. Check Step 0, the major SQL commands, and the HDFS/HBase running status.
      The checks for Step 0 and the major SQL commands should be run on the nap101, nap102 and nap103 nodes.
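
      Steps 1 and 3 of this procedure add and later delete the same iptables rule. The sketch below shows one way to script that so the rule is always removed again. It is only an illustration: the helper names, the `dry_run` flag and the example address are hypothetical, only the two iptables command lines come from this report, and running them for real requires root on the node where the rule is applied.

```python
import subprocess
import time
from typing import List


def drop_rule(action: str, node_host: str) -> List[str]:
    """Build the iptables command from the report.

    action is "I" (insert the DROP rule) or "D" (delete it again).
    """
    if action not in ("I", "D"):
        raise ValueError("action must be 'I' or 'D'")
    return ["iptables", "-" + action, "INPUT", "-s", node_host, "-j", "DROP"]


def break_network(node_host: str, seconds: float, dry_run: bool = True) -> List[List[str]]:
    """Insert the DROP rule, wait, then delete it.

    With dry_run=True (the default) nothing is executed; the two commands
    that would be run are only returned.
    """
    commands = [drop_rule("I", node_host), drop_rule("D", node_host)]
    if dry_run:
        return commands
    subprocess.run(commands[0], check=True)
    try:
        time.sleep(seconds)
    finally:
        # Delete the rule even if the wait is interrupted.
        subprocess.run(commands[1], check=True)
    return commands


# Hypothetical address standing in for $NODE_HOST; 15 minutes as in Part 1.
PLANNED = break_network("192.0.2.14", seconds=15 * 60)
```

      Keeping the delete in a `finally` block means an interrupted test run should not leave the node isolated.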
      T-1
      Step 0 | Comments
      Expect:
      1. When TRAFCI is connected to the nap104 node, the SQL statement ‘STMT_A’ fails and TRAFCI exits normally.
      2. When TRAFCI is connected to a nap101/nap102/nap103 node, the SQL statement ‘STMT_A’ succeeds and TRAFCI exits normally.
      Actual, for expectation 1:
      ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ returns normally and displays:
      “SQL>get statistics for qid MXID11003025894212331445711895145000000000206U3333300_339_SQL_CUR_7;

          • ERROR[2024] Server Process $ZSM003 is not running or could not be created. Operating System Error 14 was returned. [2016-05-31 09:22:48]”
      Back in the TRAFCI interface, the SQL statement ‘STMT_A’ does not return in time but hangs for a long time.

      ISSUE 2: At the same time, open a new TRAFCI session connected to another node (for example nap102) and run a SQL query ‘STMT_B’. Checking the QID status of ‘STMT_B’ with the command ‘./offender -s active’ prints the following error and reports 0 row(s) selected:
      “*** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.

      --- 0 row(s) selected.”
      ISSUE 3: Back in the TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’, both sessions have been interrupted with the errors below.
      “SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;

          • ERROR[29157] There was a problem reading from the server
          • ERROR[29160] The message header was not long enough

      SQL>insert into josh_test_after values (1);

          • ERROR[29443] Database connection does not exist. Please connect to the database by using the connect or the reconnect command.”.

      For expectation 2:
      ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ returns normally and displays:
      “SQL>get statistics for qid
      MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5;+>

      Qid MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5
      Compile Start Time 2016/05/31 10:02:56.968052
      Compile End Time 2016/05/31 10:02:59.207231
      Compile Elapsed Time 0:00:02.239179
      Execute Start Time 2016/05/31 10:02:59.207468
      Execute End Time -1
      Execute Elapsed Time 0:05:46.731948
      State CLOSE”
      Back in the TRAFCI interface, the SQL statement ‘STMT_A’ does not return in time but hangs for a long time.
      ISSUE 2: At the same time, open a new TRAFCI session connected to another node (for example nap102) and run a SQL query ‘STMT_B’. Checking the QID status of ‘STMT_B’ with the command ‘./offender -s active’ prints the following error and reports 0 row(s) selected:
      “*** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.

      --- 0 row(s) selected.
      >>”
      Note: all TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’ closed normally.
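
      When collecting results like those in T-1, it can help to pull the error codes out of captured TRAFCI output so that runs can be compared. The following is a rough sketch, assuming the output has been saved as text and that errors appear in the `ERROR[nnnn] message` form quoted above; the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches the "ERROR[8921] message text" form seen in the quoted output.
ERROR_RE = re.compile(r"ERROR\[(\d+)\]\s*(.*)")


def extract_errors(output: str) -> List[Tuple[int, str]]:
    """Return (code, message) for every ERROR[...] line in the captured text."""
    errors = []
    for line in output.splitlines():
        match = ERROR_RE.search(line)
        if match:
            errors.append((int(match.group(1)), match.group(2).strip()))
    return errors


# Sample text copied from ISSUE 2 of this report.
SAMPLE = (
    "*** ERROR[8921] The request to obtain runtime statistics for "
    "ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.\n"
    "\n"
    "--- 0 row(s) selected.\n"
)
```

      Calling `extract_errors(SAMPLE)` yields a single entry with code 8921; lines without an `ERROR[...]` tag are ignored.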

      T-2
      Command: sqcheck | DTM Down | RMS Down | DCS Master Down | DCS Server Down | MxoSrvr Down | Comments
      Expect           | 1        | 2        | 0               | 1               | 4            | Return in 1 minute
      Actual           | 1        | 2        | 0               | 1               | 4            | Return in 1 minute

      T-3
      Command: dcscheck | DCS Master Down | DCS Server Down | MxoSrvr Down | Comments
      Expect            | 0               | 1               | 4            | Return in 1 minute
      Actual            | 0               | 1               | 4            | Return in about 5 minutes

      T-4
      Command: shell -c node info | Comments
      Expect                      | Only the nap104 node is down; returns in 30 seconds
      Actual                      | Only the nap104 node is down; returns in 10 seconds

      T-5
      Command: cstat | Comments
      Expect         | Return in 30 seconds
      Actual         | Return in about 3 minutes
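
      The "Return in N" expectations in T-2 through T-5 can be checked mechanically rather than by hand. The sketch below is one possible way, assuming each check is an ordinary command available on the PATH; `timed_run`, `TimedResult` and the limits passed in are hypothetical and not part of Trafodion.

```python
import subprocess
import time
from typing import NamedTuple, Optional, Sequence


class TimedResult(NamedTuple):
    elapsed: float              # wall-clock seconds the command took
    within_limit: bool          # True if it finished within the limit
    returncode: Optional[int]   # None if it was stopped at the hard timeout


def timed_run(cmd: Sequence[str], limit: float,
              hard_timeout: Optional[float] = None) -> TimedResult:
    """Run cmd and compare its duration against limit (in seconds).

    hard_timeout (default: 10 x limit) stops a hung command so that the
    check itself cannot hang.
    """
    if hard_timeout is None:
        hard_timeout = 10 * limit
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=hard_timeout)
        returncode = proc.returncode
    except subprocess.TimeoutExpired:
        returncode = None
    elapsed = time.monotonic() - start
    finished_in_time = returncode is not None and elapsed <= limit
    return TimedResult(elapsed, finished_in_time, returncode)
```

      For example, `timed_run(["cstat"], limit=30)` would report `within_limit=False` for the roughly 3 minute run recorded in T-5, provided `cstat` is on the PATH of the node where the check runs.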

      T-6
      Command: trafci | Comments
      Expect | 1. Login succeeds; returns within 1 minute whether login succeeds or fails.
             | 2. A new SQL statement ‘STMT_B’ runs successfully in TRAFCI, and TRAFCI exits normally.
      Actual | Case 1: ‘STMT_A’ is connected to nap104, and a new TRAFCI session is opened to execute ‘STMT_B’.
             | 1. Login succeeds; returns within 1 minute whether login succeeds or fails.
             | 2. The new SQL statement ‘STMT_B’ fails, and TRAFCI exits abnormally.
             | Case 2: ‘STMT_A’ is connected to another node (for example nap102), and a new TRAFCI session is opened to execute ‘STMT_B’.
             | 1. Login sometimes succeeds, with a varying return time, and sometimes fails because it hangs for a long time without printing any message such as a timeout notice.
             | 2. The new SQL statement ‘STMT_B’ succeeds, and TRAFCI exits normally.

      T-7
      HDFS   | Comments
      Expect | Only the nap104 data node is down, the other 3 data nodes are up, 1 name node is up; the Data Node Health Summary process reports a minor alert.
      Actual | Only the nap104 data node is down, the other 3 data nodes are up, 1 name node is up; the Data Node Health Summary process reports a minor alert.

      T-8
      HBase  | Comments
      Expect | 1 region server is down, the other 3 region servers are up, 1 HBase master is up; the RegionServer Health Summary process reports a minor alert.
      Actual | 1 region server is down, the other 3 region servers are up, 1 HBase master is up; the RegionServer Health Summary process reports a minor alert.

      Step 3. After 15 minutes, make the nap104 node’s network reachable again with the command ‘iptables -D INPUT -s $NODE_HOST -j DROP’.
      Step 4. Check Step 0, the major SQL commands, and the HDFS/HBase running status again.

      T-11
      Command: sqcheck      | DTM Down | RMS Down | DCS Master Down | DCS Server Down | MxoSrvr Down | Comments
      Expect on any node    | 0        | 0        | 0               | 0               | 0            | Return in 1 minute
      Actual on nap101 node | 1        | 2        | 0               | 1               | 4            | Return in 1 minute
      Actual on nap104 node | 3        | 6        | 0               | 1               | 16           | Return in 1 minute
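
      The T-11 rows can be compared column by column to list exactly which components did not recover. This is a small sketch only; the function name is hypothetical, and the counts are typed in from the table above rather than parsed from real sqcheck output, whose exact format this report does not show.

```python
from typing import Dict, Sequence, Tuple

# Column order of the "... Down" counts in the sqcheck tables of this report.
SQCHECK_COLUMNS = ("DTM", "RMS", "DCS Master", "DCS Server", "MxoSrvr")


def down_mismatches(expected: Sequence[int],
                    actual: Sequence[int]) -> Dict[str, Tuple[int, int]]:
    """Return {component: (expected_down, actual_down)} for columns that differ."""
    if len(expected) != len(SQCHECK_COLUMNS) or len(actual) != len(SQCHECK_COLUMNS):
        raise ValueError("expected one count per sqcheck column")
    return {
        name: (exp, act)
        for name, exp, act in zip(SQCHECK_COLUMNS, expected, actual)
        if exp != act
    }


# Counts copied from T-11.
EXPECTED = (0, 0, 0, 0, 0)
ACTUAL_NAP101 = (1, 2, 0, 1, 4)
ACTUAL_NAP104 = (3, 6, 0, 1, 16)
```

      For the nap101 row this flags DTM, RMS, DCS Server and MxoSrvr; DCS Master matches the expectation on both nodes.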

      T-12
      Command: dcscheck     | DCS Master Down | DCS Server Down | MxoSrvr Down | Comments
      Expect on any node    | 0               | 0               | 0            | Return in 1 minute
      Actual on nap101 node | 0               | 1               | 4            | Return in 1 minute
      Actual on nap104 node | 0               | 1               | 4            | Return in 1 minute

      T-13
      Command: shell -c node info | Comments
      Expect on any node          | 4 nodes up; returns in 30 seconds
      Actual on nap101 node       | Only nap104 is down; nap101, nap102 and nap103 are up; returns in 30 seconds
      Actual on nap104 node       | nap101, nap102 and nap103 are down; only nap104 is up; returns in 30 seconds
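
      In T-13 the two sides of the former outage disagree about which nodes are up. A check like the one below could flag that condition automatically. It is a sketch under the assumption that each node's view has already been collected into a set of node names; the function names are hypothetical, and the data is transcribed from T-13.

```python
from typing import Dict, FrozenSet, List, Set


def distinct_views(views: Dict[str, Set[str]]) -> Dict[FrozenSet[str], List[str]]:
    """Group reporting nodes by the set of nodes each one believes is up."""
    grouped: Dict[FrozenSet[str], List[str]] = {}
    for reporter, up_nodes in views.items():
        grouped.setdefault(frozenset(up_nodes), []).append(reporter)
    return grouped


def is_consistent(views: Dict[str, Set[str]]) -> bool:
    """True when every reporting node sees the same set of nodes as up."""
    return len(distinct_views(views)) <= 1


# Views transcribed from T-13: reporter -> nodes it reports as up.
T13_VIEWS = {
    "nap101": {"nap101", "nap102", "nap103"},
    "nap104": {"nap104"},
}
```

      With the T-13 data `is_consistent(T13_VIEWS)` is false, and the two groups returned by `distinct_views` correspond to the two sides that each consider the other down.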

      T-14
      Command: cstat     | Comments
      Expect on any node | Return in 30 seconds
      Actual on any node | Return in 30 seconds

      T-15
      Command: trafci    | Comments
      Expect on any node | 1. Login succeeds; returns within 1 minute whether login succeeds or fails.
                         | 2. A new SQL statement runs successfully in TRAFCI, and TRAFCI exits normally.
      Actual on any node | 1. Login succeeds; returns within 1 minute whether login succeeds or fails.
                         | 2. A new SQL statement runs successfully in TRAFCI, and TRAFCI exits normally.

      T-16
      HDFS   | Comments
      Expect | 4 data nodes up, 1 name node up, no alerts.
      Actual | 4 data nodes up, 1 name node up, no alerts.

      T-17
      HBase  | Comments
      Expect | 4 region servers up, 1 HBase master up, no alerts.
      Actual | The region server process on the nap104 node is down (CRITICAL MESSAGE: Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030). At the same time, the nap101 node (the HBase master node) reports the critical message “Dead RegionServer(s): 1 out of 3” from the RegionServer Health Summary process. 1 HBase master up.

      Part 2: The network is unstable: up for 1 minute, then down for 1 minute, repeatedly.
      Step 0. Log in to a TRAFCI session and run a SQL statement ‘STMT_A’ (insert/delete/update/select) that has already been running for several minutes.
      Step 1. Make the nap104 node’s network unstable, then check Step 0, the major SQL commands, and the HDFS and HBase running status.
      The checks for Step 0 and the major SQL commands should be run on the nap101, nap102 and nap103 nodes.
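
      The report does not say how the network was toggled in Part 2. One possible approach is to precompute the switch times and apply the Part 1 iptables commands at each one. The generator below is a sketch of such a schedule only; its name and the default 60 second periods are assumptions based on the "1 minute up, 1 minute down" description.

```python
from typing import Iterator, Tuple


def flap_schedule(cycles: int, up_seconds: int = 60,
                  down_seconds: int = 60) -> Iterator[Tuple[int, str]]:
    """Yield (offset_seconds, new_state) for each network state change.

    The network starts up, goes "down" after up_seconds, and comes back
    "up" after a further down_seconds; this repeats for the given cycles.
    """
    offset = 0
    for _ in range(cycles):
        offset += up_seconds
        yield offset, "down"   # here one would insert the DROP rule
        offset += down_seconds
        yield offset, "up"     # here one would delete the DROP rule
```

      For two cycles this yields state changes at 60, 120, 180 and 240 seconds, alternating between "down" and "up".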
      T-21
      Step 0 | Comments
      Expect:
      1. When TRAFCI is connected to the nap104 node, the SQL statement ‘STMT_A’ fails and TRAFCI exits normally.
      2. When TRAFCI is connected to a nap101/nap102/nap103 node, the SQL statement ‘STMT_A’ succeeds and TRAFCI exits normally.
      Actual, for expectation 1:
      ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI interface.
      SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;

          • ERROR[29157] There was a problem reading from the server
          • ERROR[29160] The message header was not long enough

      SQL>insert into josh_test values (1);

          • ERROR[29443] Database connection does not exist. Please connect to the database by using the connect or the reconnect command.

      ISSUE 2 (occasional): Opening a new TRAFCI session to run a new SQL statement ‘STMT_B’ produces the following errors.

          • ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
            java.util.concurrent.ExecutionException: java.io.IOException: performScan encountered Exception txID: 8591654594 Exception: org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint getScanner - scanner id 0, already closed?
            java.util.concurrent.FutureTask.report(FutureTask.java:122)
            java.util.concurrent.FutureTask.get(FutureTask.java:188)
            org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
            . [2016-05-31 12:13:05]

      For expectation 2:
      ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI interface.
      SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;

          • ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
            java.util.concurrent.ExecutionException: java.io.IOException: performScan encountered Exception txID: 4296737298 Exception: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: TrxRegionEndpoint coprocessor: getScanner - scanner id 0, Expected nextCallSeq: 50, But the nextCallSeq received from client: 49
            java.util.concurrent.FutureTask.report(FutureTask.java:122)
            java.util.concurrent.FutureTask.get(FutureTask.java:188)
            org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
            . [2016-05-31 11:30:36]
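
      The two ERROR[8448] traces in T-21 differ only in the underlying HBase exception, so it may be useful to extract the transaction ID and exception class when triaging many such failures. The sketch below assumes the cause line always contains the `txID: <number> Exception: <class>` fragment seen in both traces; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "txID: 8591654594 Exception: org.apache.hadoop.hbase.UnknownScannerException"
CAUSE_RE = re.compile(r"txID: (\d+) Exception: ([\w.$]+)")


def scan_failure(message: str) -> Optional[Tuple[int, str]]:
    """Return (transaction id, short exception class name), or None if absent."""
    match = CAUSE_RE.search(message)
    if match is None:
        return None
    short_name = match.group(2).rsplit(".", 1)[-1]
    return int(match.group(1)), short_name


# Cause line copied from ISSUE 2 of T-21.
SAMPLE_CAUSE = (
    "java.util.concurrent.ExecutionException: java.io.IOException: "
    "performScan encountered Exception txID: 8591654594 Exception: "
    "org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint "
    "getScanner - scanner id 0, already closed?"
)
```

      Applied to the two traces above, this separates the `UnknownScannerException` case from the `OutOfOrderScannerNextException` case without reading the full stack.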

      T-22
      Command: sqcheck               |        | DTM Down | RMS Down | DCS Master Down | DCS Server Down | MxoSrvr Down | Comments
      Network broken for 1 minute    | Expect | 1        | 0        | 0               | 1               | 4            | Return in 1 minute
                                     | Actual | 0        | 0        | 0               | 1               | 4            | Return in 1 minute
      Network recovered for 1 minute | Expect | 0        | 0        | 0               | 0               | 0            | Return in 2 minutes
                                     | Actual | 0        | 0        | 0               | 0               | 0            | Return in 2 minutes

      T-23
      Command: dcscheck              |        | DCS Master Down | DCS Server Down | MxoSrvr Down | Comments
      Network broken for 1 minute    | Expect | 0               | 1               | 4            | Return in 2 minutes
                                     | Actual | 0               | 1               | 4            | Return in 2 minutes
      Network recovered for 1 minute | Expect | 0               | 1               | 4            | Return in 2 minutes
                                     | Actual | 0               | 1               | 4            | Return in 2 minutes

      T-24
      Command: shell -c node info    |        | Comments
      Network broken for 1 minute    | Expect | Only the nap104 node is down, the other nodes are up; returns in 30 seconds
                                     | Actual | 4 nodes up; returns in 10 seconds
      Network recovered for 1 minute | Expect | 4 nodes up; returns in 10 seconds
                                     | Actual | 4 nodes up; returns in 10 seconds

      T-25
      Command: cstat                 |        | Comments
      Network broken for 1 minute    | Expect | Return in 30 seconds
                                     | Actual | Return in 30 seconds
      Network recovered for 1 minute | Expect | Return in 30 seconds
                                     | Actual | Return in 30 seconds

      T-26
      Command: trafci                |        | Comments
      Network broken for 1 minute    | Expect | Login succeeds; returns within 1 minute whether login succeeds or fails
                                     | Actual | Login succeeds; returns within 1 minute whether login succeeds or fails
      Network recovered for 1 minute | Expect | Login succeeds; returns within 1 minute whether login succeeds or fails
                                     | Actual | Login succeeds; returns within 1 minute whether login succeeds or fails

      T-27
      HDFS                           |        | Comments
      Network broken for 1 minute    | Expect | Only the nap104 data node is down, the other data nodes are up, 1 name node is up.
                                     | Actual | Only the nap104 data node is down, the other data nodes are up, 1 name node is up.
      Network recovered for 1 minute | Expect | 4 data nodes up, 1 name node up.
                                     | Actual | 4 data nodes up, 1 name node up.

      T-28
      HBase                          |        | Comments
      Network broken for 1 minute    | Expect | 1 region server is down, the other 3 region servers are up, 1 HBase master is up; the RegionServer Health Summary process reports a minor alert.
                                     | Actual | 1 region server is down, the other 3 region servers are up, 1 HBase master is up; the RegionServer Health Summary process reports a minor alert.
      Network recovered for 1 minute | Expect | 4 region servers up, 1 HBase master up, no alerts.
                                     | Actual | The region server process on the nap104 node is down (CRITICAL MESSAGE: Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030). At the same time, the nap101 node (the HBase master node) reports the critical message “Dead RegionServer(s): 1 out of 3” from the RegionServer Health Summary process. 1 HBase master up.

          People

            Assignee: liu ming (ovis_poly)
            Reporter: Jarek (bo.yu@esgyn.cn)