  1. DistributedLog
  2. DL-145

Fix the flaky testServiceTimeout

Details

    • Test
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.4.0
    • None
    • distributedlog-service
    • None

    Description

      The TestDistributedLogService#testServiceTimeout test case is flaky; see e.g. https://builds.apache.org/job/distributedlog-precommit-pullrequest/22/com.twitter$distributedlog-service/testReport/com.twitter.distributedlog.service/TestDistributedLogService/testServiceTimeout/

      I can reproduce it occasionally on my box. The failure becomes consistent if I lower ServiceTimeoutMs from 200 to 150, and the test always passes if I raise it to a larger value, e.g. 1000 (btw, my disk is an SSD).

      After digging into it, the failure turns out to be related to a corner case in starting a new log segment.
      In the good case, once the service timeout occurs, the stream status goes ERROR -> CLOSING -> CLOSED, and calling Abortables.asyncAbort aborts the cached log segment, so the writeOp fails with an exception, e.g. a write cancelled exception.
      In the bad case, no log records have been written before, so a new log segment is started asynchronously. If the timeout occurs before the segment start has completed, nothing is cached yet, and asyncAbort has no chance to abort that segment.
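
      To make the ordering issue concrete, here is a minimal, self-contained sketch of the two cases. It is not DistributedLog code: the class SegmentAbortRace, the Segment type, and the methods onSegmentStarted and abortOnTimeout are hypothetical stand-ins, and the two orderings are forced by hand instead of relying on real timers. The only assumption carried over from the description is that the abort path can act only on a segment that has already been cached.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the race; none of these names come from DistributedLog.
class SegmentAbortRace {

    // A log segment holding a single pending write.
    static final class Segment {
        final CompletableFuture<Void> pendingWrite;

        Segment(CompletableFuture<Void> pendingWrite) {
            this.pendingWrite = pendingWrite;
        }

        // Aborting the segment fails the pending write.
        void abort() {
            pendingWrite.completeExceptionally(new IllegalStateException("write cancelled"));
        }
    }

    // The cache the abort path looks at; empty until the segment start finishes.
    private final AtomicReference<Segment> cachedSegment = new AtomicReference<>();

    // Invoked when the asynchronous "start new log segment" step completes.
    void onSegmentStarted(Segment segment) {
        cachedSegment.set(segment);
    }

    // Invoked on service timeout: it can only abort what is already cached.
    void abortOnTimeout() {
        Segment segment = cachedSegment.get();
        if (segment != null) {
            segment.abort();
        }
    }

    // Runs one ordering and reports whether the pending write was failed.
    static boolean writeFailedAfterTimeout(boolean segmentStartedBeforeTimeout) {
        SegmentAbortRace stream = new SegmentAbortRace();
        CompletableFuture<Void> write = new CompletableFuture<>();
        Segment segment = new Segment(write);
        if (segmentStartedBeforeTimeout) {
            stream.onSegmentStarted(segment); // good case: segment is cached first
            stream.abortOnTimeout();
        } else {
            stream.abortOnTimeout();          // bad case: nothing cached yet
            stream.onSegmentStarted(segment);
        }
        return write.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("good case, write failed: " + writeFailedAfterTimeout(true));
        System.out.println("bad case, write failed: " + writeFailedAfterTimeout(false));
    }
}
```

      In this sketch the write is failed only when the segment is cached before the timeout fires. In the other ordering the write is left pending, which matches the behaviour the test trips over when the timeout is short.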

      I think changing the test timeout to a larger value should be fine for this test-specific corner case.

      I will attach a minor patch later. Any suggestions are welcome.

            People

              xieliang007 Liang Xie
              xieliang007 Liang Xie
