[IMPALA-6564] Queries randomly fail with "CANCELLED" due to a race with IssueInitialRanges() - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Not A Bug
Affects Version/s: Impala 2.12.0
Fix Version/s: None
Component/s: Backend
Labels:
- flaky

Target Version: Impala 3.0, Impala 2.12.0

Description

I've been chasing a flaky test that I saw in test_basic_runtime_filters when running against https://gerrit.cloudera.org/#/c/8966/ (the scanner buffer pool changes).

I think it is a latent bug that has started reproducing more frequently. What I've found is:

- Different queries fail with CANCELLED. Running impala-py.test tests/query_test/test_runtime_filters.py -n8 --verbose --maxfail 1 -k basic reproduces it on my branch in roughly 3 out of 4 runs. It happens with a variety of queries and file formats.
- It seems to happen when all files are pruned out by runtime filters.

Logging reveals that IssueInitialScanRanges() fails with a CANCELLED status, which propagates up to the query status:

  if (!initial_ranges_issued_) {
    // We do this in GetNext() to maximise the amount of work we can do while waiting for
    // runtime filters to show up. The scanner threads have already started (in Open()),
    // so we need to tell them there is work to do.
    // TODO: This is probably not worth the organisational cost of splitting
    // initialisation across two places. Move to before the scanner threads start.
    Status status = IssueInitialScanRanges(state);
    if (!status.ok()) {
      LOG(INFO) << runtime_state_->fragment_instance_id()
                << " IssueInitialRanges() failed with status: " << status.GetDetail()
                << " " << (void*) this;
    }

It appears that the CANCELLED comes from DiskIoMgr::AddScanRanges().

That function returned CANCELLED because a scanner thread had already noticed, at the point shown below, that the scan was complete and had cancelled the RequestContext:

    // Done with range and it completed successfully
    if (progress_.done()) {
      // All ranges are finished.  Indicate we are done.
      LOG(INFO) << runtime_state_->fragment_instance_id() << " All ranges done " << (void*) this;
      SetDone();
      break;
    }
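
To make the suspected ordering easier to follow, here is a minimal single-threaded sketch that replays the two steps in both orders. Everything in it is an assumption made for illustration: StatusCode, RequestContext, ScanNode and Replay() are hypothetical stand-ins, not Impala's real classes, and the real code runs these steps on separate threads with locking that is omitted here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical status type; Impala's real Status class is richer.
enum class StatusCode { OK, CANCELLED };

// Hypothetical stand-in for the I/O manager's per-reader request context.
class RequestContext {
 public:
  // After this call, any further attempt to add ranges is rejected.
  void Cancel() { cancelled_ = true; }

  // Queues the ranges unless the context has already been cancelled.
  StatusCode AddScanRanges(const std::vector<std::string>& ranges) {
    if (cancelled_) return StatusCode::CANCELLED;
    queued_.insert(queued_.end(), ranges.begin(), ranges.end());
    return StatusCode::OK;
  }

 private:
  bool cancelled_ = false;
  std::vector<std::string> queued_;
};

// Hypothetical scan node that owns one request context.
class ScanNode {
 public:
  explicit ScanNode(int remaining_ranges) : remaining_ranges_(remaining_ranges) {
    assert(remaining_ranges >= 0);
  }

  // Scanner-thread side: once no ranges remain, mark the scan done and
  // cancel the request context (modelled on the SetDone() call quoted above).
  void ScannerThreadCheckDone() {
    if (remaining_ranges_ == 0) {
      done_ = true;
      ctx_.Cancel();
    }
  }

  // GetNext() side: issue the initial ranges through the request context.
  StatusCode IssueInitialScanRanges(const std::vector<std::string>& ranges) {
    return ctx_.AddScanRanges(ranges);
  }

  bool done() const { return done_; }

 private:
  int remaining_ranges_;
  bool done_ = false;
  RequestContext ctx_;
};

// Replays the two steps in a fixed order and returns what the GetNext() side
// observes. Zero remaining ranges models "all files pruned by runtime filters".
StatusCode Replay(bool scanner_thread_first) {
  ScanNode node(0);
  if (scanner_thread_first) node.ScannerThreadCheckDone();
  StatusCode status = node.IssueInitialScanRanges({});
  if (!scanner_thread_first) node.ScannerThreadCheckDone();
  return status;
}
```

In this model, Replay(true) yields CANCELLED because the scanner-thread step cancels the context before the ranges are issued, while Replay(false) yields OK. Whether the real code paths line up exactly this way is not confirmed by the report; the sketch only mirrors the sequence described above.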

Issue Links

blocks

IMPALA-4835 HDFS scans should operate with a constrained number of I/O buffers

Resolved

People

Assignee: Tim Armstrong

Reporter: Tim Armstrong

Votes: 0

Watchers: 3

Dates

Created: 22/Feb/18 19:23

Updated: 03/May/18 15:28

Resolved: 22/Feb/18 21:22