KYLIN-4689: Deadlock in Kylin job execution


Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: v3.0.0, v3.1.0, v3.0.1, v3.0.2
    • Fix Version/s: None
    • Component/s: Job Engine
    • Labels: None

    Description

      Reproduction steps

      • Install Kylin 3.1.0
      • Deploy a streaming cube
      • Enable the cube while historical data is already present in the Kafka topic (a REST sketch of this step follows the list)
      • Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly segments from Kafka when the cubes were enabled
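
      For reference, the "enable" step can also be triggered through the REST API instead of the web UI. The snippet below is only an illustrative sketch: the host, credentials and cube name are placeholders, and it assumes the standard PUT /kylin/api/cubes/{cubeName}/enable endpoint of Kylin 3.x.

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.util.Base64;

      public class EnableCube {
          public static void main(String[] args) throws Exception {
              // Placeholder credentials and cube name -- adjust for the actual deployment.
              String auth = Base64.getEncoder().encodeToString("ADMIN:KYLIN".getBytes());
              HttpRequest request = HttpRequest.newBuilder()
                      .uri(URI.create("http://localhost:7070/kylin/api/cubes/cube_vm/enable"))
                      .header("Authorization", "Basic " + auth)
                      .PUT(HttpRequest.BodyPublishers.noBody())
                      .build();
              HttpResponse<String> response = HttpClient.newHttpClient()
                      .send(request, HttpResponse.BodyHandlers.ofString());
              // HTTP 200 means the cube switched to READY and the scheduler may start picking up segments.
              System.out.println(response.statusCode() + " " + response.body());
          }
      }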

      Expected result

      • Kylin starts processing the stream segments with stream jobs, eventually working through the older segments and catching up with the stream

      Actual result

      • A short time after the stream jobs had started (37 successful stream jobs), all jobs became completely stuck without any progress: some in running state, some in pending state.
      • The following logs are continuously written:
      2020-08-06 06:16:22 INFO  [Scheduler 116797841 Job 12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is true,will try after one minute
      2020-08-06 06:16:33 WARN  [FetcherRunner 787667774-43] FetcherRunner:56 - There are too many jobs running, Job Fetch will wait until next schedule time
      
      • Zookeeper indicates the following locks are in place:
      ls /kylin/kylin_metadata/cube_job_ephemeral_lock
      [cube_cm, cube_vm, cube_jm]
      ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm 
      []
      ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
      []
      ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
      []
      ls /kylin/kylin_metadata/cube_job_lock
      [cube_cm, cube_vm, cube_jm]
      ls /kylin/kylin_metadata/cube_job_lock/cube_cm
      [f888380e-9ff4-98f5-2df4-1ae71e045f93]
      ls /kylin/kylin_metadata/cube_job_lock/cube_vm
      [fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
      ls /kylin/kylin_metadata/cube_job_lock/cube_jm
      [d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
      
      • The job IDs for the running jobs:
        • 169f75fa-a02f-221b-fc48-037bc7a842d0
        • 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
        • 00924699-8b51-8091-6e71-34ccfeba3a98
        • 4620192a-71e1-16dd-3b05-44d7f9144ad4
        • 416355c2-a3d7-57eb-55c6-c042aa256510
        • 12750aea-3b96-c817-64e8-bf893d8c120f
        • 42819dde-5857-fd6b-b075-439952f47140
        • 00128937-bd4a-d6c1-7a4e-744dee946f67
        • 46a0233f-217e-9155-725b-c815ad77ba2c
        • 062150ba-bacd-6644-4801-3a51b260d1c5

      As you can see, the 10 jobs that are actually in running state do not hold the locks and therefore cannot make any progress (all of them were stuck at the step Build Dimension Dictionaries For Streaming Job). On the other hand, the 3 jobs that do hold the locks cannot be scheduled because there are already 10 jobs in running state, so they can neither proceed nor release their locks. This is a deadlock and the cluster is completely stuck.
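
      To make this easy to verify on an affected cluster, here is a rough diagnostic sketch (not Kylin code) that reads the lock holders from the ZooKeeper paths shown above and checks whether any of them is among the jobs in running state. The connection string is a placeholder, the job-ID list is simply copied from this report, and Apache Curator is assumed to be on the classpath.

      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Set;

      import org.apache.curator.framework.CuratorFramework;
      import org.apache.curator.framework.CuratorFrameworkFactory;
      import org.apache.curator.retry.ExponentialBackoffRetry;

      public class CheckJobLockDeadlock {
          public static void main(String[] args) throws Exception {
              // Placeholder connection string -- point it at the ZooKeeper quorum Kylin uses.
              CuratorFramework zk = CuratorFrameworkFactory
                      .newClient("localhost:2181", new ExponentialBackoffRetry(1000, 3));
              zk.start();

              // Job IDs reported as running in the Kylin job list (copied from the list above).
              Set<String> runningJobs = new HashSet<>(Arrays.asList(
                      "169f75fa-a02f-221b-fc48-037bc7a842d0",
                      "12750aea-3b96-c817-64e8-bf893d8c120f" /* ... plus the other 8 IDs above */));

              // Each child of a per-cube lock path is the ID of the job holding that lock.
              for (String cube : Arrays.asList("cube_cm", "cube_vm", "cube_jm")) {
                  List<String> holders = zk.getChildren()
                          .forPath("/kylin/kylin_metadata/cube_job_lock/" + cube);
                  for (String holder : holders) {
                      // If no holder is in the running set, the running jobs wait for the locks
                      // while the lock holders wait for a free execution slot: a deadlock.
                      System.out.println(cube + " lock held by " + holder
                              + ", holder is running: " + runningJobs.contains(holder));
                  }
              }
              zk.close();
          }
      }

      In our case this check prints "holder is running: false" for all three cubes. The 10-job ceiling that keeps the lock holders from being scheduled presumably corresponds to the scheduler's concurrent job limit (kylin.job.max-concurrent-jobs, which defaults to 10 as far as I know); raising it would only postpone this deadlock, not prevent it.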

      We first observed this behavior in 3.0.0 (where rolling back https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now we see it in 3.1.0 as well. It was originally reported in the comments of https://issues.apache.org/jira/browse/KYLIN-4348, but I am not sure it is related to that bug/epic.

            People

              Assignee: Xiaoxiang Yu (xxyu)
              Reporter: Gabor Arki (arkigabor)
              Votes: 0
              Watchers: 4
