Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 0.8.0, 0.9.0, 0.8.1, 0.10.0
- Fix Version/s: None
- Environment: 1 master node + 8 worker nodes (OpenJDK 7 + Hadoop 2.4.0 on Ubuntu 14.04 LTS)
Description
When executing a batch of queries (roughly more than 100) sequentially from my own JDBC batch program, Tajo frequently hangs during worker resource allocation with the notification 'No available worker resource', right after the previous query has completed successfully. For instance, the TajoMaster log shows:
2014-10-28 15:57:50,481 INFO org.apache.tajo.master.rm.TajoWorkerResourceManager: Release Resource: 0.5,512
...
2014-10-28 15:57:50,489 INFO org.apache.tajo.master.rm.TajoWorkerResourceManager: Release Resource: 0.5,512
2014-10-28 15:57:50,511 INFO org.apache.tajo.catalog.CatalogServer: relation "lw_nps_2013.table201_c23_relation_hpin" is added to the catalog (192.168.0.70:26005)
2014-10-28 15:57:50,512 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: allocateWorkerResources:q_1414475997452_0013, requiredMemory:512~512, requiredContainers:32, requiredDiskSlots:0.5~0.5, queryMasterRequest=false, liveWorkers=8
2014-10-28 15:57:50,512 INFO org.apache.tajo.master.querymaster.QueryInProgress: Received QueryMaster heartbeat:q_1414475997452_0013,state=QUERY_SUCCEEDED,progress=1.0, queryMaster=2.linewalks.local
2014-10-28 15:57:50,512 INFO org.apache.tajo.master.querymaster.QueryInProgress: =========================================================
2014-10-28 15:57:50,512 INFO org.apache.tajo.master.querymaster.QueryInProgress: Stop query:q_1414475997452_0013
2014-10-28 15:57:50,512 INFO org.apache.tajo.master.rm.TajoWorkerResourceManager: Release Resource: 0.0,512
2014-10-28 15:57:50,512 INFO org.apache.tajo.master.rm.TajoWorkerResourceManager: Released QueryMaster (q_1414475997452_0013) resource.
2014-10-28 15:57:50,512 INFO org.apache.tajo.master.querymaster.QueryInProgress: q_1414475997452_0013 QueryMaster stopped
2014-10-28 15:57:50,513 WARN org.apache.tajo.master.TajoAsyncDispatcher: Interrupted Exception while stopping
2014-10-28 15:57:50,513 INFO org.apache.tajo.master.querymaster.QueryJobManager: Stop QueryInProgress:q_1414475997452_0013
2014-10-28 15:57:50,513 INFO org.apache.tajo.master.TajoAsyncDispatcher: AsyncDispatcher stopped:QueryInProgress:q_1414475997452_0013
2014-10-28 15:57:50,829 INFO org.apache.tajo.master.GlobalEngine: Query: create table table201_c23_relation_valid_hpin as ...
2014-10-28 15:57:50,886 INFO org.apache.tajo.master.GlobalEngine: Non Optimized Query: ...
2014-10-28 15:57:50,888 INFO org.apache.tajo.master.GlobalEngine: =============================================
2014-10-28 15:57:50,888 INFO org.apache.tajo.master.GlobalEngine: Query is forwarded to :0
2014-10-28 15:57:50,888 INFO org.apache.tajo.master.TajoAsyncDispatcher: AsyncDispatcher started:QueryInProgress:q_1414475997452_0014
2014-10-28 15:57:50,889 INFO org.apache.tajo.master.querymaster.QueryInProgress: Initializing QueryInProgress for QueryID=q_1414475997452_0014
2014-10-28 15:57:50,889 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: allocateWorkerResources:q_1414475997452_0014, requiredMemory:512~512, requiredContainers:1, requiredDiskSlots:0.0~0.0, queryMasterRequest=false, liveWorkers=8
2014-10-28 15:57:50,889 INFO org.apache.tajo.master.querymaster.QueryInProgress: Connect to QueryMaster:5.linewalks.local/192.168.0.75:28093
2014-10-28 15:57:50,890 INFO org.apache.tajo.master.querymaster.QueryInProgress: Call executeQuery to :5.linewalks.local:28093,q_1414475997452_0014
2014-10-28 15:57:51,589 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: allocateWorkerResources:q_1414475997452_0014, requiredMemory:512~512, requiredContainers:10, requiredDiskSlots:0.5~0.5, queryMasterRequest=false, liveWorkers=8
2014-10-28 15:57:51,589 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: =========================================
2014-10-28 15:57:51,589 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: Available Workers
2014-10-28 15:57:51,589 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@2b45dc1a
2014-10-28 15:57:51,589 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@77754527
2014-10-28 15:57:51,589 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@340f40b6
2014-10-28 15:57:51,590 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@5561876d
2014-10-28 15:57:51,590 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@760230e5
2014-10-28 15:57:51,590 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@9d846ea
2014-10-28 15:57:51,590 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@5617b1b2
2014-10-28 15:57:51,590 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@2d934946
2014-10-28 15:57:51,590 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: =========================================
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: allocateWorkerResources:q_1414475997452_0014, requiredMemory:512~512, requiredContainers:10, requiredDiskSlots:0.5~0.5, queryMasterRequest=false, liveWorkers=8
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: =========================================
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: Available Workers
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@2b45dc1a
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@77754527
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@340f40b6
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@5561876d
2014-10-28 15:57:51,690 DEBUG org.apache.tajo.master.rm.TajoWorkerResourceManager: org.apache.tajo.master.rm.Worker@760230e5
... (the same log lines repeat continuously)
Meanwhile, the QueryMaster log shows the allocation request for the new query stalling:

2014-10-28 15:57:22,048 INFO org.apache.tajo.master.querymaster.QueryMasterManagerService: Receive executeQuery request:q_1414475997452_0014
2014-10-28 15:57:22,050 INFO org.apache.tajo.master.querymaster.QueryMaster: Start QueryStartEventHandler:q_1414475997452_0014
2014-10-28 15:57:22,091 INFO org.apache.tajo.master.querymaster.QueryMasterTask: The staging dir 'hdfs://0.linewalks.local:9000/tmp/tajo-hadoop/staging/q_1414475997452_0014' is created.
2014-10-28 15:57:22,398 INFO org.apache.tajo.engine.planner.global.GlobalPlanner: ...
2014-10-28 15:57:22,412 INFO org.apache.tajo.master.TajoAsyncDispatcher: AsyncDispatcher started:q_1414475997452_0014
2014-10-28 15:57:22,412 INFO org.apache.tajo.master.querymaster.Query: Processing q_1414475997452_0014 of type START
2014-10-28 15:57:22,417 INFO org.apache.tajo.master.querymaster.Query: q_1414475997452_0014 Query Transitioned from QUERY_NEW to QUERY_RUNNING
2014-10-28 15:57:22,421 INFO org.apache.tajo.master.querymaster.SubQuery: org.apache.tajo.master.DefaultTaskScheduler is chosen for the task scheduling for eb_1414475997452_0014_000003
2014-10-28 15:57:22,434 INFO org.apache.tajo.storage.StorageManager: Total input paths to process : 10
2014-10-28 15:57:22,498 INFO org.apache.tajo.storage.StorageManager: # of splits with volumeId 10
2014-10-28 15:57:22,498 INFO org.apache.tajo.storage.StorageManager: Total # of splits: 10
2014-10-28 15:57:22,504 INFO org.apache.tajo.storage.StorageManager: Total input paths to process : 48
2014-10-28 15:57:22,575 INFO org.apache.tajo.storage.StorageManager: # of splits with volumeId 48
2014-10-28 15:57:22,575 INFO org.apache.tajo.storage.StorageManager: Total # of splits: 48
2014-10-28 15:57:22,576 INFO org.apache.tajo.master.querymaster.Repartitioner: [Distributed Join Strategy] : Broadcast Join, base_table=lw_nps_2013.table201_individual_stats, base_volume=976581946
2014-10-28 15:57:22,580 INFO org.apache.tajo.storage.StorageManager: Total input paths to process : 10
2014-10-28 15:57:22,600 INFO org.apache.tajo.storage.StorageManager: # of splits with volumeId 10
2014-10-28 15:57:22,600 INFO org.apache.tajo.storage.StorageManager: Total # of splits: 10
2014-10-28 15:57:22,608 INFO org.apache.tajo.storage.StorageManager: Total input paths to process : 48
2014-10-28 15:57:22,689 INFO org.apache.tajo.storage.StorageManager: # of splits with volumeId 48
2014-10-28 15:57:22,689 INFO org.apache.tajo.storage.StorageManager: Total # of splits: 48
2014-10-28 15:57:22,707 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 2.linewalks.local to /linewalks/rack1
2014-10-28 15:57:22,712 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 3.linewalks.local to /linewalks/rack2
2014-10-28 15:57:22,715 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 0.linewalks.local to /linewalks/rack1
2014-10-28 15:57:22,718 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 1.linewalks.local to /linewalks/rack1
2014-10-28 15:57:22,718 INFO org.apache.tajo.master.querymaster.SubQuery: 10 objects are scheduled
2014-10-28 15:57:22,718 INFO org.apache.tajo.master.DefaultTaskScheduler: Start TaskScheduler
2014-10-28 15:57:22,718 INFO org.apache.tajo.worker.TajoResourceAllocator: CalculateNumberRequestContainer - Number of Tasks=10, Number of Cluster Slots=127
2014-10-28 15:57:22,723 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 4.linewalks.local to /linewalks/rack2
2014-10-28 15:57:22,723 INFO org.apache.tajo.master.querymaster.SubQuery: Request Container for eb_1414475997452_0014_000003 containers=10
2014-10-28 15:57:22,727 INFO org.apache.tajo.worker.TajoResourceAllocator: Start TajoWorkerAllocationThread
2014-10-28 15:57:25,734 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:28,735 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:31,735 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:34,736 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:37,736 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:40,736 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:43,737 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
2014-10-28 15:57:46,737 INFO org.apache.tajo.worker.TajoResourceAllocator: No available worker resource for eb_1414475997452_0014_000003
....
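For reference, the driving program is just a plain JDBC loop. A minimal sketch of what it does is below; the table and query names are hypothetical placeholders, and the master host, client port, and the standard Tajo JDBC driver class/URL scheme are assumptions on my part rather than copied from the actual program:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical repro driver: issues many CTAS queries back to back
// over a single JDBC connection until allocation hangs.
public class SequentialQueryRepro {
  public static void main(String[] args) throws Exception {
    // Assumed: standard Tajo JDBC driver class and URL scheme;
    // host/port/database are placeholders.
    Class.forName("org.apache.tajo.jdbc.TajoDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:tajo://0.linewalks.local:26002/lw_nps_2013");
         Statement stmt = conn.createStatement()) {
      for (int i = 0; i < 150; i++) {  // "about >100" queries
        // Hypothetical CTAS; eventually one of these never gets workers.
        stmt.executeUpdate("create table repro_" + i
            + " as select * from table201_individual_stats");
      }
    }
  }
}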
After investigating the symptom in DEBUG mode, I concluded that TajoWorkerResourceManager can sometimes fail to release the semaphore for the resourceRequest callback because of the if-else condition, specifically here:
// TajoWorkerResourceManager can't return allocated disk slots occasionally,
// because the rest of the resource request can remain after the QueryMaster stops.
// Thus we need to find out whether the QueryId has stopped or not.
if (!rmContext.getStoppedQueryIds().contains(resourceRequest.queryId)) {
  List<AllocatedWorkerResource> allocatedWorkerResources = chooseWorkers(resourceRequest);

  if (allocatedWorkerResources.size() > 0) {
    List<WorkerAllocatedResource> allocatedResources = new ArrayList<WorkerAllocatedResource>();
    ....
    //// The semaphore is currently released only here!
    resourceRequest.callBack.run(WorkerResourceAllocationResponse.newBuilder()
        .setQueryId(resourceRequest.request.getQueryId())
        .addAllWorkerAllocatedResource(allocatedResources)
        .build()
    );
  } else {
    if (LOG.isDebugEnabled()) {
      LOG.debug("=========================================");
      LOG.debug("Available Workers");
      ...
Does anybody know a solution to avoid this race condition? This problem has existed since 0.8.x, when I first started using Tajo.
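For illustration, the invariant I would expect is sketched below with self-contained stand-in classes (these are NOT Tajo's actual classes): the callback, and with it the semaphore release, should run on every path, so the waiting side can retry instead of blocking forever.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.Semaphore;

// Stand-in classes only: a caller blocks on a semaphore that is
// released solely by the allocation callback.
public class CallbackReleaseSketch {

  interface AllocationCallback {
    void run(List<String> allocatedWorkers);
  }

  // Buggy shape, mirroring the if/else quoted above: the callback runs
  // only when workers were found, so on the empty branch the waiter
  // blocks forever ("No available worker resource" repeats).
  static void allocateBuggy(List<String> available, AllocationCallback cb) {
    if (!available.isEmpty()) {
      cb.run(available);   // semaphore released here ...
    }                      // ... but never on the else path
  }

  // Fixed shape: the callback runs unconditionally; an empty allocation
  // tells the caller to retry rather than leaving it blocked.
  static void allocateFixed(List<String> available, AllocationCallback cb) {
    cb.run(available);     // empty list signals "no workers, retry later"
  }

  public static void main(String[] args) throws InterruptedException {
    final Semaphore done = new Semaphore(0);
    AllocationCallback cb = new AllocationCallback() {
      @Override public void run(List<String> workers) { done.release(); }
    };
    allocateFixed(Collections.<String>emptyList(), cb);
    done.acquire();  // returns promptly; with allocateBuggy() it would hang
  }
}

In TajoWorkerResourceManager terms, I would guess this means signaling the requester even when chooseWorkers() returns nothing, rather than only logging the available workers, though I have not verified what the actual fix looks like.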
Thanks a lot in advance for your support!
Issue Links
- is contained by: TAJO-1397 Resource allocation should be fine grained. (Resolved)