IMPALA-6188: test_top_n_reclaim is flaky


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.11.0
    • Fix Version/s: Impala 2.11.0
    • Component/s: Infrastructure
    • Labels: ghx-label-3

    Description

      jbapple reported a test failure here: https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/607/

      03:16:32  TestTopNReclaimQuery.test_top_n_reclaim[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: text/none] 
      03:16:32 [gw10] linux2 -- Python 2.7.12 /home/ubuntu/Impala/bin/../infra/python/env/bin/python
      03:16:32 query_test/test_queries.py:246: in test_top_n_reclaim
      03:16:32     result = self.execute_query(self.QUERY, exec_options)
      03:16:32 common/impala_test_suite.py:512: in wrapper
      03:16:32     return function(*args, **kwargs)
      03:16:32 common/impala_test_suite.py:537: in execute_query
      03:16:32     return self.__execute_query(self.client, query, query_options)
      03:16:32 common/impala_test_suite.py:604: in __execute_query
      03:16:32     return impalad_client.execute(query, user=user)
      03:16:32 common/impala_connection.py:160: in execute
      03:16:32     return self.__beeswax_client.execute(sql_stmt, user=user)
      03:16:32 beeswax/impala_beeswax.py:173: in execute
      03:16:32     handle = self.__execute_query(query_string.strip(), user=user)
      03:16:32 beeswax/impala_beeswax.py:341: in __execute_query
      03:16:32     self.wait_for_completion(handle)
      03:16:32 beeswax/impala_beeswax.py:361: in wait_for_completion
      03:16:32     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
      03:16:32 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
      03:16:32 E    Query aborted:Memory limit exceeded
      03:16:32 ---------------------------- Captured stderr setup -----------------------------
      03:16:32 -- connecting to: localhost:21000
      03:16:32 ----------------------------- Captured stderr call -----------------------------
      03:16:32 SET batch_size=0;
      03:16:32 SET num_nodes=0;
      03:16:32 SET disable_codegen_rows_threshold=0;
      03:16:32 SET disable_codegen=False;
      03:16:32 SET abort_on_error=1;
      03:16:32 SET mem_limit=50m;
      03:16:32 SET exec_single_node_rows_threshold=0;
      03:16:32 -- executing against localhost:21000
      03:16:32 select * from tpch.lineitem order by l_orderkey desc limit 10;;
      

      I was able to reproduce something similar locally by running the test in a loop (a sketch of such a loop follows the output below):

      E   ImpalaBeeswaxException: ImpalaBeeswaxException:
      E    Query aborted:Memory limit exceeded: Failed to allocate memory in TopNNode::ReclaimTuplePool.
      E   SORT_NODE (id=1) could not allocate 190.00 B without exceeding limit.
      E   Error occurred on backend tarmstrong-box:22000 by fragment c641289b5b0652a4:9ce6652000000001
      E   Memory left in process limit: 8.15 GB
      E   Memory left in query limit: -2.95 MB
      E   Query(c641289b5b0652a4:9ce6652000000000): memory limit exceeded. Limit=50.00 MB Reservation=0 ReservationLimit=0 OtherMemory=52.95 MB Total=52.95 MB Peak=52.95 MB
      E     Fragment c641289b5b0652a4:9ce6652000000000: Reservation=0 OtherMemory=8.30 KB Total=8.30 KB Peak=232.50 KB
      E       EXCHANGE_NODE (id=2): Total=0 Peak=0
      E       DataStreamRecvr: Total=0 Peak=0
      E       PLAN_ROOT_SINK: Total=0 Peak=0
      E       CodeGen: Total=305.00 B Peak=224.50 KB
      E     Fragment c641289b5b0652a4:9ce6652000000001: Reservation=0 OtherMemory=52.94 MB Total=52.94 MB Peak=52.94 MB
      E       SORT_NODE (id=1): Total=702.00 KB Peak=706.00 KB
      E       HDFS_SCAN_NODE (id=0): Total=52.23 MB Peak=52.23 MB
      E       DataStreamSender (dst_id=2): Total=688.00 B Peak=688.00 B
      E       CodeGen: Total=23.94 KB Peak=1.64 MB
      
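      For reference, a minimal sketch of the kind of loop I mean (hypothetical code, not the actual repro: it drives the same query and the key options through the standalone impyla client rather than the py.test harness, and assumes a local impalad on the default HS2 port 21050):

        # Hypothetical repro loop using impyla (pip install impyla). The real repro
        # looped the py.test case; this just reruns the same query with the key options.
        from impala.dbapi import connect

        OPTIONS = {'MEM_LIMIT': '50m',
                   'NUM_NODES': '0',
                   'EXEC_SINGLE_NODE_ROWS_THRESHOLD': '0'}
        QUERY = 'select * from tpch.lineitem order by l_orderkey desc limit 10'

        cur = connect(host='localhost', port=21050).cursor()
        attempt = 0
        while True:
            attempt += 1
            # execute() raises on "Memory limit exceeded", which ends the loop.
            cur.execute(QUERY, configuration=OPTIONS)
            cur.fetchall()
            print('attempt %d ok' % attempt)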

      My current hypothesis is that the scan node spins up an extra scanner thread in the failure case: the breakdown above shows HDFS_SCAN_NODE holding 52.23 MB against the 50 MB query limit, so one extra thread's worth of scan buffers is enough to leave TopNNode::ReclaimTuplePool unable to allocate even the 190 bytes it needs to re-materialize the surviving rows into a fresh tuple pool.
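      One way to test that hypothesis (a sketch building on the loop above; I haven't verified this is the trigger) would be to pin the scan to a single scanner thread with the num_scanner_threads query option and see whether the failure still reproduces:

        # Hypothetical check: rerun with the scan pinned to one scanner thread.
        # If the OOM stops reproducing, the extra-scanner-thread theory gains weight.
        cur.execute(QUERY, configuration=dict(OPTIONS, NUM_SCANNER_THREADS='1'))
        cur.fetchall()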

    People

      Assignee: Tim Armstrong (tarmstrong)
      Reporter: Tim Armstrong (tarmstrong)
