Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Version: v1.6.0
Description
Problem
When the query server needs to handle millions of records, CubeTupleConverter can become a performance bottleneck.
An experiment shows that converting 5 million records takes ~11s, which accounts for 50% of the total query time.
Motivation
Records returned from each storage partition are guaranteed to be ordered. Therefore we could reduce the number of records passed to CubeTupleConverter by:
- merging sorted records from all partitions, similar to what we have done in KYLIN-1787
- using a stream aggregate algorithm on the merged stream to aggregate records with the same key
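To illustrate the second step: because the merged stream is sorted by key, equal keys are adjacent, so groups can be emitted as soon as the key changes, with O(1) extra memory. The sketch below is not the actual Kylin implementation; the class name, `long[]` record layout, and SUM-only aggregation are illustrative assumptions.

```java
import java.util.*;

// Hypothetical sketch of stream aggregation over a key-sorted input.
// Each record is a {key, value} pair; consecutive records with the same
// key are folded into one output record by summing their values.
public class StreamAggregateSketch {
    public static List<long[]> aggregate(List<long[]> sortedRecords) {
        List<long[]> out = new ArrayList<>();
        long curKey = 0, curSum = 0;
        boolean groupOpen = false;
        for (long[] rec : sortedRecords) {
            if (groupOpen && rec[0] != curKey) {
                // Key changed: the previous group is complete, emit it.
                out.add(new long[]{curKey, curSum});
                curSum = 0;
            }
            curKey = rec[0];
            curSum += rec[1];
            groupOpen = true;
        }
        if (groupOpen) out.add(new long[]{curKey, curSum});
        return out;
    }
}
```

Unlike a hash aggregate, this never materializes more than one group at a time, and its output is itself ordered and streamable.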
Proposal
- Add a new physical operator, GTStreamAggregateScanner, which implements the stream aggregate algorithm
- Refine SortedIteratorMergerWithLimit, which is used to merge-sort records from different partitions. The previous implementation had performance issues (KYLIN-2483) due to expensive record cloning
- Leverage GTStreamAggregateScanner to aggregate records on the merged stream
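The merge step above can be sketched with a k-way merge driven by a priority queue holding only each partition's current head. This is a simplified illustration, not the SortedIteratorMergerWithLimit code; using integer values and indexing by position (instead of cloning records) are assumptions made for brevity.

```java
import java.util.*;

// Hypothetical sketch: merge k sorted partitions into one sorted stream.
// The heap holds one entry per partition: {head value, partition index,
// position within that partition}. Only indices move, so no record is cloned.
public class MergeSortedPartitions {
    public static List<Integer> merge(List<List<Integer>> partitions) {
        PriorityQueue<int[]> heap =
            new PriorityQueue<int[]>((a, b) -> Integer.compare(a[0], b[0]));
        for (int p = 0; p < partitions.size(); p++) {
            if (!partitions.get(p).isEmpty()) {
                heap.add(new int[]{partitions.get(p).get(0), p, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            // Advance only the partition that produced the smallest head.
            int nextPos = top[2] + 1;
            List<Integer> part = partitions.get(top[1]);
            if (nextPos < part.size()) {
                heap.add(new int[]{part.get(nextPos), top[1], nextPos});
            }
        }
        return merged;
    }
}
```

Feeding this merged stream into the stream aggregator keeps the whole pipeline ordered end to end, which is what lets the aggregation run without buffering.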
Scope
Stream aggregation has some good properties, such as low memory usage and streamable, ordered output, making it better than hash- or sort-based alternatives when the input is already sorted. So I bet the new GTStreamAggregateScanner operator can also be used to accelerate cubing and coprocessor aggregation in certain cases. I'll focus on the query server in this JIRA and leave those optimizations as future work.