[KUDU-2437] Split a tablet into primary key ranges by size - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.8.0
Component/s: client, tablet
Labels: None

Description

When reading a Kudu table from Spark, a tablet that holds a large amount of data takes a long time to read. The reason is that KuduRDD generates one scan token per tablet, so a single Spark task has to process all the data in that tablet.

We think the tablet server should provide an RPC interface that splits a tablet into multiple primary key ranges by size. The kudu-client can then choose, case by case, whether to scan those ranges in parallel.
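The difference between the current behavior and the proposal can be sketched as follows. This is an illustrative Python model only, not the kudu-client or kudu-spark API: `ScanTask`, `plan_tasks`, and the `split_fn` callback that stands in for the proposed RPC are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# (inclusive start key, exclusive stop key); None means unbounded.
KeyBounds = Tuple[Optional[bytes], Optional[bytes]]


@dataclass(frozen=True)
class ScanTask:
    """Hypothetical unit of work handed to one Spark task."""
    tablet_id: str
    start_primary_key: Optional[bytes]
    stop_primary_key: Optional[bytes]


def plan_tasks(
    tablet_ids: List[str],
    split_fn: Optional[Callable[[str], List[KeyBounds]]] = None,
) -> List[ScanTask]:
    """Build scan tasks, optionally splitting each tablet into key ranges.

    split_fn is an assumed stand-in for the proposed split RPC: given a
    tablet id, it returns that tablet's primary key ranges.
    """
    tasks: List[ScanTask] = []
    for tablet_id in tablet_ids:
        if split_fn is None:
            # Current behavior: one task scans the whole tablet.
            tasks.append(ScanTask(tablet_id, None, None))
        else:
            # Proposed behavior: one task per returned key range.
            for start, stop in split_fn(tablet_id):
                tasks.append(ScanTask(tablet_id, start, stop))
    return tasks
```

Calling `plan_tasks(["t1"])` yields a single task for the tablet, while passing a `split_fn` that returns two ranges yields two tasks that can run in parallel.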

RPC interface:

// A split key range request. Splits a tablet into key ranges; the request
// does not change the layout of the tablet.
message SplitKeyRangeRequestPB {
 required bytes tablet_id = 1;

 // Encoded primary key to begin scanning at (inclusive).
 optional bytes start_primary_key = 2 [(kudu.REDACT) = true];
 // Encoded primary key to stop scanning at (exclusive).
 optional bytes stop_primary_key = 3 [(kudu.REDACT) = true];

 // Number of bytes to try to return in each chunk. This is a hint.
 // The tablet server may return chunks larger or smaller than this value.
 optional uint64 target_chunk_size_bytes = 4;

 // The columns to consider when chunking.
 // If specified, then the size estimate used for 'target_chunk_size_bytes'
 // should only include these columns. This can be used if a query will
 // only scan a certain subset of the columns.
 repeated ColumnSchemaPB columns = 5;
}

// The primary key range of a Kudu tablet.
message KeyRangePB {
 // Encoded primary key to begin scanning at (inclusive).
 optional bytes start_primary_key = 1 [(kudu.REDACT) = true];
 // Encoded primary key to stop scanning at (exclusive).
 optional bytes stop_primary_key = 2 [(kudu.REDACT) = true];
 // Estimated number of bytes in this chunk.
 required uint64 size_bytes_estimates = 3;
}

message SplitKeyRangeResponsePB {
 // The error, if an error occurred with this request.
 optional TabletServerErrorPB error = 1;

 repeated KeyRangePB ranges = 2;
}
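To illustrate how the `ranges` in a response might be assembled, here is a minimal Python sketch of greedy size-based chunking. It is not Kudu's implementation: the `KeyRange` dataclass, the `split_key_range` function, and the in-memory list of `(first_key, size_bytes)` pairs standing in for the tablet's stored blocks are all assumptions made for illustration. As with `target_chunk_size_bytes` in the request, the target is only a hint, so chunks may come out larger or smaller.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KeyRange:
    """Hypothetical in-memory mirror of KeyRangePB."""
    start_primary_key: Optional[bytes]  # inclusive; None means unbounded
    stop_primary_key: Optional[bytes]   # exclusive; None means unbounded
    size_bytes_estimates: int


def split_key_range(
    blocks: List[Tuple[bytes, int]],
    target_chunk_size_bytes: int,
    start_primary_key: Optional[bytes] = None,
    stop_primary_key: Optional[bytes] = None,
) -> List[KeyRange]:
    """Greedily group blocks into contiguous key ranges.

    blocks is an assumed model of the tablet: (first_key, size_bytes)
    pairs sorted by first_key. The tablet itself is not modified.
    """
    # Simplification: a block is selected by its first key alone.
    selected = [
        (key, size)
        for key, size in blocks
        if (start_primary_key is None or key >= start_primary_key)
        and (stop_primary_key is None or key < stop_primary_key)
    ]
    ranges: List[KeyRange] = []
    chunk_start = start_primary_key
    chunk_size = 0
    for key, size in selected:
        # Close the current chunk once it has reached the target size.
        if chunk_size > 0 and chunk_size >= target_chunk_size_bytes:
            ranges.append(KeyRange(chunk_start, key, chunk_size))
            chunk_start, chunk_size = key, 0
        chunk_size += size
    # The last chunk runs to the requested stop key.
    ranges.append(KeyRange(chunk_start, stop_primary_key, chunk_size))
    return ranges
```

With blocks of 40, 70, 30, and 90 bytes starting at keys `a`, `f`, `m`, and `t` and a 100-byte target, this produces two ranges: everything before `m` (110 bytes) and everything from `m` onward (120 bytes). Each range's stop key equals the next range's start key, so the ranges cover the requested span without gaps or overlap.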

Issue Links

is duplicated by

KUDU-1686 Add API to split a scan token into smaller scans

Resolved

is related to

KUDU-2917 Split a tablet into primary key ranges by number of row

Open

relates to

IMPALA-9792 Split Kudu scan ranges into smaller chunks for greater parallelism

Open

People

Assignee: Xu Yao

Reporter: Xu Yao

Votes: 1

Watchers: 5

Dates

Created: 15/May/18 03:57

Updated: 03/Jun/20 15:37

Resolved: 18/Sep/18 17:04