Goal

Read streaming data from a Kafka queue as an external table.
Allow streaming navigation by pushing down filters on Kafka record partition id, offset and timestamp.

Insert streaming data from Kafka into an actual Hive internal table, using a CTAS statement.
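
A minimal sketch of what such a CTAS statement could look like, assuming the kafka_table external table defined in the Example section of this description. The target table name wiki_internal and the one-hour window are hypothetical, chosen only for illustration.

```sql
-- Hypothetical target table name; kafka_table is the external table from the Example section.
CREATE TABLE wiki_internal AS
SELECT `timestamp`, page, `user`, added, deleted,
       `__partition`, `__offset`, `__timestamp`   -- keep the Kafka metadata for traceability
FROM kafka_table
-- Restrict the copy to records that reached Kafka during the last hour (assumed window).
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '1' hours);
```

Keeping the metadata columns in the internal table is a design choice rather than a requirement: it makes it possible to tell later which Kafka offsets have already been copied.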

Example

Create the external table

 
CREATE EXTERNAL TABLE kafka_table (
  `timestamp` timestamp, page string, `user` string, language string,
  added int, deleted int, flags string, comment string, namespace string)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "wikipedia",
  "kafka.bootstrap.servers" = "brokeraddress:9092",
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.JsonSerDe");

Kafka Metadata

To keep track of Kafka records, the storage handler automatically adds the Kafka row metadata to the table: partition id, record offset and record timestamp.

DESCRIBE EXTENDED kafka_table

timestamp              	timestamp           	from deserializer   
page                	string              	from deserializer   
user                	string              	from deserializer   
language            	string              	from deserializer   
country             	string              	from deserializer   
continent           	string              	from deserializer   
namespace           	string              	from deserializer   
newpage             	boolean             	from deserializer   
unpatrolled         	boolean             	from deserializer   
anonymous           	boolean             	from deserializer   
robot               	boolean             	from deserializer   
added               	int                 	from deserializer   
deleted             	int                 	from deserializer   
delta               	bigint              	from deserializer   
__partition         	int                 	from deserializer   
__offset            	bigint              	from deserializer   
__timestamp         	bigint              	from deserializer
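
The metadata columns can be selected like any other column. The sketch below assumes that `__timestamp` holds epoch milliseconds, which is what the timestamp example in this description implies (it multiplies a Unix timestamp by 1000); the column alias is illustrative.

```sql
-- Show where each row came from in the Kafka topic.
SELECT `__partition`,
       `__offset`,
       -- Assumption: __timestamp is in milliseconds, so convert to seconds first.
       from_unixtime(CAST(`__timestamp` / 1000 AS BIGINT)) AS kafka_time,
       page,
       `user`
FROM kafka_table
LIMIT 10;
```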

Filter push down.

Newer Kafka consumers (0.11.0 and higher) allow seeking on the stream to a given offset. The proposed storage handler leverages this API by pushing down filters over the metadata columns, namely __partition (int), __offset (long) and __timestamp (long).
For instance, a query like

 
select `__offset` from kafka_table where (`__offset` < 10 and `__offset`>3 and `__partition` = 0) or (`__partition` = 0 and `__offset` < 105 and `__offset` > 99) or (`__offset` = 109);

will result in a scan of partition 0 only, reading only the records between offsets 4 and 109.
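
To see where these bounds come from, the predicate can be read disjunct by disjunct. The sketch below annotates each one and reuses the same filter in an aggregate query, so that the partitions and offsets actually returned can be inspected. The aliases are illustrative, and how the handler merges the individual ranges internally is not specified here.

```sql
-- Disjunct 1: __partition = 0, 3 < __offset < 10     -> offsets 4..9
-- Disjunct 2: __partition = 0, 99 < __offset < 105   -> offsets 100..104
-- Disjunct 3: __offset = 109                         -> offset 109
-- Smallest start offset is 4 and largest end offset is 109, hence the range 4..109.
SELECT `__partition`,
       min(`__offset`) AS first_offset,
       max(`__offset`) AS last_offset,
       count(*)        AS records
FROM kafka_table
WHERE (`__offset` < 10 AND `__offset` > 3 AND `__partition` = 0)
   OR (`__partition` = 0 AND `__offset` < 105 AND `__offset` > 99)
   OR (`__offset` = 109)
GROUP BY `__partition`;
```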

With timestamp seeks

Seeking on the internal Kafka timestamps lets the handler run queries over recently arrived data only, for example:

select count(*) from kafka_table where `__timestamp` >  1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '20' hours) ;

This allows implicit relationships between event timestamps and Kafka timestamps to be expressed in queries (e.g. the event timestamp is always earlier than the Kafka __timestamp, and the Kafka __timestamp is never more than 15 minutes later than the event).
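
As a sketch of how such a relationship might be used, assume every event is written to Kafka at or after its event time. Then any row whose event `timestamp` falls in the last hour also has a `__timestamp` in the last hour, so adding the second predicate does not drop rows but gives the handler a bound it can seek on. The one-hour window and the alias are illustrative.

```sql
SELECT page, count(*) AS edits
FROM kafka_table
-- Business filter on the event time carried in the record payload.
WHERE `timestamp` > CURRENT_TIMESTAMP - interval '1' hours
  -- Redundant under the stated assumption, but can be pushed down as a Kafka seek.
  AND `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '1' hours)
GROUP BY page;
```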

More examples with Avro

CREATE EXTERNAL TABLE wiki_kafka_avro_table
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES
("kafka.topic" = "wiki_kafka_avro_table",
"kafka.bootstrap.servers"="localhost:9092",
"kafka.serde.class"="org.apache.hadoop.hive.serde2.avro.AvroSerDe",
'avro.schema.literal'='{
  "type" : "record",
  "name" : "Wikipedia",
  "namespace" : "org.apache.hive.kafka",
  "version": "1",
  "fields" : [ {
    "name" : "isrobot",
    "type" : "boolean"
  }, {
    "name" : "channel",
    "type" : "string"
  }, {
    "name" : "timestamp",
    "type" : "string"
  }, {
    "name" : "flags",
    "type" : "string"
  }, {
    "name" : "isunpatrolled",
    "type" : "boolean"
  }, {
    "name" : "page",
    "type" : "string"
  }, {
    "name" : "diffurl",
    "type" : "string"
  }, {
    "name" : "added",
    "type" : "long"
  }, {
    "name" : "comment",
    "type" : "string"
  }, {
    "name" : "commentlength",
    "type" : "long"
  }, {
    "name" : "isnew",
    "type" : "boolean"
  }, {
    "name" : "isminor",
    "type" : "boolean"
  }, {
    "name" : "delta",
    "type" : "long"
  }, {
    "name" : "isanonymous",
    "type" : "boolean"
  }, {
    "name" : "user",
    "type" : "string"
  }, {
    "name" : "deltabucket",
    "type" : "double"
  }, {
    "name" : "deleted",
    "type" : "long"
  }, {
    "name" : "namespace",
    "type" : "string"
  } ]
}'
);
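
Once created, the Avro-backed table can be queried like the JSON one. The following is a minimal sketch that uses only fields from the schema above and assumes the same metadata columns are exposed on this table; the aliases and the one-hour window are illustrative.

```sql
-- Edit volume per channel over the last hour, split by robot / non-robot edits.
SELECT channel,
       isrobot,
       count(*)   AS edits,
       sum(added) AS total_added
FROM wiki_kafka_avro_table
WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '1' hours)
GROUP BY channel, isrobot;
```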

Attachments

HIVE-20377.patch      13/Aug/18 20:04   160 kB   Slim Bouguerra
HIVE-20377.4.patch    13/Aug/18 22:35   162 kB   Slim Bouguerra
HIVE-20377.5.patch    14/Aug/18 00:36   192 kB   Slim Bouguerra
HIVE-20377.6.patch    14/Aug/18 20:46   196 kB   Slim Bouguerra
HIVE-20377.8.patch    15/Aug/18 16:55   197 kB   Slim Bouguerra
HIVE-20377.8.patch    15/Aug/18 23:20   197 kB   Slim Bouguerra
HIVE-20377.10.patch   17/Aug/18 23:20   199 kB   Slim Bouguerra
HIVE-20377.11.patch   18/Aug/18 03:36   200 kB   Slim Bouguerra
HIVE-20377.12.patch   20/Aug/18 15:30   265 kB   Slim Bouguerra
HIVE-20377.15.patch   23/Aug/18 01:18   271 kB   Slim Bouguerra
HIVE-20377.18.patch   28/Aug/18 20:48   272 kB   Slim Bouguerra
HIVE-20377.18.patch   29/Aug/18 02:21   272 kB   Slim Bouguerra
HIVE-20377.19.patch   29/Aug/18 14:55   272 kB   Slim Bouguerra
HIVE-20377.19.patch   30/Aug/18 16:26   272 kB   Slim Bouguerra
HIVE-20377.19.patch   04/Sep/18 19:25   272 kB   Slim Bouguerra

Issue Links

is related to: HIVE-20561 Use the position of the Kafka Consumer to track progress instead of Consumer Records offsets (Closed)

is required by: HIVE-20486 Kafka: Use Row SerDe + vectorization (Closed)

links to: Design Doc, RB Link

Sub-Tasks

There are no Sub-Tasks for this issue.

Hive Kafka Storage Handler
