Description
This is an umbrella JIRA to track Hive streaming ingest improvements. At a high level, the improvements are:
- Support for dynamic partitioning
- API changes (simple streaming connection builder; see the sketch after this list)
- Hide transaction batches from clients (a client can tune the transaction batch size but does not have to know about it)
- Support automatic rollover to the next transaction batch (clients do not have to worry about closing a transaction batch and opening a new one)
- Make all record writers strict, meaning the record schema has to match the table schema. This avoids multiple serialization/deserialization passes to re-order columns when there is a schema mismatch
- Automatic distribution for non-bucketed tables so that the compactor can have more parallelism
- Create delta files with all ORC overhead disabled (no index, no compression, no dictionary). The compactor will recreate the ORC files with index, compression, and dictionary encoding.
- Automatic memory management via auto-flushing (yields smaller stripes for delta files but is more scalable, and clients do not have to worry about distributing the data across writers)
- Support for more writers (Avro specifically; ORC passthrough format?)
- Support accepting an input stream instead of a record byte[]
- Remove the HCatalog dependency (the old streaming API will remain in the hcatalog package for backward compatibility; the new streaming API will be in its own Hive module)
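
To make the intended client-facing simplification concrete, below is a minimal sketch of how the builder-style connection, strict record writer, and transparent transaction-batch handling could look. The class and method names used here (HiveStreamingConnection.newBuilder(), StrictDelimitedInputWriter, withTransactionBatchSize(), etc.) are illustrative assumptions for this proposal, not a finalized API.

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
// Package and class names below are assumptions for this sketch.
import org.apache.hive.streaming.HiveStreamingConnection;
import org.apache.hive.streaming.StrictDelimitedInputWriter;

public class StreamingIngestExample {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();

    // Strict writer: the record schema must match the table schema,
    // so no column re-ordering (and no extra ser/de pass) is needed.
    StrictDelimitedInputWriter writer = StrictDelimitedInputWriter.newBuilder()
        .withFieldDelimiter(',')
        .build();

    // One builder call instead of endpoint + transaction-batch bookkeeping;
    // with dynamic partitioning, no partition values are given up front.
    HiveStreamingConnection connection = HiveStreamingConnection.newBuilder()
        .withDatabase("default")
        .withTable("alerts")
        .withAgentInfo("example-agent-1")
        .withTransactionBatchSize(100)   // tunable, but clients need not set it
        .withRecordWriter(writer)
        .withHiveConf(conf)
        .connect();

    // Transaction batches stay hidden: when the current batch is exhausted,
    // the connection rolls over to the next batch automatically.
    connection.beginTransaction();
    connection.write("1,CA,2018-04-01".getBytes());
    connection.write("2,OR,2018-04-01".getBytes());
    connection.commitTransaction();

    connection.close();
  }
}
{code}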