Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Implemented
Description
The Storm Hive bolt picks a random bucket and writes data to it. Hive expects rows (tuples, in Storm terms) to be distributed across buckets according to Hive's hash distribution. Because the bolt writes to a random bucket, Hive optimizations that rely on bucketing can return incorrect results.
The solution is for the Storm Hive bolt to use Hive's bucket distribution information and put each row/tuple in the correct bucket. This relies on HIVE-11672.
This might require a shuffle within Storm.
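To illustrate what "correct bucket" means here, the following is a minimal sketch, not the bolt's or Hive's actual code. It assumes the bucket is derived by combining per-column hashes, masking off the sign bit, and taking the result modulo the bucket count. The class `BucketSketch` and the method `bucketFor` are hypothetical names, and Java's `hashCode()` is used only as a stand-in for Hive's type-specific hash functions, which differ for some types (strings, for example).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not part of Storm or Hive.
class BucketSketch {

    /**
     * Maps the values of a row's bucketing columns to a bucket index.
     * Assumption: Object.hashCode() stands in for Hive's per-type hash.
     */
    static int bucketFor(List<?> bucketColumnValues, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = 0;
        for (Object value : bucketColumnValues) {
            // Combine column hashes; a null column contributes 0.
            hash = 31 * hash + (value == null ? 0 : value.hashCode());
        }
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // The same key always maps to the same bucket, unlike a random pick.
        System.out.println(bucketFor(Arrays.asList(42), 8));
        System.out.println(bucketFor(Arrays.asList(42), 8));
        System.out.println(bucketFor(Arrays.asList(-1), 8));
    }
}
```

Because the mapping depends only on the bucketing column values, every tuple with the same key lands in the same bucket. That determinism is also what would let Storm group tuples by bucket before they reach the bolt, if a shuffle turns out to be needed.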
Issue Links
- is blocked by HIVE-11672: Hive Streaming API handles bucketing incorrectly (Resolved)