Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
Kudu can read historical data, but snapshot reads are keyed by the timestamp that Kudu's transaction and MVCC system assigns to each write. Relying on this internal timestamp greatly limits usability.
In our use case, we write data to Kudu from a data stream, and the table is range-partitioned by day.
We want to read hourly versions of the data from Kudu, so we need to read historical data.
Historical versions are produced from the undo files. However, when a user supplies a timestamp, they mean the time at which the event happened, which is associated with the data, not the timestamp Kudu generated. So we need a way to pass the event timestamp into Kudu.
We eventually found a way to solve this problem, but our solution has two limitations:
- We only update the table one row at a time, and each row carries a single event timestamp.
- To get the correct historical version of the data, the data stream must send data in event-time order.
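To illustrate why in-order delivery matters for historical reads, here is a minimal sketch of a per-row version history keyed by event time. It is a toy in-memory model, not Kudu code and not our actual patch: the class `RowHistory` and its methods are hypothetical names, and rejecting out-of-order writes is an assumption made to keep the example simple.

```python
import bisect
from typing import Any, List, Optional


class RowHistory:
    """Toy model of one row's versions, ordered by event time (hypothetical)."""

    def __init__(self) -> None:
        self.times: List[int] = []   # event timestamps, kept ascending
        self.values: List[Any] = []  # value written at the matching timestamp

    def append(self, event_time: int, value: Any) -> None:
        # Assumption: the stream delivers versions in event-time order.
        # An older event arriving late would break the sorted history.
        if self.times and event_time < self.times[-1]:
            raise ValueError("event arrived out of event-time order")
        self.times.append(event_time)
        self.values.append(value)

    def read_as_of(self, event_time: int) -> Optional[Any]:
        # Return the latest version whose event time is <= the requested time,
        # or None if the row did not exist yet.
        i = bisect.bisect_right(self.times, event_time)
        return self.values[i - 1] if i else None
```

With versions appended at event times 10 and 20, `read_as_of(15)` returns the value written at 10, which is the kind of "hourly version" lookup described above.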
Despite these limitations, it satisfies our current business needs.
Our implementation also partly solves the problem of out-of-order event times when only the newest data is needed, since such a read does not touch the undo files.
For data sent into Kudu with event times t1 < t2:
t1 upsert -> t2 upsert -> the newest value is the t2 value
t2 upsert -> t1 upsert -> the current Kudu implementation returns the t1 value; our implementation returns the t2 value.
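The two orderings above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `ArrivalOrderTable` and `EventTimeTable` are hypothetical classes, not Kudu APIs, and the rule that ties on event time go to the later arrival is our own choice for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Cell:
    event_time: int
    value: Any


class ArrivalOrderTable:
    """Newest value is whatever arrived last (models the current behavior)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Cell] = {}

    def upsert(self, key: str, event_time: int, value: Any) -> None:
        # The event time is stored but plays no role in conflict resolution.
        self.rows[key] = Cell(event_time, value)


class EventTimeTable:
    """Newest value is the one with the largest event time (models the proposal)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Cell] = {}

    def upsert(self, key: str, event_time: int, value: Any) -> None:
        current = self.rows.get(key)
        # Keep the existing cell if the incoming write is older in event time.
        # Assumption: on equal event times, the later arrival wins.
        if current is None or event_time >= current.event_time:
            self.rows[key] = Cell(event_time, value)
```

Applying the t2 upsert followed by the t1 upsert to both tables leaves the t1 value in `ArrivalOrderTable` and the t2 value in `EventTimeTable`; applying them in event-time order leaves the t2 value in both.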
Our solution may not be the best one for this problem, but I think Kudu snapshot reads should support event time.
It does not cover every use case, but I hope it will be useful to the community in some cases.