Details
-
New Feature
-
Status: Open
-
Major
-
Resolution: Unresolved
-
2.2.0
-
None
-
None
Description
Some common data maintenance strategies, especially the Type 2 SCD update, would become substantially easier with a small extension to the SQL standard for MERGE, specifically the ability to say "when matched then insert". Per the standard, matched records can only be updated or deleted.
In the Type 2 SCD, when a new record comes in you update the old version of the record and insert the new version of the same record. If this extension were supported, sample Type 2 SCD code would look as follows:
merge into customer using new_customer_stage stage on stage.source_pk = customer.source_pk when not matched then insert values /* Insert a net new record */ (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null) when matched then update set /* Update an old record to mark it as out-of-date */ is_current = false, end_date = current_date() when matched then insert values /* Insert a new current record */ (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, true, null);
Without this support, the user needs to devise some sort of workaround. A common approach is to first left join the staging table against the table to be updated, then to join these results to a helper table that will spit out two records for each match and one record for each miss. One of the matching records needs to have a join key that can never occur in the source data so this requires precise knowledge of the source dataset.
An example of this:
merge into customer using ( select *, coalesce(invalid_key, source_pk) as join_key from ( select stage.source_pk, stage.name, stage.state, case when customer.source_pk is null then 1 when stage.name <> customer.name or stage.state <> customer.state then 2 else 0 end as scd_row_type from new_customer_stage stage left join customer on (stage.source_pk = customer.source_pk and customer.is_current = true) ) updates join scd_types on scd_types.type = scd_row_type ) sub on sub.join_key = customer.source_pk when matched then update set is_current = false, end_date = current_date() when not matched then insert values (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, null); select * from customer order by source_pk;
This code is very complicated and will fail if the "invalid" key ever shows up in the source dataset. This simple extension provides a lot of value and likely very little maintenance overhead.
/cc ekoifman
Attachments
Issue Links
- depends upon
-
HIVE-10924 add support for MERGE statement
- Resolved