Details
- New Feature
- Status: Resolved
- Minor
- Resolution: Fixed
- None
Description
Motivation
Iceberg's Spark integration lets users attach custom properties to a snapshot when writing data to an Iceberg table, by passing write options prefixed with "snapshot-property.", like so:
df.write
    .option("write-format", "avro")
    .option("snapshot-property.key", "value")
    .insertInto("catalog.db.table")
https://iceberg.apache.org/docs/latest/spark-configuration/#write-options
These properties can be used to add context to Iceberg snapshots and help users locate snapshots in recovery scenarios.
In fact, Spark automatically adds the application ID under the key spark.app.id.
Examples of when these properties might be useful include:
- Recording the data source used to produce the new records
- Recording the UUID of the flow file used to update the table, so the snapshot can be matched to NiFi provenance
The properties can be queried from Iceberg's snapshots metadata table, where they appear in the summary column.
Feature request
It would be great if we could configure PutIceberg to add these properties in a similar fashion (e.g. using dynamic properties of the form snapshot-property.*). Continuing with the comparison to Spark, it may also be worth automatically adding the flowfile UUID as something like nifi.flowfile.id.
Further details
I'm not entirely clued up on the Iceberg API, but it looks like these are set by calling set(property, value) on the SnapshotUpdate (AppendFiles extends this interface):
https://iceberg.apache.org/javadoc/master/org/apache/iceberg/SnapshotUpdate.html
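To make the request concrete, here is a rough sketch of how the dynamic properties could be turned into snapshot properties. It is not taken from PutIceberg: the class and method names are made up for illustration, nifi.flowfile.id is only the key suggested above, and the BiConsumer stands in for whatever does the actual write (with Iceberg this could presumably be a method reference such as appendFiles::set).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class SnapshotPropertySketch {

    // Prefix borrowed from the Spark write option described in this ticket.
    static final String PREFIX = "snapshot-property.";

    // Key suggested in this ticket for the flow file UUID (hypothetical, not an existing NiFi constant).
    static final String FLOWFILE_ID_KEY = "nifi.flowfile.id";

    /**
     * Collects every dynamic property whose name starts with PREFIX,
     * strips the prefix, and then adds the flow file UUID.
     */
    static Map<String, String> snapshotProperties(Map<String, String> dynamicProperties,
                                                  String flowFileUuid) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : dynamicProperties.entrySet()) {
            String name = entry.getKey();
            // Skip unrelated properties and a bare prefix with no key after it.
            if (name.startsWith(PREFIX) && name.length() > PREFIX.length()) {
                result.put(name.substring(PREFIX.length()), entry.getValue());
            }
        }
        result.put(FLOWFILE_ID_KEY, flowFileUuid);
        return result;
    }

    /** Hands each property to a setter, e.g. a stand-in for SnapshotUpdate.set(property, value). */
    static void apply(Map<String, String> properties, BiConsumer<String, String> setter) {
        properties.forEach(setter);
    }

    public static void main(String[] args) {
        Map<String, String> dynamic = new LinkedHashMap<>();
        dynamic.put("snapshot-property.source", "orders-feed");
        dynamic.put("unrelated", "ignored");

        // A plain map plays the role of the snapshot update here.
        Map<String, String> recorded = new LinkedHashMap<>();
        apply(snapshotProperties(dynamic, "hypothetical-uuid"), recorded::put);
        System.out.println(recorded);
    }
}
```

Keeping the extraction separate from the setter means the prefix handling can be checked without an Iceberg table, and the same map could be applied to any SnapshotUpdate before commit().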