Description
S3Guard was initially designed on the premise that a new MetadataStore would be the source of truth, and that it would provide no guarantees if updates were made without going through S3Guard.
I've been seeing increased demand for better support for scenarios where the data is modified by operations that can't reasonably go through S3Guard. For example:
- A file is deleted through S3Guard and then replaced by some other tool. S3Guard can't distinguish the new file from delete / list inconsistency, so it continues to treat the file as deleted.
- An S3Guard-ed file is overwritten with a longer file by some other tool. Reads then return only as many bytes as the original file's length.
We could possibly have smarter behavior here by querying both S3 and the MetadataStore (even in cases where getFileStatus may currently query only the MetadataStore) and using whichever one has the later modification time.
This kills the performance boost we currently get in some workloads from the short-circuited getFileStatus, but we could keep it in authoritative mode, which should give a larger performance boost. At least we'd get more correctness without authoritative mode, and a clear declaration of when we can make the assumptions required to short-circuit the process. If we can't consider S3Guard the source of truth, we need to defer to S3 more.
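To make the proposal concrete, here is a minimal sketch of the "latest modification time wins" rule with an authoritative-mode short-circuit. Everything in it is hypothetical: NewestWins, Status, and resolve are illustration-only names, not Hadoop or S3Guard APIs, and it assumes a tombstone records the time of the delete so it can be compared against S3's modification time.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are Hadoop or S3Guard APIs. */
final class NewestWins {

    /** Minimal stand-in for a file status, or a tombstone when deleted is true. */
    static final class Status {
        final long length;
        final long modTimeMillis; // epoch millis, as reported by S3
        final boolean deleted;    // true for a tombstone entry (assumed to carry a timestamp)

        Status(long length, long modTimeMillis, boolean deleted) {
            this.length = length;
            this.modTimeMillis = modTimeMillis;
            this.deleted = deleted;
        }
    }

    /**
     * Decide which view of a path to trust.
     *
     * @param fromStore     entry found in the metadata store, if any
     * @param s3Probe       lazy probe of S3; only invoked when S3 must be consulted
     * @param authoritative whether the store is declared the source of truth
     */
    static Optional<Status> resolve(Optional<Status> fromStore,
                                    Supplier<Optional<Status>> s3Probe,
                                    boolean authoritative) {
        if (authoritative && fromStore.isPresent()) {
            // Short-circuit: the store is trusted, so S3 is never queried.
            return fromStore.filter(s -> !s.deleted);
        }
        Optional<Status> fromS3 = s3Probe.get();
        Optional<Status> winner;
        if (!fromStore.isPresent()) {
            winner = fromS3;
        } else if (fromS3.isPresent()
                && fromS3.get().modTimeMillis > fromStore.get().modTimeMillis) {
            // S3 has something newer than the store knows about: defer to S3.
            winner = fromS3;
        } else {
            // S3 is missing or older (possibly an inconsistent view): keep the store entry.
            winner = fromStore;
        }
        // A winning tombstone means the path is treated as deleted.
        return winner.filter(s -> !s.deleted);
    }
}
```

Under these assumptions, a file re-created by another tool after an S3Guard delete wins over the tombstone because its modification time is later, while a stale pre-delete copy still visible in S3 loses to it. The cost is an extra S3 probe on every non-authoritative getFileStatus.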
We'd need to be extra sure of any locality / time zone issues if we start relying on mod_time more directly, but currently we're tracking the modification time as returned by S3 anyway.
Issue Links
- blocks
  - HADOOP-14936 S3Guard: remove "experimental" from documentation (Resolved)
- is related to
  - HADOOP-15779 S3guard: add inconsistency detection metrics (Resolved)
  - HADOOP-15780 S3Guard: document how to deal with non-S3Guard processes writing data to S3Guarded buckets (Resolved)
  - HADOOP-16184 S3Guard: Handle OOB deletions and creation of a file which has a tombstone marker (Resolved)
  - HADOOP-15489 S3Guard to self update on directory listings of S3 (Resolved)
- relates to
  - HADOOP-15625 S3A input stream to use etags/version number to detect changed source files (Resolved)
  - HADOOP-16185 S3Guard: Optimize performance of handling OOB operations in non-authoritative mode (Resolved)
- links to