[OAK-7193] DataStore: API to retrieve statistic (file headers, size estimation) - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: blob
Labels: None

Description

This extends OAK-6254: in addition to retrieving the size, it would be good to retrieve the estimated number and total size of blobs per file type. A simple (and in my view sufficient) solution is to use the first few bytes of the content (the "magic number"; 2 bytes should be enough) to determine the file type. That would make it possible to estimate, for example, the number and total size of PDF files, JPEG files, Lucene index files, and so on. A histogram would be nice as well, but I don't think it is needed.
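
As a rough illustration of the idea, the following sketch classifies content by its first two bytes and keeps a count and total size per type. The class `MagicStats`, its methods, and the small magic-number table are hypothetical and not part of Oak. The table only lists a few well-known signatures; an entry for Lucene index files would still need to be added.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not an Oak API): statistics per file type, keyed by a 2-byte magic number. */
final class MagicStats {

    /** Running count and total size for one file type. */
    static final class Entry {
        long count;
        long totalSize;
    }

    private final Map<String, Entry> byType = new TreeMap<>();

    /** Returns the first two bytes as 4 lowercase hex digits, or "none" if fewer than 2 bytes are available. */
    static String magicOf(byte[] head) {
        if (head == null || head.length < 2) {
            return "none";
        }
        return String.format("%02x%02x", head[0] & 0xff, head[1] & 0xff);
    }

    /** Maps a magic number to a type name. The table is illustrative and incomplete. */
    static String typeOf(String magic) {
        switch (magic) {
            case "2550": return "pdf";   // "%P", the start of "%PDF"
            case "ffd8": return "jpeg";  // JPEG start-of-image marker
            case "8950": return "png";   // 0x89 followed by 'P'
            case "504b": return "zip";   // "PK"
            case "1f8b": return "gzip";
            default: return "other";
        }
    }

    /** Records one blob, given the start of its content and its length in bytes. */
    void add(byte[] head, long length) {
        Entry e = byType.computeIfAbsent(typeOf(magicOf(head)), k -> new Entry());
        e.count++;
        e.totalSize += length;
    }

    long count(String type) {
        Entry e = byType.get(type);
        return e == null ? 0 : e.count;
    }

    long totalSize(String type) {
        Entry e = byType.get(type);
        return e == null ? 0 : e.totalSize;
    }
}
```

Two bytes cannot distinguish every format (many container formats share the ZIP signature, for example), which is acceptable for a rough estimate.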

To speed up the calculation, the blob ID could be extended with the first 2 bytes of the file content, that is: <hash>#<length>@<magic>, where <magic> is the first two bytes in hex. The statistics could then be computed quickly from the blob IDs alone, with no need to actually read the content.
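
A minimal sketch of how such an extended ID could be built and parsed is shown below. The class `ExtendedBlobId` is hypothetical. Treating the `@<magic>` suffix as optional (so that existing `<hash>#<length>` IDs still parse, and content shorter than 2 bytes gets no suffix) is an assumption, not something decided in this issue.

```java
/** Hypothetical value class for IDs of the form <hash>#<length>@<magic>; not an Oak API. */
final class ExtendedBlobId {
    final String hash;
    final long length;
    final String magic; // 4 lowercase hex digits, or null if the ID has no @<magic> suffix

    private ExtendedBlobId(String hash, long length, String magic) {
        this.hash = hash;
        this.length = length;
        this.magic = magic;
    }

    /** Parses "<hash>#<length>" or "<hash>#<length>@<magic>". */
    static ExtendedBlobId parse(String id) {
        int hashEnd = id.indexOf('#');
        if (hashEnd < 0) {
            throw new IllegalArgumentException("Missing '#': " + id);
        }
        int at = id.indexOf('@', hashEnd + 1);
        String hash = id.substring(0, hashEnd);
        String len = at < 0 ? id.substring(hashEnd + 1) : id.substring(hashEnd + 1, at);
        String magic = at < 0 ? null : id.substring(at + 1);
        if (magic != null && !magic.matches("[0-9a-f]{4}")) {
            throw new IllegalArgumentException("Invalid magic: " + id);
        }
        return new ExtendedBlobId(hash, Long.parseLong(len), magic);
    }

    /** Builds an ID from the hash, the length, and the start of the content. */
    static String format(String hash, long length, byte[] head) {
        String id = hash + "#" + length;
        if (head == null || head.length < 2) {
            return id; // assumption: no suffix if fewer than 2 bytes are available
        }
        return id + "@" + String.format("%02x%02x", head[0] & 0xff, head[1] & 0xff);
    }
}
```

For example, `ExtendedBlobId.parse("abc123#1024@2550")` yields the hash `abc123`, the length 1024, and the magic `2550`, without touching the blob content.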

Sampling should be enough: the longer it runs, the more accurate the data. We could collect and store the data while doing datastore GC, in which case the returned data would be somewhat stale; that's OK.
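
The sampling approach could look roughly like the sketch below: statistics are collected for a subset of the blobs and scaled up to the whole datastore. The class `SampledTypeEstimator` is hypothetical. The sketch assumes the sample is drawn uniformly and that the total number of blobs is known (or estimated separately); how blobs are selected, for example while datastore GC iterates over the blob IDs, is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical estimator (not an Oak API): scales per-type statistics of a sample to the whole datastore. */
final class SampledTypeEstimator {
    // type -> {sampled count, sampled total size}
    private final Map<String, long[]> sampled = new HashMap<>();
    private long sampleSize;

    /** Records one sampled blob; more samples give a more accurate estimate. */
    void addSample(String type, long length) {
        long[] e = sampled.computeIfAbsent(type, k -> new long[2]);
        e[0]++;
        e[1] += length;
        sampleSize++;
    }

    /** Estimated number of blobs of the given type among totalBlobs blobs. */
    long estimateCount(String type, long totalBlobs) {
        return scale(type, 0, totalBlobs);
    }

    /** Estimated total size in bytes of the given type among totalBlobs blobs. */
    long estimateTotalSize(String type, long totalBlobs) {
        return scale(type, 1, totalBlobs);
    }

    // Assumes a uniform sample: multiply the sampled value by totalBlobs / sampleSize.
    private long scale(String type, int index, long totalBlobs) {
        long[] e = sampled.get(type);
        if (e == null || sampleSize == 0) {
            return 0;
        }
        return Math.round((double) e[index] * totalBlobs / sampleSize);
    }
}
```

The estimate can be read at any time and simply improves as more samples are added, which fits storing intermediate results during GC.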

Issue Links

relates to

OAK-6254 DataStore: API to retrieve approximate storage size (Open)

People

Assignee: Unassigned

Reporter: Thomas Mueller

Votes: 0

Watchers: 1

Dates

Created: 23/Jan/18 08:57

Updated: 26/May/21 14:52