[OAK-9747] Download resume needs to handle hidden nodes - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: indexing
Labels: None

Description

We support resuming the download of documents from MongoDB for the indexing process. It works by saving the download state (the last downloaded document's _modified and _id) so that a resume, if needed, can start from that point. Documents are first kept in memory and then dumped to a file once memory usage reaches a certain threshold. The state is saved after every dump.
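The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the Oak implementation: the class `ResumableDownloadSketch`, its `Doc` type, and the use of a document count as a stand-in for the memory threshold are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: these class and method names are hypothetical, not Oak APIs. */
class ResumableDownloadSketch {

    /** A downloaded document, reduced to the two fields that form the resume key. */
    static final class Doc {
        final long modified;
        final String id;

        Doc(long modified, String id) {
            this.modified = modified;
            this.id = id;
        }
    }

    // Assumption: a document count stands in for the real memory-usage threshold.
    private final int dumpThreshold;
    private final List<Doc> inMemory = new ArrayList<>();
    private int dumpedCount;   // stand-in for the documents written to dump files
    private Doc savedState;    // null until the first dump has happened

    ResumableDownloadSketch(int dumpThreshold) {
        this.dumpThreshold = dumpThreshold;
    }

    /** Buffers a downloaded document and dumps once the threshold is reached. */
    void accept(Doc doc) {
        inMemory.add(doc);
        if (inMemory.size() >= dumpThreshold) {
            dump();
        }
    }

    /** Writes the buffer out, then records the last dumped document as the resume point. */
    void dump() {
        if (inMemory.isEmpty()) {
            return;
        }
        Doc last = inMemory.get(inMemory.size() - 1);
        dumpedCount += inMemory.size();
        inMemory.clear();
        savedState = last; // the state is saved only after a dump
    }

    Doc savedState() {
        return savedState;
    }

    int dumpedCount() {
        return dumpedCount;
    }
}
```

In this sketch a resume would restart from `savedState()`, so any document still sitting in the in-memory buffer at the time of a failure is downloaded again.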

However, not every document downloaded from MongoDB reaches this point, i.e. gets saved to disk. Some of those documents are filtered out, e.g. hidden nodes: https://github.com/apache/jackrabbit-oak/blob/24c54e500883c512e078275d1f85c2899404997c/oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/NodeStateEntryTraverser.java#L181

So if a download thread receives a long, uninterrupted run of such hidden nodes, that progress is not saved. If the download then fails and a retry happens, all of those hidden nodes are downloaded again.
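One possible way to address this, sketched under assumptions and not necessarily what the linked pull request does, is to also advance the saved state after a run of filtered documents. Everything here is hypothetical: the class `HiddenNodeProgressSketch`, the `hidden` flag on `Doc`, and the `skipSaveInterval` parameter. Documents are assumed to arrive in ascending (_modified, _id) order.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: these names are hypothetical, not Oak APIs. */
class HiddenNodeProgressSketch {

    static final class Doc {
        final long modified;
        final String id;
        final boolean hidden; // assumption: the filter decision is precomputed

        Doc(long modified, String id, boolean hidden) {
            this.modified = modified;
            this.id = id;
            this.hidden = hidden;
        }
    }

    private final int dumpThreshold;     // stand-in for the memory threshold
    private final int skipSaveInterval;  // save state after this many skipped docs
    private final List<Doc> inMemory = new ArrayList<>();
    private int dumpedCount;
    private int skippedSinceSave;
    private Doc savedState;

    HiddenNodeProgressSketch(int dumpThreshold, int skipSaveInterval) {
        this.dumpThreshold = dumpThreshold;
        this.skipSaveInterval = skipSaveInterval;
    }

    void accept(Doc doc) {
        if (doc.hidden) {
            skippedSinceSave++;
            if (skippedSinceSave >= skipSaveInterval) {
                // Dump buffered documents first, so that everything before
                // this position is either on disk or intentionally skipped.
                dumpBuffer();
                savedState = doc;
                skippedSinceSave = 0;
            }
            return;
        }
        inMemory.add(doc);
        if (inMemory.size() >= dumpThreshold) {
            dumpBuffer();
            savedState = doc; // doc is the last document that was dumped
            skippedSinceSave = 0;
        }
    }

    private void dumpBuffer() {
        dumpedCount += inMemory.size();
        inMemory.clear();
    }

    Doc savedState() {
        return savedState;
    }

    int dumpedCount() {
        return dumpedCount;
    }
}
```

Setting `skipSaveInterval` to `Integer.MAX_VALUE` reproduces the behaviour described in this issue: the state never moves past a run of hidden nodes. The dump before the state save matters, because saving a position ahead of documents that are still only in memory would lose them on resume.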

Issue Links

links to

GitHub Pull Request #536

People

Assignee: Thomas Mueller

Reporter: Thomas Mueller

Votes: 0

Watchers: 1

Dates

Created: 08/Apr/22 08:59

Updated: 19/Jan/23 12:49