Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
We implement download resume for documents fetched from MongoDB during the indexing process. It works by saving the download state (the last downloaded document's _modified and _id) so that a resume, if needed, can start from that point. Documents are first kept in memory and then dumped to a file once memory usage reaches a certain threshold. The state is saved after every dump.
However, not every document downloaded from MongoDB reaches the point of being saved to disk. Some are filtered out, e.g. hidden nodes: https://github.com/apache/jackrabbit-oak/blob/24c54e500883c512e078275d1f85c2899404997c/oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/NodeStateEntryTraverser.java#L181
So, if a download thread keeps receiving such hidden nodes continuously, that progress is not saved. If the download then fails and a retry happens, all of those hidden nodes are downloaded again.
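The behaviour described above, and one possible direction for improving it, can be sketched roughly as follows. This is an illustrative sketch only, not the Oak implementation: the class ResumeCheckpointSketch, its methods, and the trackFiltered flag are all hypothetical names, and a simple document count stands in for the memory threshold. With trackFiltered set to false, the state only advances when documents are dumped; with true, a long run of filtered documents also triggers a dump and state save.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these class or method names come from Oak.
class ResumeCheckpointSketch {

    /** Resume position: _modified and _id of the last downloaded document. */
    static final class DownloadState {
        final long modified;
        final String id;

        DownloadState(long modified, String id) {
            this.modified = modified;
            this.id = id;
        }
    }

    private final int threshold;          // assumption: a count stands in for the memory threshold
    private final boolean trackFiltered;  // false mimics the behaviour described in this issue
    private final List<String> buffer = new ArrayList<>();
    private int filteredSinceSave = 0;
    private DownloadState saved;          // null until the first state save

    ResumeCheckpointSketch(int threshold, boolean trackFiltered) {
        this.threshold = threshold;
        this.trackFiltered = trackFiltered;
    }

    /** Called once for every document received from the database. */
    void onDownloaded(long modified, String id, boolean hidden) {
        if (hidden) {
            // Filtered documents never reach the buffer, so they never trigger a dump on their own.
            filteredSinceSave++;
            if (trackFiltered && filteredSinceSave >= threshold) {
                dumpAndSave(new DownloadState(modified, id));
            }
            return;
        }
        buffer.add(id);
        if (buffer.size() >= threshold) {
            dumpAndSave(new DownloadState(modified, id));
        }
    }

    private void dumpAndSave(DownloadState state) {
        // Buffered documents are flushed before the state moves forward,
        // so a resume never skips documents that were still only in memory.
        buffer.clear();        // stand-in for writing the buffered documents to file
        saved = state;         // stand-in for persisting the resume state
        filteredSinceSave = 0;
    }

    DownloadState savedState() {
        return saved;
    }
}
```

In this sketch, feeding five consecutive hidden documents with a threshold of 3 leaves the saved state empty when trackFiltered is false, whereas with trackFiltered enabled the state is saved at the third hidden document, so a retry would re-download only the last two.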