Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 4.7.3
- None
- None
Description
ManifoldCF has historically used Solr's extracting update handler to transmit binary documents to Solr. Recently, we added Tika processing of binary documents, and we would now like to send a character stream (of a length not limited by ManifoldCF) to Solr as the primary content field instead. Unfortunately, it appears that the SolrInputDocument model for receiving extracted content and metadata requires all fields to be fully converted to String objects. This will eventually cause ManifoldCF to run out of memory when multiple ManifoldCF threads try to convert large documents to in-memory strings at the same time.
I looked into what would be needed to add streaming support to UpdateRequest and SolrInputDocument. Basically, it would become legal to set a field value that is a Reader or a Reader[]. This would be straightforward to implement, EXCEPT that SolrCloud apparently makes copies of UpdateRequest, and copying a Reader isn't going to work unless there is a solid backing object somewhere. Even then, I could have gotten this to work by using a temporary file for large streams, but there is no signal from SolrCloud when it is done with its copies of UpdateRequest, so there is no place to free any backing storage.
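To make the temporary-file idea concrete, here is a minimal sketch in plain Java. The class SpooledCharSource and its methods are hypothetical names invented for illustration; they are not part of SolrJ or ManifoldCF, and nothing here touches UpdateRequest or SolrInputDocument. The sketch only shows how a one-shot Reader could be spooled to disk so that each copy can open its own fresh Reader without holding the document in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not a SolrJ or ManifoldCF class): spools a one-shot
 * Reader to a temporary file so the content can be re-read any number of
 * times using only a small fixed-size buffer.
 */
final class SpooledCharSource implements AutoCloseable {
    private final Path tempFile;

    /** Drains the given Reader into a temp file, 8192 chars at a time. */
    SpooledCharSource(Reader source) throws IOException {
        tempFile = Files.createTempFile("spooled-content-", ".txt");
        try (Writer out = Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // Do not leak the temp file if spooling fails part-way.
            Files.deleteIfExists(tempFile);
            throw e;
        }
    }

    /** Opens an independent Reader over the spooled content; one per copy. */
    Reader openReader() throws IOException {
        return Files.newBufferedReader(tempFile, StandardCharsets.UTF_8);
    }

    /**
     * Frees the backing storage. All Readers from openReader() should be
     * closed first; some platforms refuse to delete a file that is open.
     */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }
}
```

This covers only the copyable backing object. It does not solve the problem described above: something still has to call close(), and without a signal that every copy of the request is finished, there is no safe point at which to do so.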
If anyone knows a good way to do non-extracting updates without loading entire documents into memory, please let me know.
Issue Links
- relates to: CONNECTORS-981 Solr Connector - classic Solrj SolrInputDocument support (Resolved)