[SOLR-6199] SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.7.3
Fix Version/s: None
Component/s: clients - java, SolrJ
Labels: None

Description

ManifoldCF has historically used Solr's extracting update handler to transmit binary documents to Solr. Recently, we added Tika processing of binary documents, and we wanted to send a character stream (of a length ManifoldCF does not limit) to Solr as the primary content field instead. Unfortunately, it appears that the SolrInputDocument model for receiving extracted content and metadata requires every field to be fully converted to a String object. This will eventually cause ManifoldCF to run out of memory, when multiple ManifoldCF threads all try to convert large documents to in-memory strings at the same time.

I looked into what would be needed to add streaming support to UpdateRequest and SolrInputDocument. Basically, it would become legal to set a field value that is a Reader or a Reader[]. This would be straightforward to implement, EXCEPT that SolrCloud apparently makes copies of UpdateRequest, and copying a Reader isn't going to work unless there is a solid backing object somewhere. Even then, I could have gotten this to work by using a temporary file for large streams, but SolrCloud gives no signal when it is done with its copies of UpdateRequest, so there is no place to free the backing storage.
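To make the temporary-file idea concrete, here is a minimal sketch using only the JDK. The class `SpooledCharStream` and its methods are hypothetical names for illustration; nothing like this exists in SolrJ, and the sketch does not touch UpdateRequest or SolrInputDocument. It spools a Reader to a temp file so that every copy can open its own fresh Reader, and it shows why a release signal is needed: someone has to call `close()` once the last copy is finished.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of SolrJ): a character stream backed by a temp file. */
final class SpooledCharStream implements AutoCloseable {
    private final Path backingFile;

    /** Drains the source Reader into a temp file using a fixed-size buffer. */
    SpooledCharStream(Reader source) throws IOException {
        backingFile = Files.createTempFile("spooled-field-", ".txt");
        // UTF-8 is an arbitrary choice for the on-disk form in this sketch.
        try (Writer out = Files.newBufferedWriter(backingFile, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // Do not leak the temp file if spooling fails part way.
            Files.deleteIfExists(backingFile);
            throw e;
        }
    }

    /** Each caller (for example, each copy of a request) gets an independent Reader. */
    Reader openReader() throws IOException {
        return Files.newBufferedReader(backingFile, StandardCharsets.UTF_8);
    }

    /** True once the backing file has been deleted. */
    boolean isReleased() {
        return !Files.exists(backingFile);
    }

    /** The "done" signal: frees the backing storage. Safe to call more than once. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(backingFile);
    }
}
```

Heap use stays bounded by the 8192-char buffer regardless of document size. The open question from above remains: without a callback from SolrCloud, the owner of the stream cannot know when calling `close()` is safe.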

If anyone knows a good way to do non-extracting updates without loading entire documents into memory, please let me know.

Issue Links

relates to

CONNECTORS-981 Solr Connector - classic Solrj SolrInputDocument support (Resolved)

People

Assignee: Unassigned

Reporter: Karl Wright

Votes: 1

Watchers: 5

Dates

Created: 25/Jun/14 07:59

Updated: 28/Oct/21 14:17