Solr / SOLR-6199

SolrJ, using SolrInputDocument methods, requires entire document to be loaded into memory

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.7.3
    • Fix Version/s: None
    • Component/s: clients - java, SolrJ
    • Labels: None

    Description

      ManifoldCF has historically used Solr's extracting update handler to transmit binary documents to Solr. Recently, we added Tika processing of binary documents to ManifoldCF, and we would now like to send a character stream (of a length that ManifoldCF does not limit) to Solr as the primary content field instead. Unfortunately, it appears that the SolrInputDocument metaphor for receiving extracted content and metadata requires every field to be completely converted to a String object. ManifoldCF will therefore eventually run out of memory when multiple ManifoldCF threads all try to convert large documents to in-memory strings at the same time.
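
      To make the memory concern concrete, here is a minimal sketch using only java.io, not SolrJ. The class and method names (StreamingVsString, readFully, copy) are hypothetical and chosen for illustration. It contrasts materializing a character stream as a String, which is what a String-valued field needs, with copying the same stream through a fixed-size buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical illustration; not part of SolrJ or ManifoldCF.
final class StreamingVsString {

    // Materializes the whole stream in memory. Heap use grows with the
    // size of the document, which is the situation described above.
    static String readFully(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Copies the stream through a fixed buffer. Heap use is bounded by the
    // buffer size, regardless of document length. Returns the char count.
    static long copy(Reader in, Writer out) throws IOException {
        char[] buf = new char[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter is used here only so the demo is self-contained;
        // a real streaming client would write to the network instead.
        StringWriter sink = new StringWriter();
        long chars = copy(new StringReader("extracted text"), sink);
        System.out.println(chars + " chars streamed");
    }
}
```

      With several worker threads each calling something like readFully on a large document, peak heap use is roughly the sum of the document sizes, whereas the copy approach needs only one buffer per thread.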

      I looked into what would be needed to add streaming support to UpdateRequest and SolrInputDocument. Basically, it would become legal to set a field value to a Reader or a Reader[]. This would be straightforward to implement, EXCEPT for the fact that SolrCloud apparently makes copies of UpdateRequest, and copying a Reader isn't going to work unless there's a solid backing object somewhere. Even then, I could have gotten this to work by using a temporary file for large streams, but SolrCloud gives no signal when it is done with its copies of UpdateRequest, so there's no place to free any backing storage.
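
      The temporary-file idea can be sketched roughly as follows. This is an illustration under stated assumptions, not code from Solr or ManifoldCF: TempFileCharSource and its methods are hypothetical names, and only JDK classes are used. The stream is spooled to disk once, each copy opens its own Reader over the file, and close() frees the backing storage.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of SolrJ or ManifoldCF.
final class TempFileCharSource implements AutoCloseable {
    private final Path file;

    // Spools the incoming character stream to a temporary file using a
    // fixed-size buffer, so the content is never held fully in memory.
    TempFileCharSource(Reader in) throws IOException {
        file = Files.createTempFile("field-content-", ".txt");
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Do not leave a partial file behind if spooling fails.
            Files.deleteIfExists(file);
            throw e;
        }
    }

    // Each "copy" of the value gets an independent Reader over the file.
    Reader openReader() throws IOException {
        return Files.newBufferedReader(file, StandardCharsets.UTF_8);
    }

    // True while the backing file is still on disk.
    boolean exists() {
        return Files.exists(file);
    }

    // Frees the backing storage. Something has to call this once every
    // copy is finished; that is the signal the description says is missing.
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(file);
    }
}
```

      A caller would construct one TempFileCharSource per large field, hand out openReader() results to each consumer, and call close() afterwards. The sketch shows why the cleanup point matters: without a reliable moment to call close(), the temporary files accumulate.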

      If anyone knows a good way to do non-extracting updates without loading entire documents into memory, please let me know.

            People

              Assignee: Unassigned
              Reporter: Karl Wright (kwright@metacarta.com)
              Votes: 1
              Watchers: 5

              Dates

                Created:
                Updated: