Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-3289

FST should allow controlling how hard builder tries to share suffixes

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 3.4, 4.0-ALPHA
    • None
    • None
    • New

    Description

      Today we have a boolean option to the FST builder telling it whether
      it should share suffixes.

      If you turn this off, building is much faster, uses much less RAM, and
      the resulting FST is a prefix trie. But, the FST is larger than it
      needs to be. When it's on, the builder maintains a node hash holding
      every node seen so far in the FST – this uses up RAM and slows things
      down.

      On a dataset that Elmer (see java-user thread "Autocompletion on large
      index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles
      avg 67.3 chars per title, building with suffix sharing on took 22.5
      seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix
      sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.

      I think we should allow this boolean to be shade-of-gray instead:
      usually, how well suffixes can share is a function of how far they are
      from the end of the string, so, by adding a tunable N to only share
      when suffix length < N, we can let caller make reasonable tradeoffs.

      Attachments

        1. LUCENE-3289.patch
          3 kB
          Michael McCandless
        2. LUCENE-3289.patch
          17 kB
          Michael McCandless

        Activity

          People

            mikemccand Michael McCandless
            mikemccand Michael McCandless
            Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: