OpenNLP / OPENNLP-421

Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: tools-1.5.2-incubating
    • Fix Version: 2.3.2
    • Component: Name Finder
    • Environment: RedHat 5, JDK 1.6.0_29

    Description

      The current implementation of StringList
      (https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup)
      calls intern() on every String, presumably to reduce memory usage for duplicate tokens. Interned Strings are stored in the JVM's permanent generation, which has a small fixed maximum size (it seems to be about 83 MB on modern 64-bit JVMs: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html). Once the PermGen fills up, the JVM fails with an OutOfMemoryError: PermGen space.
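
To make the interning behaviour concrete, here is a minimal sketch (not taken from StringList; the class and method names are made up for illustration). It only shows that intern() collapses equal Strings into one canonical instance, which the JVM then has to keep in its intern pool.

```java
class InternDemo {
    // Returns true when both references point to the same String object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // Two equal Strings created at run time are distinct objects.
        String a = new String("New York");
        String b = new String("New York");

        // Compares the two freshly created objects.
        System.out.println(sameObject(a, b));
        // Compares the canonical copies returned by the intern pool.
        System.out.println(sameObject(a.intern(), b.intern()));
    }
}
```

Every distinct token that passes through intern() adds one entry to that pool, so a dictionary with many distinct tokens grows the pool accordingly.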

      The size of the PermGen can be increased with the -XX:MaxPermSize= option to the JVM. However, this option is non-standard and not well known, and it would be nice if OpenNLP worked out of the box without deep JVM tuning.

      The immediate problem could be fixed by simply not interning Strings. Looking at the Dictionary and DictionaryNameFinder code as a whole, however, there is a huge amount of room for performance improvement. Currently, DictionaryNameFinder.find works roughly like this:

      for every token in every tokenlist in the dictionary:
          copy it into a "meta dictionary" of single tokens

      for every possible subsequence of tokens in the sentence: // of which there are O(N^2)
          copy the sequence into a new array
          if the last token is in the "meta dictionary":
              make a StringList from the tokens
              look it up in the dictionary
      Dictionary itself is very heavyweight: it's a Set<StringListWrapper>, which wraps StringList, which wraps a String[]. Every entry in the dictionary requires at least four allocated objects (in addition to the Strings): the array, the StringList, the StringListWrapper, and the HashMap.Entry. Even contains and remove allocate new objects!
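
A small sketch of this layering shows why even a lookup allocates. The classes below are hypothetical stand-ins that only mirror the nesting described here (array inside a list object inside a wrapper inside a hash set); they are not the OpenNLP classes, and the allocation counter exists purely for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LayeredDictionarySketch {
    // Counts objects created by the stand-in classes below (illustration only).
    static int allocations = 0;

    // Stand-in for StringList: owns a private copy of the token array.
    static final class TokenList {
        final String[] tokens;
        TokenList(String... tokens) {
            this.tokens = tokens.clone();
            allocations += 2; // the copied array and this TokenList
        }
    }

    // Stand-in for StringListWrapper: supplies equals/hashCode for the set.
    static final class TokenListWrapper {
        final TokenList list;
        TokenListWrapper(TokenList list) {
            this.list = list;
            allocations += 1; // this wrapper
        }
        @Override public boolean equals(Object o) {
            return o instanceof TokenListWrapper
                && Arrays.equals(list.tokens, ((TokenListWrapper) o).list.tokens);
        }
        @Override public int hashCode() {
            return Arrays.hashCode(list.tokens);
        }
    }

    private final Set<TokenListWrapper> entries = new HashSet<>();

    void put(String... tokens) {
        // The HashSet adds its own internal entry object on top of these.
        entries.add(new TokenListWrapper(new TokenList(tokens)));
    }

    boolean contains(String... tokens) {
        // A lookup has to build the same wrapper chain just to call equals/hashCode.
        return entries.contains(new TokenListWrapper(new TokenList(tokens)));
    }
}
```

In this sketch each put and each contains creates three counted objects, which is the kind of per-call overhead a flatter design could avoid.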

      Judging from this comment in DictionaryNameFinder:

      // TODO: improve performance here

      it seems that improvements would be welcome. Removing some of the object overhead would save more memory than interning the Strings does. Should I create a new Jira ticket to propose a more efficient design?

          People

            Assignee: rzo1 Richard Zowalla
            Reporter: jayqhacker Jay Hacker

              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified