OpenNLP / OPENNLP-1525

Improve TokenizerME to make use of abbreviations provided in TokenizerModel


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.3.2
    • Component/s: Tokenizer
    • Labels: None

    Description

      While working on OPENNLP-1479 and reviewing a PR by l-ma, we found that TokenizerME does not use the locale- or language-specific abbreviations provided by the corresponding TokenizerModel.

      As a result, abbreviated terms are mis-tokenized. For example, the German token "S." is an abbreviation of "Seite" (page) and should be tokenized as ["S."], but TokenizerME incorrectly yields ["S", "."].

      Improvement suggested:

      • Make use of the abbreviations dictionary provided by the TokenizerModel
      • Adapt the idea suggested and implemented in OPENNLP-570 (SentenceDetectorME) for TokenizerME
      • Adjust the TokenizerFactoryTest method testCustomPatternForTokenizerMEDeu() for German abbreviations (see the sentence detector test case). The test should expect, and the tokenizer should produce, 14 tokens instead of 16; there is a TODO for this in the test.
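
      The effect of consulting an abbreviation dictionary can be illustrated with a minimal, self-contained sketch. It deliberately does not use the OpenNLP API and is not the change made to TokenizerME: the class AbbreviationAwareSplitter and its tokenize method are hypothetical names, and a plain whitespace and trailing-period rule stands in for the statistical model. It only shows why "S." stays intact once the abbreviation is known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
final class AbbreviationAwareSplitter {

    /**
     * Splits text on whitespace, then detaches a trailing period from each
     * chunk unless the whole chunk is listed as a known abbreviation.
     */
    static List<String> tokenize(String text, Set<String> abbreviations) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            boolean endsWithPeriod = chunk.length() > 1 && chunk.endsWith(".");
            if (endsWithPeriod && !abbreviations.contains(chunk)) {
                // Not a known abbreviation: the period becomes its own token.
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            } else {
                // Known abbreviation or no trailing period: keep as one token.
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Siehe S. 5.";
        // Without abbreviations, "S." is split, as described in this issue.
        System.out.println(tokenize(text, Set.of()));     // [Siehe, S, ., 5, .]
        // With the abbreviation known, "S." is kept as a single token.
        System.out.println(tokenize(text, Set.of("S."))); // [Siehe, S., 5, .]
    }
}
```

      In this toy example the token count drops by one for each recognized abbreviation, which is the same kind of reduction the adjusted test expects (14 tokens instead of 16).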

       


          People

            Assignee: mawiesne Martin Wiesner
            Reporter: mawiesne Martin Wiesner
            Votes: 0
            Watchers: 3


              Time Tracking

                Estimated: 2h
                Remaining: 35m
                Logged: 1h 25m