Description
While working on OPENNLP-1479 and reviewing a PR by l-ma, we identified that TokenizerME doesn't make use of the locale-/language-specific abbreviations provided via the corresponding TokenizerModel.
Therefore, abbreviated terms get mis-tokenized. For example, the German token "S.", an abbreviated form of "Seite" (-> page), should be tokenized as ["S."], but TokenizerME incorrectly yields ["S", "."].
Improvement suggested:
- Make use of the abbreviations dictionary provided by the TokenizerModel
- Adapt the idea suggested and implemented in OPENNLP-570 (SentenceDetectorME) for TokenizerME
- Adjust the TokenizerFactoryTest method testCustomPatternForTokenizerMEDeu() for German abbreviations, following the sentence-detector test case. It should expect and yield 14 tokens instead of 16 - so there is a TODO here.
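To illustrate the intended behavior, here is a minimal, self-contained sketch of the dictionary-lookup idea: after an initial (naive) split, a token followed by "." is re-joined when the combined form appears in the abbreviation dictionary. This is a simplified illustration only, not OpenNLP's actual TokenizerME implementation; the method name and merge strategy are assumptions for this example, and the real fix would consult the abbreviation dictionary inside the TokenizerModel during tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class AbbrevMergeSketch {

    // Hypothetical post-processing step: merge a token and a following "."
    // into a single token when the joined form is a known abbreviation.
    static List<String> mergeAbbreviations(List<String> tokens, Set<String> abbreviations) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()
                    && ".".equals(tokens.get(i + 1))
                    && abbreviations.contains(tokens.get(i) + ".")) {
                merged.add(tokens.get(i) + ".");  // e.g. "S" + "." -> "S."
                i += 2;
            } else {
                merged.add(tokens.get(i));
                i++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed German abbreviation entries for this illustration.
        Set<String> germanAbbreviations = Set.of("S.", "Nr.", "Abk.");
        // Naive tokenization incorrectly splits "S." into "S" and ".".
        List<String> naive = Arrays.asList("siehe", "S", ".", "12");
        System.out.println(mergeAbbreviations(naive, germanAbbreviations));
    }
}
```

Running the example prints [siehe, S., 12]: the abbreviation "S." survives as one token, while the sentence-final period of a non-abbreviation would remain a separate token.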