Description
While working on OPENNLP-1479 and reviewing a PR by l-ma, we identified that TokenizerME doesn't make use of the locale-/language-specific abbreviations provided via the corresponding TokenizerModel.
Therefore, abbreviated terms get mis-tokenized. For example, the German token "S.", an abbreviated form of "Seite" (-> page), should be tokenized as ["S."], but TokenizerME incorrectly yields ["S", "."].
Improvement suggested:
- Make use of the abbreviations dictionary provided by the TokenizerModel
- Adapt the idea suggested and implemented in OPENNLP-570 (SentenceDetectorME) for TokenizerME
- Adjust the TokenizerFactoryTest method testCustomPatternForTokenizerMEDeu() for German abbreviations, following the sentence-detector test case. It should expect and yield 14 tokens instead of 16 - so there is a TODO here.
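To illustrate the intended behavior, here is a minimal, self-contained sketch of the dictionary-lookup idea: after an initial (naive) split, a token followed by "." is re-joined when the combined form appears in the abbreviation dictionary. This is a simplified illustration only, not OpenNLP's actual TokenizerME implementation; the method name and merge strategy are assumptions for this example, and the real fix would consult the abbreviation dictionary inside the TokenizerModel during tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class AbbrevMergeSketch {

    // Hypothetical post-processing step: merge a token and a following "."
    // into a single token when the joined form is a known abbreviation.
    static List<String> mergeAbbreviations(List<String> tokens, Set<String> abbreviations) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()
                    && ".".equals(tokens.get(i + 1))
                    && abbreviations.contains(tokens.get(i) + ".")) {
                merged.add(tokens.get(i) + ".");  // e.g. "S" + "." -> "S."
                i += 2;
            } else {
                merged.add(tokens.get(i));
                i++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed German abbreviation entries for this illustration.
        Set<String> germanAbbreviations = Set.of("S.", "Nr.", "Abk.");
        // Naive tokenization incorrectly splits "S." into "S" and ".".
        List<String> naive = Arrays.asList("siehe", "S", ".", "12");
        System.out.println(mergeAbbreviations(naive, germanAbbreviations));
    }
}
```

Running the example prints [siehe, S., 12]: the abbreviation "S." survives as one token, while the sentence-final period of a non-abbreviation would remain a separate token.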