Details
Bug
Status: Open
Minor
Resolution: Unresolved
1.24.1
None
Important
Description
Some RTF files created in LibreOffice Writer do not seem to be parsed correctly. The RTFParser appears to extract only a portion of the text (e.g. the title).
However, if the same file is opened in Microsoft Word on Windows and saved again as an RTF file, the parser is able to extract the full text.
An example file is attached to the ticket.
Below is a small snippet showing how the parser is used:
private static final Set<MediaType> EXCLUDES = Collections.singleton(MediaType.application("x-tika-ooxml"));
private static final Parser[] PARSERS = new Parser[] { new RTFParser() };
private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);

public String parse(InputStream content) throws IOException, TikaException {
    return TIKA_INSTANCE.parseToString(content);
}
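
When comparing the LibreOffice-written file with the one re-saved from Word, it may help to confirm which application actually wrote each file. The following is a minimal sketch using only the Java standard library. It assumes the file carries the optional `{\*\generator ...}` group from the RTF header; files without it simply yield no result. The class name `RtfGeneratorProbe` and the method `generatorOf` are hypothetical helpers for illustration and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of Tika.
class RtfGeneratorProbe {

    // Matches an optional {\*\generator <name>;} group; the trailing ';' is treated as optional.
    private static final Pattern GENERATOR =
            Pattern.compile("\\{\\\\\\*\\\\generator\\s+([^;}]*);?\\s*\\}");

    // Returns the generator name if the group is present, otherwise Optional.empty().
    static Optional<String> generatorOf(String rtf) {
        Matcher m = GENERATOR.matcher(rtf);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Read the raw bytes as ISO-8859-1 so every byte maps to exactly one char.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String rtf = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(generatorOf(rtf).orElse("(no generator group found)"));
    }
}
```

Running it against both the original file and the re-saved one shows whether the two files really come from different writers before digging into the parser itself.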