Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.18
-
Replicable everywhere in all environments
-
Working on a patch for this issue.
Description
The Boilerpipe extractor in Tika fails to capture space and new-line characters in HTML.
It also inserts additional new-line characters within the text.
Example URL - https://en.wikipedia.org/wiki/Blobfish
In the extracted text, the space in "family Psychrolutidae" is missing, and additional new-line characters appear around round brackets '('.
A related issue was reported long ago: https://issues.apache.org/jira/browse/TIKA-961
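The two symptoms described above can be checked mechanically once the extracted text is available as a string. The following is a minimal sketch using only the Java standard library. The class ExtractionCheck and its methods are hypothetical helpers written for illustration, not part of Tika or Boilerpipe, and the sample strings in main are made up to mimic the symptoms rather than copied from real extractor output.

```java
// Hypothetical helper, not part of Tika or Boilerpipe.
final class ExtractionCheck {

    // Returns true when the phrase is absent from the extracted text but its
    // whitespace-free form is present, e.g. "familyPsychrolutidae".
    static boolean hasMissingSpace(String extracted, String phrase) {
        String squashed = phrase.replaceAll("\\s+", "");
        return !extracted.contains(phrase) && extracted.contains(squashed);
    }

    // Counts '(' characters that have a new-line immediately before or after
    // them, ignoring any spaces or tabs in between.
    static int parensNextToNewline(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '(') {
                continue;
            }
            int before = i - 1;
            while (before >= 0 && isBlank(text.charAt(before))) {
                before--;
            }
            int after = i + 1;
            while (after < text.length() && isBlank(text.charAt(after))) {
                after++;
            }
            boolean newlineBefore = before >= 0 && text.charAt(before) == '\n';
            boolean newlineAfter = after < text.length() && text.charAt(after) == '\n';
            if (newlineBefore || newlineAfter) {
                count++;
            }
        }
        return count;
    }

    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    public static void main(String[] args) {
        // Illustrative input only (assumption): imitates the reported symptoms.
        String extracted = "a deep sea fish of the familyPsychrolutidae\n(\nblobfish)";
        System.out.println(hasMissingSpace(extracted, "family Psychrolutidae"));
        System.out.println(parensNextToNewline(extracted));
    }
}
```

A check like this could be run against the extractor output for the example URL before and after a patch to confirm that both symptoms are gone.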