Description
For the input text "Indanyl", the encoding is detected as IBM500 even when UTF-8 is declared explicitly.
I would argue that detection should be used only when the declared encoding turns out to be incorrect, which saves time and avoids wrong detections, as Ken Krugler proposed in TIKA-539.
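A minimal sketch of the "trust the declaration first" approach, in plain Java. It is not Tika's implementation: the class `DeclaredCharsetFirst`, its methods, and the `detector` callback are hypothetical names used only for illustration. The assumption is that a declared encoding counts as correct when the bytes decode in it without errors, and that the statistical detector is consulted only otherwise.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper, not part of Tika.
public class DeclaredCharsetFirst {

    /** Returns true if every byte sequence in data is valid in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Keeps the declared charset when the data is valid in it;
     * otherwise falls back to the supplied detector.
     */
    static Charset chooseCharset(byte[] data, Charset declared,
                                 Function<byte[], Charset> detector) {
        if (declared != null && decodesCleanly(data, declared)) {
            return declared;          // detection is skipped entirely
        }
        return detector.apply(data);  // declaration missing or wrong
    }

    public static void main(String[] args) {
        // Stand-in for a detector that misfires on short input, like the
        // IBM500 guess reported here. UTF-16BE is used only because it is
        // guaranteed to exist in every JVM.
        Function<byte[], Charset> misfiringDetector = bytes -> StandardCharsets.UTF_16BE;

        byte[] text = "Indanyl".getBytes(StandardCharsets.UTF_8);
        // The declared UTF-8 is valid for these bytes, so the detector is never called.
        System.out.println(chooseCharset(text, StandardCharsets.UTF_8, misfiringDetector));

        byte[] notUtf8 = {(byte) 0xC3, (byte) 0x28}; // malformed UTF-8 sequence
        // The declaration is wrong for these bytes, so the detector decides.
        System.out.println(chooseCharset(notUtf8, StandardCharsets.UTF_8, misfiringDetector));
    }
}
```

This validity check is only meaningful for encodings that can reject input, such as UTF-8. A single-byte charset like ISO-8859-1 accepts every byte sequence, so a wrong declaration of that kind would need a different test.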
Issue Links
- duplicates
  - TIKA-771 "Hello, World!" in UTF-8/ASCII gets detected as IBM500 (Resolved)
- is related to
  - TIKA-2771 enableInputFilter() wrecks charset detection for some short html documents (Open)
  - TIKA-539 Encoding detection is too biased by encoding in meta tag (Reopened)
- relates to
  - TIKA-2047 TXTParser overwrites mime type/masks types that are subtype of text (Resolved)