Description
To address various shortcomings of Tika's encoding detection, I've had to modify the TikaEncodingDetector several times; cf. ANY23-385 and ANY23-411. In the former, I gave much greater weight to charsets declared in html meta elements and xml declarations; in the latter, to charsets returned in HTTP Content-Type headers.
However, after taking a look at TIKA-539, I'm thinking I should reduce this added weight (at least for html meta elements), and perhaps ignore such declarations altogether, unless a declaration happens to match UTF-8: incorrect declarations usually declare something other than UTF-8 when the correct charset is in fact UTF-8.
Over 90% of all webpages use UTF-8, and every encoding detection error we've encountered to date has involved detecting something other than UTF-8 when the correct encoding was actually UTF-8, never the other way around.
Therefore, what I propose is the following (a rough sketch in code follows these points):
(1) In the absence of a Content-Type header, any declared hints that the charset is UTF-8 should add to the weight for UTF-8, while any declared hints that the charset is not UTF-8 should be ignored.
(2) In the presence of a Content-Type header, all other declared hints should be ignored, unless a hint matches UTF-8 while the Content-Type header does not, in which case all hints, including the Content-Type header, should be ignored.
EDIT: The above 2 points are a simplification of what I've actually implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See PR 131 for details.
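To make the two rules concrete, here is a minimal, self-contained sketch of the proposed hint filtering. The class and method names (CharsetHintResolver, effectiveHints) are hypothetical, and this mirrors the simplified rules above rather than the actual PR 131 change, which does not always discard non-UTF-8 hints:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class CharsetHintResolver {

    /**
     * @param contentTypeCharset charset from the HTTP Content-Type header, or null if absent
     * @param declaredHints      charsets declared in html meta elements / xml declarations
     * @return the hints that should contribute weight to statistical detection
     */
    static List<Charset> effectiveHints(Charset contentTypeCharset, List<Charset> declaredHints) {
        if (contentTypeCharset == null) {
            // Rule (1): no Content-Type header -> keep only hints that agree with UTF-8.
            return declaredHints.stream()
                    .filter(StandardCharsets.UTF_8::equals)
                    .collect(Collectors.toList());
        }
        // Rule (2): a declared UTF-8 hint that contradicts a non-UTF-8 header
        // means we trust neither; fall back to purely statistical detection.
        boolean conflictingUtf8Hint = !StandardCharsets.UTF_8.equals(contentTypeCharset)
                && declaredHints.contains(StandardCharsets.UTF_8);
        if (conflictingUtf8Hint) {
            return Collections.emptyList();
        }
        // Otherwise the header wins and the other declared hints are ignored.
        return Collections.singletonList(contentTypeCharset);
    }

    public static void main(String[] args) {
        // Header says ISO-8859-1 but a meta element says UTF-8: ignore everything.
        System.out.println(effectiveHints(
                Charset.forName("ISO-8859-1"),
                Collections.singletonList(StandardCharsets.UTF_8))); // []
        // No header, meta element says UTF-8: boost UTF-8.
        System.out.println(effectiveHints(
                null,
                Collections.singletonList(StandardCharsets.UTF_8))); // [UTF-8]
    }
}
```

Returning an empty hint list in the conflict case deliberately falls back to purely statistical detection, so a contradiction between the header and a UTF-8 declaration means neither side is trusted.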