Description
To address various shortcomings of Tika's encoding detection, I've had to modify the TikaEncodingDetector several times; cf. ANY23-385 and ANY23-411. In the former, I gave much greater weight to charsets declared in html meta elements and xml declarations; in the latter, to charsets returned in HTTP Content-Type headers.
However, after taking a look at TIKA-539, I'm thinking I should reduce this added weight (at least for html meta elements), and perhaps ignore such declarations altogether, unless a declaration happens to match UTF-8: incorrect declarations usually declare something other than UTF-8 when the correct charset is in fact UTF-8.
Over 90% of all webpages use UTF-8, and every encoding detection error we've encountered to date has involved detecting something other than UTF-8 when the correct encoding was actually UTF-8, never the other way around.
Therefore, what I propose is the following (a rough sketch in code follows these points):
(1) In the absence of a Content-Type header, any declared hints that the charset is UTF-8 should add to the weight for UTF-8, while any declared hints that the charset is not UTF-8 should be ignored.
(2) In the presence of a Content-Type header, all other declared hints should be ignored, unless a hint matches UTF-8 while the Content-Type header does not, in which case all hints, including the Content-Type header, should be ignored.
EDIT: The above 2 points are a simplification of what I've actually implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See PR 131 for details.
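To make the two rules concrete, here is a minimal, self-contained sketch of the proposed hint filtering. The class and method names (CharsetHintResolver, effectiveHints) are hypothetical, and this mirrors the simplified rules above rather than the actual PR 131 change, which does not always discard non-UTF-8 hints:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class CharsetHintResolver {

    /**
     * @param contentTypeCharset charset from the HTTP Content-Type header, or null if absent
     * @param declaredHints      charsets declared in html meta elements / xml declarations
     * @return the hints that should contribute weight to statistical detection
     */
    static List<Charset> effectiveHints(Charset contentTypeCharset, List<Charset> declaredHints) {
        if (contentTypeCharset == null) {
            // Rule (1): no Content-Type header -> keep only hints that agree with UTF-8.
            return declaredHints.stream()
                    .filter(StandardCharsets.UTF_8::equals)
                    .collect(Collectors.toList());
        }
        // Rule (2): a declared UTF-8 hint that contradicts a non-UTF-8 header
        // means we trust neither; fall back to purely statistical detection.
        boolean conflictingUtf8Hint = !StandardCharsets.UTF_8.equals(contentTypeCharset)
                && declaredHints.contains(StandardCharsets.UTF_8);
        if (conflictingUtf8Hint) {
            return Collections.emptyList();
        }
        // Otherwise the header wins and the other declared hints are ignored.
        return Collections.singletonList(contentTypeCharset);
    }

    public static void main(String[] args) {
        // Header says ISO-8859-1 but a meta element says UTF-8: ignore everything.
        System.out.println(effectiveHints(
                Charset.forName("ISO-8859-1"),
                Collections.singletonList(StandardCharsets.UTF_8))); // []
        // No header, meta element says UTF-8: boost UTF-8.
        System.out.println(effectiveHints(
                null,
                Collections.singletonList(StandardCharsets.UTF_8))); // [UTF-8]
    }
}
```

Returning an empty hint list in the conflict case deliberately falls back to purely statistical detection, so a contradiction between the header and a UTF-8 declaration means neither side is trusted.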