[RAT-147] binary guesser design improvement - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.8
Fix Version/s: 0.17
Component/s: None
Labels: None

Description

A release manager cut a release; RAT was run and reported no problems. Another user then built from the source tag, and RAT complained that 2 files were missing headers. This was traced to the "binary guesser", which reads the first 200 bytes of a file and "guesses" whether it is binary. The file in question had a UTF-8 byte-order mark at the beginning and was otherwise plain ASCII.

The reason for the 2 different results: the release manager's OS had its default file encoding set to US-ASCII (as determined by running a small Java program that prints the value of System.getProperty("file.encoding")). This encoding is 7-bit ASCII, so when the guesser decodes the file it gets a malformed-input exception on the 3 bytes at the beginning of the file. This causes the guesser to conclude that this is a "binary" file which doesn't need to be RAT-checked.

The other user was on a Windows 7 machine, where file.encoding defaults to Cp1252, which does have code points defined for the first 3 bytes and therefore doesn't throw any exception. The guesser therefore guesses that this isn't a binary file, checks it, and reports a missing header (the file is test data...).
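The difference between the two machines can be reproduced outside RAT. The following is a minimal sketch, not RAT's actual guesser code: the class name BomDecodeDemo and the helper decodesCleanly are hypothetical, and it assumes the guesser's decoding behaves like a strict CharsetDecoder that reports malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Hypothetical helper: true if the bytes decode without error in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // UTF-8 byte-order mark (EF BB BF) followed by plain ASCII text.
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'a', 'd', 'e', 'r'};

        // The platform default that the issue identifies as the source of the difference.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));

        // US-ASCII rejects every byte above 0x7F, so the BOM is malformed input.
        System.out.println("US-ASCII decodes cleanly: "
                + decodesCleanly(sample, StandardCharsets.US_ASCII));          // false

        // Cp1252 (windows-1252) maps EF, BB and BF to printable characters.
        System.out.println("Cp1252 decodes cleanly:   "
                + decodesCleanly(sample, Charset.forName("windows-1252")));   // true
    }
}
```

The same bytes are therefore classified differently depending only on which charset the decoder is given.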

Workaround - add the file to the explicit excludes.

Potential problem - on a machine with default encoding US-ASCII, RAT will improperly skip checking files which perhaps should have headers, if they have a UTF-8 byte-order mark.

Potential problem #2 - RAT is dependent on the default file encoding setting for part of its behavior, causing differences in what it checks.

I'm not sure what a good solution would be here. It might range from eliminating the binary "guesser" that looks at the first 200 bytes of a file, to forcing UTF-8 as the charset to use.
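One way the "force UTF-8" option might look is sketched below. This is only an illustration under stated assumptions, not the fix that was eventually applied: the class Utf8BinaryGuess, the method looksBinary, and the control-character rule are all hypothetical. The point is that the result depends only on the file's bytes, never on file.encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8BinaryGuess {
    // Sample size taken from the issue description (first 200 bytes).
    static final int SAMPLE_SIZE = 200;

    // Hypothetical check: guesses "binary" using a fixed UTF-8 decoder.
    static boolean looksBinary(byte[] content) {
        int len = Math.min(content.length, SAMPLE_SIZE);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // UTF-8 never yields more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(len);
        // endOfInput=false: a multi-byte sequence cut off by the sample limit
        // is left undecoded instead of being reported as malformed.
        CoderResult result = decoder.decode(ByteBuffer.wrap(content, 0, len), out, false);
        if (result.isError()) {
            return true; // not valid UTF-8, treat as binary
        }
        out.flip();
        for (int i = 0; i < out.length(); i++) {
            char c = out.charAt(i);
            // A leading BOM decodes to U+FEFF, which is not a control character.
            boolean allowedWhitespace = c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (c < 0x20 && !allowedWhitespace) {
                return true; // NUL and other control characters suggest binary data
            }
        }
        return false;
    }
}
```

With this sketch, a file consisting of a UTF-8 byte-order mark followed by ASCII text is treated as text on every platform, while data such as a PNG signature is still rejected. Text files in legacy single-byte encodings with bytes above 0x7F would mostly be classified as binary, which is a trade-off of forcing a single charset.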

Attachments

unix-newlines.txt.bin, 18/Apr/24 07:31, 0.1 kB, Richard Eckart de Castilho
windows-newlines.txt.bin, 18/Apr/24 07:31, 0.1 kB, Richard Eckart de Castilho

Issue Links

relates to

RAT-150 RAT should use Apache Tika to simply guess ignored [application/X] file types and focus on the [text/Y] family as a sensible default (Resolved)

People

Assignee: Claude Warren

Reporter: Marshall Schor

Votes: 0

Watchers: 3

Dates

Created: 27/Aug/13 20:57

Updated: 04/May/24 15:53

Resolved: 04/May/24 15:53