Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.17, 1.18
Fix Version/s: None
Environment: Source code MailContentHandler.java, function handleInlineBodyPart()
Description
In MailContentHandler.java, the handleInlineBodyPart() function processes EML files by preferring the TEXT_HTML part, then application/rtf, and finally TEXT_PLAIN.
However, as I explained in TIKA-2814, the content extracted from TEXT_HTML is not cleaned, whereas the content in TEXT_PLAIN is clean and readable.
We should therefore make TEXT_PLAIN the first priority. This would keep unwanted text such as "FONT-SIZE: 9pt; FONT-FAMILY: arial" out of the extracted content, and it could even speed up processing.
I have uploaded a sample EML file here: https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
It contains both a text/html section and a text/plain section, and you can see that the text/plain section is much cleaner and more readable than the text/html one.
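To illustrate the proposed change, here is a minimal sketch of priority-based selection among alternative body parts. It is not the actual Tika code: the class AlternativePartPicker, the method pick(), and the two priority lists are hypothetical names made up for this example, and the relative order of application/rtf and text/html after text/plain is my assumption, since the proposal only says that text/plain should come first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, NOT part of Tika: chooses one alternative body part
// by walking a MIME-type priority list.
final class AlternativePartPicker {

    // Order described in this issue as the current behavior.
    static final List<String> HTML_FIRST =
            Arrays.asList("text/html", "application/rtf", "text/plain");

    // Proposed order: plain text first. The order of the remaining two
    // types is an assumption made for this sketch.
    static final List<String> PLAIN_FIRST =
            Arrays.asList("text/plain", "application/rtf", "text/html");

    // Returns the first available content type that matches the
    // highest-ranked entry in the priority list, or empty if none match.
    static Optional<String> pick(List<String> available, List<String> priority) {
        for (String wanted : priority) {
            for (String candidate : available) {
                // Compare only type/subtype, ignoring parameters such as charset.
                String base = candidate.split(";", 2)[0].trim();
                if (base.equalsIgnoreCase(wanted)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> parts =
                Arrays.asList("text/html; charset=utf-8", "text/plain; charset=utf-8");
        System.out.println(pick(parts, HTML_FIRST).orElse("none"));
        System.out.println(pick(parts, PLAIN_FIRST).orElse("none"));
    }
}
```

With a message like the sample EML, which carries both sections, the first call selects the text/html part and the second selects the text/plain part, so only the priority list would need to change.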