Tika / TIKA-2815

Priority of processing EML file should be TEXT_PLAIN instead of TEXT_HTML

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17, 1.18
    • Fix Version/s: None
    • Component/s: parser
    • Environment: Source code MailContentHandler.java, function handleInlineBodyPart()

    Description

      In MailContentHandler.java, the handleInlineBodyPart() function chooses the body of an EML file in this order of priority: TEXT_HTML first, then application/rtf, and finally TEXT_PLAIN.

      However, as explained in TIKA-2814, the content extracted from TEXT_HTML is not cleaned, whereas the content in TEXT_PLAIN is clean and readable.

      We should therefore make TEXT_PLAIN the first priority. This would prevent unwanted text such as "FONT-SIZE: 9pt; FONT-FAMILY: arial" from being extracted, and it could even lead to faster processing.
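
      To make the proposal concrete, here is a minimal sketch of selecting one alternative by an ordered preference list. It is not Tika's code: the class BodyPartPicker, the pick() method, and both constants are hypothetical names for illustration, and the position of text/html and application/rtf after text/plain in the proposed order is an assumption, since this issue only asks for TEXT_PLAIN to come first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not MailContentHandler code.
class BodyPartPicker {

    // Priority order described in this issue for handleInlineBodyPart().
    static final List<String> CURRENT_ORDER =
            Arrays.asList("text/html", "application/rtf", "text/plain");

    // Proposed order: plain text first. The order of the remaining two
    // types is an assumption made for this sketch.
    static final List<String> PROPOSED_ORDER =
            Arrays.asList("text/plain", "text/html", "application/rtf");

    // Returns the first MIME type in `preference` that is present in
    // `alternatives`, or null when none of the preferred types is present.
    static String pick(Map<String, String> alternatives, List<String> preference) {
        for (String mimeType : preference) {
            if (alternatives.containsKey(mimeType)) {
                return mimeType;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A message carrying the same body as both text/plain and text/html.
        Map<String, String> alternatives = new LinkedHashMap<>();
        alternatives.put("text/plain", "Hello Bob");
        alternatives.put("text/html",
                "<span style=\"FONT-SIZE: 9pt; FONT-FAMILY: arial\">Hello Bob</span>");

        System.out.println(pick(alternatives, CURRENT_ORDER));  // prints "text/html"
        System.out.println(pick(alternatives, PROPOSED_ORDER)); // prints "text/plain"
    }
}
```

      With a message that carries both sections, the current order selects the HTML part and the proposed order selects the plain-text part, so the styling text never reaches the output.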

      I have uploaded a sample EML file here: https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing

      It has both a text/html section and a text/plain section, and you can see that the text/plain section is much cleaner and more readable than the text/html section.

          People

            Assignee: Unassigned
            Reporter: Edwin Yeo Zheng Lin (edwinyeozl)
            Votes: 0
            Watchers: 1
