Tika / TIKA-3515

Tika CLI -t should use UTF-8 as default output encoding

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 1.27
    • Fix Version/s: 2.1.0
    • Component/s: None
    • Labels: None
    • Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302

    Description

      Some Korean characters are extracted as squares, even though the encodings of the plain-text files are detected correctly. This may be related to the content handler (just a guess). I'll attach the triggering files.
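
      For illustration only, here is a minimal Java sketch of the kind of loss the issue title points at. It is not Tika's code. The class and method names (OutputEncodingSketch, roundTrip) are hypothetical, and ISO-8859-1 is only an assumed stand-in for a non-UTF-8 platform default charset. The idea is that text decoded correctly can still be damaged when it is written out in a charset that cannot represent Hangul.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputEncodingSketch {

    // Encode the text with the given charset and decode it again, mimicking
    // text written by one program and read back with the same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Seoul" in Hangul, written with escapes so the source file's own
        // encoding does not matter.
        String korean = "\uC11C\uC6B8";

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(korean, StandardCharsets.UTF_8).equals(korean));

        // ISO-8859-1 (assumed stand-in for a legacy default) has no Hangul;
        // String.getBytes substitutes a replacement byte for each character.
        System.out.println(roundTrip(korean, StandardCharsets.ISO_8859_1).equals(korean));

        // What the JVM would use when no charset is given explicitly.
        System.out.println(Charset.defaultCharset());
    }
}
```

      On Java 8 the default charset is derived from the platform locale, so the third line printed may differ between machines. Passing an explicit charset wherever bytes are written avoids depending on it.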

      Attachments

        1. LIVE-Seoul-ntfs-utf-16-be.txt
          2 kB
          Luís Filipe Nassif
        2. LIVE-Seoul-ntfs-utf-16-le.txt
          2 kB
          Luís Filipe Nassif
        3. LIVE-Seoul-ntfs-utf-8.txt
          1 kB
          Luís Filipe Nassif
        4. Korean lessons_ Lesson 2 – Learnkorean.com.pdf
          865 kB
          Luís Filipe Nassif
        5. Screen Shot 2021-08-06 at 5.50.04 PM.png
          128 kB
          Tim Allison
        6. Screen Shot 2021-08-06 at 5.50.21 PM.png
          105 kB
          Tim Allison
        7. Screen Shot tika-app.png
          99 kB
          Tim Allison
        8. image-2021-08-09-14-37-30-552.png
          98 kB
          Luís Filipe Nassif
        9. image-2021-08-09-14-38-26-763.png
          45 kB
          Luís Filipe Nassif
        10. LIVE-Seoul-ntfs-utf-8_-t_output.txt
          1 kB
          Luís Filipe Nassif
        11. LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml
          2 kB
          Luís Filipe Nassif

            People

              Assignee: Tim Allison (tallison)
              Reporter: Luís Filipe Nassif (lfcnassif)