Tika / TIKA-324

Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.3, 0.4, 0.5
    • Fix Version/s: 0.6
    • Component/s: cli
    • Labels: None
    • Environment: Mac OS 10.5, java version "1.6.0_15"

    Description

      When the Tika CLI is run with the -t flag, multi-byte characters in the input are replaced with "?" in the output. The same file is rendered correctly with the -x flag.

      Example:

      $ java -jar tika-app-0.4.jar -t ./test.txt
      I?t?rn?ti?n?liz?ti?n

      $ java -jar tika-app-0.4.jar -x ./test.txt
      <?xml version="1.0" encoding="UTF-8"?>
      <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <title/>
      </head>
      <body>
      <p>Iñtërnâtiônàlizætiøn
      </p>
      </body>
      </html>

      See also: http://drupal.org/node/622508#comment-2267918
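
In the -t output above, each non-ASCII character becomes a single "?". That is what happens when text is encoded to a charset that cannot represent those characters. The Java sketch below reproduces the effect in isolation. It is not Tika's code: the class name EncodingDemo and the helper encode are hypothetical, and it assumes (the report does not say so explicitly) that the -t path wrote its output through a charset other than UTF-8, such as the platform default. US-ASCII stands in for that charset here only to make the result deterministic.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
final class EncodingDemo {

    // Writes the text through a Writer bound to an explicit charset and
    // returns the bytes that would reach the terminal or file.
    public static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The sample string from the report, written with Unicode escapes so
        // the example does not depend on the source file's own encoding.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // A charset that cannot represent the accented characters:
        // the encoder substitutes '?' for each one.
        byte[] ascii = encode(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        // prints: I?t?rn?ti?n?liz?ti?n

        // An explicit UTF-8 writer preserves every character.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
        // prints: true
    }
}
```

Passing the charset to OutputStreamWriter explicitly, rather than relying on the JVM default, keeps the output independent of the platform the command runs on.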

      Attachments

        1. test.txt
          0.0 kB
          Peter Wolanin
        2. TIKA-324.patch
          1 kB
          Peter Wolanin
        3. TIKA-324-0.5.patch
          0.6 kB
          Peter Wolanin
        4. TIKA-324.patch
          1 kB
          Peter Wolanin
        5. TIKA-324-macosx.patch
          0.8 kB
          Jukka Zitting
        6. TIKA-324-README.patch
          0.8 kB
          Peter Wolanin

            People

              Assignee: Jukka Zitting (jukkaz)
              Reporter: Peter Wolanin (pwolanin)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: 2h
                  Remaining: 2h
                  Logged: Not Specified