ManifoldCF / CONNECTORS-1617

Date format extraction problem in XLS/XLSX

Details

    Description

      Currently, Tika/ManifoldCF 2.10 extracts dates from the attached file this way:

      2018.05.10 -> 10/05/18
      2002.02.02 -> 2/2/2

      We need this format:

      2018.05.10 -> 2018-05-10

      2002.02.02 -> 2002-02-02

      This occurs only when the field type is date. When the field type is text then the output is fine.

      Please recommend any settings in the pipeline (Tika configuration, Excel settings, OS locale settings, etc.) that would produce this format, or provide a fix.

      Attachments

        1. exceldatum.xlsx
          9 kB
          Zoltan Farago

        Activity

          zfarago Zoltan Farago added a comment - - edited

          alexlumpov could you please take a look at this issue? It's been pending for a long time, and we need a solution for it. Thank you!

          zfarago Zoltan Farago added a comment -

          daddywri could you please help us and assign this task to an active developer? Thank you in advance.

          kwright@metacarta.com Karl Wright added a comment -

          Are you using the external Tika extractor, or the embedded one?

          kwright@metacarta.com Karl Wright added a comment -

          The internal Tika extractor treats all metadata as strings, using the Tika library. I don't think the date format is configurable. Indeed, there's a blog post on this:

          https://grokbase.com/t/tika/user/10982he7yd/how-can-i-configure-tika-to-extract-dates-in-single-format

          Note that Tika tries to maintain the date format present in the original spreadsheet!!

          The solution proposed when you want a specific date format is this:

          • Write your own excel parser for Tika, which ignores the date formatting
            set for cells, and always uses iso8601

          That's not going to cut it here because we don't have any information that would allow us to autodetect the incoming format properly. It's basically just a text file and there are no hints, especially for dates like "01-01-2010". Which comes first, the day or the month?
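
          The ambiguity is easy to reproduce. The following is only an illustrative sketch in Python (standard library only), not anything Tika or ManifoldCF does: the helper name parse_candidates and the two format strings are assumptions, chosen as two plausible readings of the "10/05/18" value from the description.

```python
from datetime import datetime

def parse_candidates(text, patterns):
    """Return {pattern: ISO date} for every pattern that can parse the text."""
    results = {}
    for pattern in patterns:
        try:
            results[pattern] = datetime.strptime(text, pattern).date().isoformat()
        except ValueError:
            # This pattern does not fit the text, so it is not a candidate.
            continue
    return results

# Two guesses (day-first and month-first) for the value reported in this ticket.
candidates = parse_candidates("10/05/18", ["%d/%m/%y", "%m/%d/%y"])
for pattern, iso in candidates.items():
    print(pattern, "->", iso)
```

          Both patterns parse successfully and yield different calendar dates, so the extracted string alone does not say which reading is the right one.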

          The external Tika extractor has even less configurability because you cannot run custom code there.

          Now, suppose all you want to do is post-process just the dates to change the separator character. Well, we do not know whether the field being returned from Tika is even a date. If we replaced every "/" with "-" in it, we'd corrupt other kinds of fields.
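
          A small Python sketch of that failure mode, with made-up sample values (the function name naive_separator_fix and the sample strings are hypothetical, chosen only to illustrate the point):

```python
def naive_separator_fix(value):
    """Blindly swap '/' for '-' without knowing whether the value is a date."""
    return value.replace("/", "-")

# Hypothetical field values: one date-like string and three that are not dates.
samples = [
    "10/05/18",
    "application/vnd.ms-excel",
    "docs/reports/2018.xlsx",
    "1/2 cup",
]

for sample in samples:
    print(repr(sample), "->", repr(naive_separator_fix(sample)))
```

          The non-date values are altered as well, and even the date-like value only becomes "10-05-18", which is still not the requested 2018-05-10 form.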

          My conclusion: there's nothing we can do in ManifoldCF to fix this problem. A solution might be found in Tika itself, but only if somebody tickets it. Tika would need to go through the column definitions and understand which columns were dates and act accordingly. Feel free to open a Tika ticket accordingly.
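
          To picture what such a Tika-side change would involve, here is a rough Python sketch. It is not Tika's code. It relies on two facts about the XLSX format: a date cell is stored as a serial day number, and the cell's number format is what marks it as a date. The function names (looks_like_date_format, serial_to_iso, render_cell) are hypothetical. The sketch assumes the workbook uses the 1900 date system with serials of 61 or more, and it treats time-only formats like dates for simplicity.

```python
import re
from datetime import date, timedelta

# Built-in number format IDs reserved for date and time formats in SpreadsheetML.
BUILTIN_DATE_FORMAT_IDS = set(range(14, 23)) | set(range(45, 48))

# Epoch that maps 1900-date-system serials (>= 61) onto real calendar dates.
# Assumption: the workbook does not use the alternative 1904 date system.
EXCEL_EPOCH = date(1899, 12, 30)

def looks_like_date_format(num_fmt_id, format_code=None):
    """Heuristic: does this number format describe a date or time?"""
    if num_fmt_id in BUILTIN_DATE_FORMAT_IDS:
        return True
    if not format_code:
        return False
    # Ignore quoted literals, bracketed sections such as [Red], and escaped characters.
    stripped = re.sub(r'"[^"]*"|\[[^\]]*\]|\\.', "", format_code)
    return re.search(r"[ymdhs]", stripped, re.IGNORECASE) is not None

def serial_to_iso(serial):
    """Convert a serial day number to an ISO 8601 date; any time fraction is dropped."""
    return (EXCEL_EPOCH + timedelta(days=int(serial))).isoformat()

def render_cell(raw_value, num_fmt_id, format_code=None):
    """Return ISO 8601 text for date cells and the raw text for everything else."""
    if looks_like_date_format(num_fmt_id, format_code):
        return serial_to_iso(float(raw_value))
    return raw_value

print(render_cell("43230", 14))                     # built-in date format
print(render_cell("37289", 164, "yyyy\\.mm\\.dd"))  # custom date format code
print(render_cell("43230", 0))                      # General: left untouched
```

          The point of the sketch is that the cell's display format is used only to decide whether the value is a date, and is then ignored when producing the output text.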

          kwright@metacarta.com Karl Wright added a comment -

          I'm marking this as "won't fix" although it should really be "can't fix". If a Tika ticket gets created to address date format configurability then please include it here; if there's already some configurability present we can work with that. Thanks!

          zfarago Zoltan Farago added a comment - - edited

          kwright@metacarta.com what do you mean by creating a Tika ticket? This ticket is related to all the Tika components we found in JIRA. Should I create a new one?

          Is it possible to move this issue there? 

          kwright@metacarta.com Karl Wright added a comment -

          Jira -> Create -> pull down "TIKA" in the "Project" pulldown.


          People

            kwright@metacarta.com Karl Wright
            zfarago Zoltan Farago