SYSTEMDS-1244: FrameReader with CSV format has issues with double quotes in some cases

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: SystemML 0.13
    • Component/s: None
    • Labels: None

    Description

      This is an example of the input data.
      It has three columns, with TAB as the field separator.

      "20news-bydate-train/alt.atheism/49960" """" 88.0
      "20news-bydate-train/alt.atheism/49960" "#" 1.0

      A couple of observations so far:
      1. A double quote is treated as part of the input.
      2. The next double quote is treated as the end of the input field.
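
The two observations above can be contrasted with RFC 4180-style field splitting. The following is a minimal sketch, assuming that a doubled quote inside a quoted field stands for one literal quote and that the enclosing quotes are not part of the value. The class and method names are hypothetical; this is not the SystemML FrameReader code, and the issue does not state whether the actual fix strips the quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not the SystemML implementation.
class QuoteAwareSplit {

    // Splits one line on delim, ignoring delimiters inside double quotes.
    // Assumption: "" inside a quoted section is an escaped literal quote.
    static List<String> split(String line, char delim) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"'); // escaped quote: keep one, skip the other
                        i++;
                    } else {
                        inQuotes = false; // closing quote, field may continue
                    }
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == delim) {
                fields.add(cur.toString()); // field ends only at an unquoted delimiter
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // First sample row from the description, with TAB separators.
        String line = "\"20news-bydate-train/alt.atheism/49960\"\t\"\"\"\"\t88.0";
        // prints [20news-bydate-train/alt.atheism/49960, ", 88.0]
        System.out.println(split(line, '\t'));
    }
}
```

With this reading, the second column of the first sample row is a single double-quote character, and the row still has three fields.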


        Activity

          acs_s Arvind Surve added a comment -

          mboehm7, reinwald@us.ibm.com mentioned that you are looking at this issue as well.

          mboehm7 Matthias Boehm added a comment -

          Yes, I already fixed it and will push it to master with the next batch of changes.

          acs_s Arvind Surve added a comment -

          OK, I have created a pull request. Please look at those changes before pushing yours.

          mboehm7 Matthias Boehm added a comment -

          Just to clarify - there were two issues: (1) tokens that are a concatenation of quoted tokens (according to RFC 4180) and non-quoted tokens were split after the last quote, and (2) frame metadata was parsed incorrectly.

          We have now made the related split and count functionality more robust with regard to these special cases, without sacrificing performance for the common case without quotes.

          acs_s, would you mind closing your related PR?
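
The idea of keeping the common case fast while handling quotes correctly can be sketched as follows. This is an illustrative assumption about how such a token count might be structured, not the code that was pushed to master; the class and method names are hypothetical.

```java
// Hypothetical illustration only, not the SystemML implementation.
class QuoteAwareCount {

    // Counts the fields in one line, ignoring delimiters inside double quotes.
    static int countTokens(String line, char delim) {
        int n = 1;
        // Fast path: without any quote, every delimiter ends a field.
        if (line.indexOf('"') < 0) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == delim) {
                    n++;
                }
            }
            return n;
        }
        // Slow path: track whether the scan is inside a quoted section.
        // An escaped quote ("") toggles the flag twice, so the state is unchanged.
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == delim && !inQuotes) {
                n++; // a token ends only at an unquoted delimiter
            }
        }
        return n;
    }
}
```

Because a token ends only at an unquoted delimiter, a token such as "ab"cd, which concatenates a quoted and a non-quoted part, is counted once rather than being split after the last quote.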

          acs_s Arvind Surve added a comment -

          I have verified this issue with the SystemML 0.13 nightly build (Spark 2.0).

          This issue needs to be addressed on Spark 1.6 (branch 0.12) as well.
          mboehm7, can you please put these changes into branch 0.12 as well, before the next build on that branch?

          mboehm7 Matthias Boehm added a comment -

          I could look into this early next week - but the code is in master if needed earlier.


          People

            Assignee: mboehm7 Matthias Boehm
            Reporter: acs_s Arvind Surve