[SYSTEMDS-1244] FrameReader with CSV format has issues due to double quotes in some cases - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: SystemML 0.13
Component/s: None
Labels: None

Description

This is an example of the input data.
It has three columns, with TAB as the field separator.

"20news-bydate-train/alt.atheism/49960" """" 88.0
"20news-bydate-train/alt.atheism/49960" "#" 1.0

A couple of observations so far:
1. A double quote is treated as part of the input.
2. The next double quote is treated as the end of the input field.
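
For reference, under RFC 4180 quoting the field `""""` is a quoted field that holds one escaped double quote, so the first row should yield a literal `"` in its second column. The sketch below uses Python's standard `csv` module (not SystemML's FrameReader) to show that expected parse. The `parse_rows` helper and the `sample` string, with TAB characters restored between the fields, are assumptions made for illustration.

```python
import csv
import io

# Sample rows from the description, with TAB assumed as the field separator.
sample = (
    '"20news-bydate-train/alt.atheism/49960"\t""""\t88.0\n'
    '"20news-bydate-train/alt.atheism/49960"\t"#"\t1.0\n'
)


def parse_rows(text, sep="\t"):
    """Parse delimited text with RFC 4180-style quoting (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=sep, quotechar='"')
    return list(reader)


rows = parse_rows(sample)
for row in rows:
    print(row)
```

Each row should come back with exactly three fields, and the surrounding quotes should not appear in the parsed values.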

Activity

Arvind Surve added a comment - 10/Feb/17 22:15

mboehm7, reinwald@us.ibm.com mentioned that you are looking at this issue as well.

Matthias Boehm added a comment - 11/Feb/17 07:34

Yes, I already fixed it and will push it to master with the next batch of changes.

Arvind Surve added a comment - 11/Feb/17 19:04

OK, I have created a pull request. Please look at those changes before pushing yours.

Matthias Boehm added a comment - 15/Feb/17 04:15

Just to clarify, there were two issues: (1) tokens that are a concatenation of quoted tokens (according to RFC4180) and non-quoted tokens were split after the last quote, and (2) frame metadata was parsed incorrectly.

We have now made the related split and count functionality more robust with regard to these special cases, without sacrificing performance for the common case without quotes.

acs_s, would you mind closing your related PR?
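
The split behaviour discussed in this comment can be illustrated with a short Python sketch. This is not the SystemML implementation; `split_fields` is a hypothetical name, and the sketch assumes a single-character separator and RFC 4180-style quoting where `""` inside a quoted section is an escaped quote. It keeps a plain `str.split` fast path for lines without quotes and otherwise splits only on separators that fall outside quoted sections, so a token made of a quoted part followed by unquoted text stays in one piece.

```python
def split_fields(line, sep="\t"):
    """Split one line into raw tokens, ignoring separators inside quotes.

    Hypothetical helper; assumes `sep` is a single character.
    """
    # Fast path: without any quote character a plain split is sufficient.
    if '"' not in line:
        return line.split(sep)

    fields = []
    start = 0
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            # An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            fields.append(line[start:i])
            start = i + 1
    fields.append(line[start:])  # trailing token after the last separator
    return fields


# A quoted part followed by unquoted text stays together as one token.
print(split_fields('"x\ty"z\t1'))
```

The tokens are returned raw, with their quotes still attached; unquoting and unescaping would be a separate step. Counting the fields of a line under the same rules is then just the length of the returned list.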

Arvind Surve added a comment - 20/Feb/17 05:18

I have verified this issue with SystemML 0.13 nightly build (Spark 2.0).

This issue needs to be addressed on Spark 1.6 (branch 0.12) as well.
mboehm7, can you please apply these changes to branch 0.12 as well, before the next build on that branch?

Matthias Boehm added a comment - 20/Feb/17 06:42

I could look into this early next week - but the code is in master if needed earlier.

People

Assignee: Matthias Boehm

Reporter: Arvind Surve

Votes: 0

Watchers: 2

Dates

Created: 10/Feb/17 22:11

Updated: 20/Feb/17 06:42

Resolved: 15/Feb/17 04:00