Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.8
-
None
-
macOS:
> uname -a Darwin Senzing-MacBook-Pro.local 21.4.0 Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64 x86_64
> java -version openjdk version "11.0.14" 2022-01-18 OpenJDK Runtime Environment Temurin-11.0.14+9 (build 11.0.14+9) OpenJDK 64-Bit Server VM Temurin-11.0.14+9 (build 11.0.14+9, mixed mode)
Linux:
> uname -a Linux lnxdev 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> java -version openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
Description
I have my CSVFormat initialized such that withTrim(true) has been set (see attached ZIP file):
CSVFormat csvFormat = CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreEmptyLines(true).withTrim(true);
However, a quoted string that is preceded by whitespace after the delimiter is not parsed properly. For example:
GIVEN_NAME,SURNAME,ADDRESS,PHONE_NUMBER
"Joe", "Schmoe","101 Main Street; Las Vegas, NV 89101","702-555-1212"
"John","Doe", "201 First Street; Las Vegas, NV 89102", "702-555-1313"
"Jane","Doe","301 Second Street; Las Vegas, NV 89103","702-555-1414"
- Note the whitespace preceding "Schmoe" in the first record. As a result, the parsed value retains the quotation marks instead of having them stripped.
- The whitespace preceding "201 First Street; Las Vegas, NV 89102" in the second record causes it to be parsed as two values: "201 First Street; Las Vegas and NV 89102".
- The third record is the only one that parses as expected.
I believe this happens because trimming is applied after the value has been parsed, rather than the whitespace following the delimiter being consumed during parsing. Either that, or the check for a quoted string occurs before the whitespace is consumed.
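The hypothesis above can be modeled with a small stand-alone sketch. It is not the Commons CSV lexer; the class name TrimOrderSketch, the split method, and the skipLeadingWhitespace flag are all hypothetical and exist only for illustration. With the flag off, the quote check runs before any whitespace is consumed and trimming is applied afterwards, which reproduces the symptoms reported here. With the flag on, whitespace after the delimiter is consumed first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the suspected ordering problem; NOT the Commons CSV implementation.
final class TrimOrderSketch {

    // Splits one comma-delimited record. When skipLeadingWhitespace is true,
    // spaces after a delimiter are consumed before the opening-quote check.
    static List<String> split(String line, boolean skipLeadingWhitespace) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (true) {
            if (skipLeadingWhitespace) {
                while (i < n && line.charAt(i) == ' ') {
                    i++;
                }
            }
            StringBuilder value = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening quote
                while (i < n) {
                    char c = line.charAt(i);
                    if (c != '"') {
                        value.append(c);
                        i++;
                    } else if (i + 1 < n && line.charAt(i + 1) == '"') {
                        value.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else {
                        i++; // closing quote
                        break;
                    }
                }
                // Ignore anything between the closing quote and the next delimiter.
                while (i < n && line.charAt(i) != ',') {
                    i++;
                }
                fields.add(value.toString());
            } else {
                // Unquoted value: read up to the delimiter, then trim afterwards.
                while (i < n && line.charAt(i) != ',') {
                    value.append(line.charAt(i));
                    i++;
                }
                fields.add(value.toString().trim());
            }
            if (i >= n) {
                break;
            }
            i++; // skip the delimiter
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "\"John\",\"Doe\", \"201 First Street; Las Vegas, NV 89102\", \"702-555-1313\"";
        System.out.println(split(record, false).size()); // prints 5: the address is split in two
        System.out.println(split(record, true).size());  // prints 4
    }
}
```

In this model, trimming after the fact cannot repair the record, because by then the embedded comma has already been treated as a delimiter.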
NOTE: I have attached a ZIP file that easily reproduces the problem with the CSV file given above.
To build the attached project, use Apache Maven, then run it with Java 11:
> unzip csvfail.zip
> cd csvfail
> mvn package
> java -jar target/csv-fail-1.0-SNAPSHOT.jar