Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.0.2, 3.1.1
- Fix Version/s: None
- Component/s: None
Description
When reading a CSV file, spark.sql.legacy.timeParserPolicy is respected in actions on the resulting DataFrame, but it is ignored in actions on the DataFrame's underlying RDD.
Say, for example, that parsing the dates in a CSV file requires spark.sql.legacy.timeParserPolicy to be set to LEGACY. With that config set, df.collect works as you'd expect. However, df.rdd.collect fails, because the RDD path ignores the override and reads the config at its default value, EXCEPTION.
For instance:
test.csv:

date
2/6/18
scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy")

scala> val df = {
     |   spark.read
     |     .option("header", "true")
     |     .option("dateFormat", "MM/dd/yy")
     |     .schema("date date")
     |     .csv("/Users/wraschkowski/Downloads/test.csv")
     | }
df: org.apache.spark.sql.DataFrame = [date: date]

scala> df.show
+----------+
|      date|
+----------+
|2018-02-06|
+----------+

scala> df.count
res3: Long = 1

scala> df.rdd.count
21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396)
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866)
	at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
	at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.time.format.DateTimeParseException: Text '2/6/18' could not be parsed at index 0
	at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2046)
	at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
	at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:59)
	... 32 more
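
A possible workaround, offered here only as a sketch: set the policy when the session is built (or via --conf on spark-submit) rather than at runtime with spark.conf.set. This rests on the unverified assumption that settings baked into the session's SparkConf remain visible on the plain-RDD code path, even though runtime overrides are lost. The file path and schema below mirror the repro above; the app name is made up.

import org.apache.spark.sql.SparkSession

// Sketch of a possible workaround. Assumption (not verified here): a config
// set at session-build time, unlike a runtime spark.conf.set override,
// survives on the RDD code path.
val spark = SparkSession.builder()
  .appName("timeParserPolicy-workaround")
  .config("spark.sql.legacy.timeParserPolicy", "LEGACY")  // set at build time
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yy")
  .schema("date date")
  .csv("/Users/wraschkowski/Downloads/test.csv")

df.count()      // DataFrame action: parses 2/6/18 with the LEGACY parser
df.rdd.count()  // RDD action: should now also see the LEGACY policy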
Issue Links
- relates to: SPARK-38328 "SQLConf.get flaky causes NON-default spark session settings being lost" (Open)
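
If the relation to SPARK-38328 holds, the effective config value can be probed directly on the executors. The snippet below is a hypothetical diagnostic, reusing df from the repro above; the expectation, stated as an assumption rather than a verified result, is that the plain-RDD path reports the default EXCEPTION instead of the LEGACY override.

import org.apache.spark.sql.internal.SQLConf

// Hypothetical probe: read the effective policy inside an executor task.
// The input iterator is deliberately not consumed, so the CSV is never
// parsed and the SparkUpgradeException above is not triggered.
// Expected under the stated assumption: Array("EXCEPTION").
df.rdd.mapPartitions { _ =>
  Iterator(SQLConf.get.getConfString("spark.sql.legacy.timeParserPolicy", "unset"))
}.collect().distinct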