Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35324

Spark SQL configs not respected in RDDs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.0.2, 3.1.1
    • None
    • Input/Output, SQL
    • None

    Description

      When reading a CSV file, spark.sql.timeParserPolicy is respected in actions on the resulting dataframe. But it's ignored in actions on dataframe's RDD.

      E.g. say to parse dates in a CSV you need spark.sql.timeParserPolicy to be set to LEGACY. If you set the config, df.collect will work as you'd expect. However, df.collect.rdd will fail because it'll ignore the override and read the config value as EXCEPTION.

      For instance:

      test.csv
      date
      2/6/18
      
      scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy")
      
      scala> val df = {
           |   spark.read
           |     .option("header", "true")
           |     .option("dateFormat", "MM/dd/yy")
           |     .schema("date date")
           |     .csv("/Users/wraschkowski/Downloads/test.csv")
           | }
      df: org.apache.spark.sql.DataFrame = [date: date]
      
      scala> df.show
      +----------+                                                                    
      |      date|
      +----------+
      |2018-02-06|
      +----------+
      
      
      scala> df.count
      res3: Long = 1
      
      scala> df.rdd.count
      21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
      org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
      	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
      	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
      	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
      	at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61)
      	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
      	at scala.Option.getOrElse(Option.scala:189)
      	at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396)
      	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400)
      	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
      	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866)
      	at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
      	at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
      	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      	at org.apache.spark.scheduler.Task.run(Task.scala:131)
      	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
      	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.time.format.DateTimeParseException: Text '2/6/18' could not be parsed at index 0
      	at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2046)
      	at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
      	at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:59)
      	... 32 more
      

      Attachments

        1. Screen Shot 2021-05-06 at 00.33.10.png
          535 kB
          Willi Raschkowski
        2. Screen Shot 2021-05-06 at 00.35.10.png
          621 kB
          Willi Raschkowski

        Issue Links

          Activity

            People

              Unassigned Unassigned
              rshkv Willi Raschkowski
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: