Spark / SPARK-32147

Spark: partitionBy changes the partition column's values

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core, Spark Shell

    Description

      When a DataFrame is saved as Parquet or CSV with partitionBy, partition column values that consist of digits followed by 'f' or 'd' (for example "6f" or "7d") come back changed when the data is read again.

      Example:

      scala> val df = Seq(
       | ("9q", 1),
       | ("3k", 2),
       | ("6f", 3),
       | ("7f", 4),
       | ("7d", 5)
       | ).toDF("value", "id")
      df: org.apache.spark.sql.DataFrame = [value: string, id: int]
      scala> df.show(false)
      +-----+---+
      |value|id |
      +-----+---+
      |  9q | 1 |
      |  3k | 2 |
      |  6f | 3 |
      |  7f | 4 |
      |  7d | 5 |
      +-----+---+
      
      scala> import org.apache.spark.sql.SaveMode
      scala> df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet("tmp_parquet")
      scala> spark.read.parquet("tmp_parquet").show(false)
      +---+-----+
      |id |value|
      +---+-----+
      |5  | 7.0 |
      |3  | 6.0 |
      |2  | 3k  |
      |4  | 7.0 |
      |1  | 9q  |
      +---+-----+
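
The read-back values above ("6f" as 6.0, "7f" and "7d" as 7.0) are consistent with Java's floating-point parsing, which accepts a trailing 'f' or 'd' type suffix. The sketch below is plain Scala and needs no Spark. It only shows that these strings parse as doubles while "9q" and "3k" do not. That Spark's partition type inference relies on this kind of parsing is an assumption here, not something confirmed in this report. `SuffixParsingDemo` and `asDouble` are hypothetical names used for illustration.

```scala
// Plain Scala, no Spark required.
// java.lang.Double.parseDouble accepts the 'f'/'F' and 'd'/'D' type suffixes,
// so "6f" and "7d" are valid floating-point literals for it.
object SuffixParsingDemo {
  // Returns Some(number) if the string parses as a double, otherwise None.
  def asDouble(raw: String): Option[Double] =
    scala.util.Try(java.lang.Double.parseDouble(raw)).toOption

  def main(args: Array[String]): Unit = {
    // Same values as the "value" column in the example above.
    Seq("9q", "3k", "6f", "7f", "7d").foreach { v =>
      println(s"$v -> ${asDouble(v)}")
    }
    // prints:
    // 9q -> None
    // 3k -> None
    // 6f -> Some(6.0)
    // 7f -> Some(7.0)
    // 7d -> Some(7.0)
  }
}
```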
      
      

      The same happens with other formats, such as CSV. Is this a bug, or is it expected behavior?
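
One possible workaround, sketched under assumptions: Spark infers the types of partition columns from the directory names when reading, and the setting `spark.sql.sources.partitionColumnTypeInference.enabled` controls that inference. With it set to false, partition columns should come back as strings. The object name `PartitionValuesAsStrings`, the helper `roundTrip`, and the local master are hypothetical choices for illustration. The path reuses `tmp_parquet` from the example.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionValuesAsStrings {
  // Writes the example data partitioned by "value", reads it back,
  // and returns the "value" column ordered by "id".
  def roundTrip(spark: SparkSession, path: String): Seq[String] = {
    import spark.implicits._
    val df = Seq(("9q", 1), ("3k", 2), ("6f", 3), ("7f", 4), ("7d", 5))
      .toDF("value", "id")
    df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet(path)
    // Assumes inference is disabled on this session, so "value" is a string column.
    spark.read.parquet(path).orderBy("id").select("value").as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-values-as-strings")
      .master("local[*]") // assumption: local run, for illustration only
      // Keep partition values as strings instead of inferring their types.
      .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
      .getOrCreate()

    roundTrip(spark, "tmp_parquet").foreach(println)
    spark.stop()
  }
}
```

In spark-shell, the same setting can be applied to the running session with `spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")` before reading the data.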

      Taken from Stack Overflow: https://stackoverflow.com/questions/62671684/spark-incorrectly-intepret-partition-name-ending-with-d-or-f-as-number-when

       

          People

            Assignee: Unassigned
            Reporter: Shankar Koirala (kornsanz)
            Votes: 0
            Watchers: 2
