Spark / SPARK-23178

Kryo Unsafe problems with count distinct from cache


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0, 2.2.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      Spark incorrectly processes cached data when the Kryo serializer is used with spark.kryo.unsafe=true.

      A distinct count over cached data returns incorrect results. Example below:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession
        .builder
        .appName("unsafe-issue")
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.unsafe", "true")
        .config("spark.kryo.registrationRequired", "false")
        .getOrCreate()

      val devicesDF = spark.read.format("csv")
        .option("header", "true")
        .option("delimiter", "\t")
        .load("/data/Devices.tsv")
        .cache()

      val gatewaysDF = spark.read.format("csv")
        .option("header", "true")
        .option("delimiter", "\t")
        .load("/data/Gateways.tsv")
        .cache()

      val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), "inner").cache()
      devJoinedDF.printSchema()

      // Total row count of the cached join is correct.
      println(devJoinedDF.count())

      // Distinct count of DeviceId is wrong when spark.kryo.unsafe=true.
      println(devJoinedDF.select("DeviceId").distinct().count())

      // Cross-checks: DeviceIds appearing more than once vs. exactly once.
      println(devJoinedDF.groupBy("DeviceId").count().filter("count > 1").count())

      println(devJoinedDF.groupBy("DeviceId").count().filter("count = 1").count())
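
      As a workaround sketch (not stated in the report itself): spark.kryo.unsafe defaults to false, so leaving it unset, or disabling it explicitly, keeps Kryo serialization while avoiding the unsafe-based IO path that this issue implicates. A minimal spark-defaults.conf fragment, assuming the rest of the job configuration is unchanged:

      ```
      # spark-defaults.conf: keep the Kryo serializer but disable the
      # unsafe-based Kryo IO path (false is also the default value)
      spark.serializer      org.apache.spark.serializer.KryoSerializer
      spark.kryo.unsafe     false
      ```

      The same effect can be had in code via .config("spark.kryo.unsafe", "false") on the SparkSession builder shown above.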

      Attachments

        1. Unsafe-off.png (19 kB), KIryl Sultanau
        2. Unsafe-issue.png (12 kB), KIryl Sultanau


            People

              Assignee: Unassigned
              Reporter: KIryl Sultanau (kirills2006)
              Votes: 0
              Watchers: 6
