Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.2.0, 2.2.1
- Fix Version/s: None
Description
Spark incorrectly processes cached data when the Kryo serializer is used with the unsafe option enabled.
A distinct count computed from the cache returns the wrong result. A reproduction example is below:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("unsafe-issue")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.unsafe", "true")
  .config("spark.kryo.registrationRequired", "false")
  .getOrCreate()

val devicesDF = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .load("/data/Devices.tsv").cache()

val gatewaysDF = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .load("/data/Gateways.tsv").cache()

val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), "inner").cache()

devJoinedDF.printSchema()

println(devJoinedDF.count())
println(devJoinedDF.select("DeviceId").distinct().count())
println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())
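Since the incorrect results only appear with `spark.kryo.unsafe` enabled, a plausible workaround (an assumption, not confirmed in this report) is to keep the Kryo serializer but leave the unsafe option at its default of `false`:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a presumed workaround: same session as the repro above,
// but with spark.kryo.unsafe disabled (or simply omitted, as false is
// the default). Kryo serialization itself is still in effect.
val spark = SparkSession
  .builder
  .appName("unsafe-issue-workaround")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.unsafe", "false") // avoid the unsafe ser/deser path
  .getOrCreate()
```

Per the linked SPARK-27216, the underlying cause was addressed by upgrading RoaringBitmap, so Spark builds that include that fix should not need this workaround.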
Attachments
Issue Links
- relates to
  - SPARK-27216 Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue (Resolved)