Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Description
FSUtils.getAllPartitionPaths is expected to work whether or not the metadata table is enabled, and it can also be called from within a Spark context. However, it looks like the attempt to improve parallelism is causing NotSerializableExceptions. Multiple callers (clustering, cleaner) invoke it from within a Spark context.
See the stack trace below (a minimal sketch of the serialization failure follows it):
21/04/20 17:28:44 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:321)
at org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy.generateClusteringPlan(PartitionAwareClusteringPlanStrategy.java:67)
at org.apache.hudi.table.action.cluster.SparkClusteringPlanActionExecutor.createClusteringPlan(SparkClusteringPlanActionExecutor.java:71)
at org.apache.hudi.table.action.cluster.BaseClusteringPlanActionExecutor.execute(BaseClusteringPlanActionExecutor.java:56)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleClustering(HoodieSparkCopyOnWriteTable.java:160)
at org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClusteringAtInstant(AbstractHoodieWriteClient.java:873)
at org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClustering(AbstractHoodieWriteClient.java:861)
at com.uber.data.efficiency.hudi.HudiRewriter.rewriteDataUsingHudi(HudiRewriter.java:111)
at com.uber.data.efficiency.hudi.HudiRewriter.main(HudiRewriter.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 53, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
- object not serializable (class: org.apache.hadoop.fs.Path, value: hdfs://...)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object
- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(hdfs://...))
- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@735)
- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
- object (class org.apache.spark.scheduler.ResultTask, ResultTask(1, 0))
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1904)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1892)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1891)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1891)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2125)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2074)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2063)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2091)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2110)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:968)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:967)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:79)
at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:79)
at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:319)
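For context, here is a minimal, hypothetical sketch (not the Hudi code itself) of the failure mode the trace shows: org.apache.hadoop.fs.Path does not implement java.io.Serializable, so parallelizing Path objects directly makes Spark fail while serializing the ResultTask, whereas shipping plain Strings and rebuilding the Path on the executors avoids it. Class and path names below are illustrative only.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PathSerializationSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("path-serialization-sketch")
        .setMaster("local[2]");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      List<Path> dirs = Arrays.asList(new Path("hdfs://nameservice/table/2021/04/20"));

      // Fails at task serialization time, like the trace above:
      // Path ends up inside a ParallelCollectionPartition and is not Serializable.
      // jsc.parallelize(dirs).map(Path::toString).collect();

      // Works: ship plain Strings to the executors and rebuild the Path there,
      // so no Path instance is ever serialized by Spark.
      List<String> dirStrings = dirs.stream().map(Path::toString).collect(Collectors.toList());
      List<String> listed = jsc.parallelize(dirStrings)
          .map(p -> new Path(p).getName()) // Path is constructed on the executor
          .collect();
      System.out.println(listed);
    }
  }
}

This matches the trace: the NotSerializableException is raised on org.apache.hadoop.fs.Path inside a ParallelCollectionPartition, triggered by HoodieSparkEngineContext.map as called from FileSystemBackedTableMetadata.getAllPartitionPaths.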