Description
When the code below is run on PySpark v2.2.0, the cached input DataFrame df disappears from the Spark UI Storage tab after SQLTransformer.transform(...) is called on it.
I don't yet know whether this is only a Spark UI display bug, or whether the input DataFrame df is actually unpersisted from memory. If the latter, this would be a serious bug, because any new computation using new_df would have to redo all the work leading up to df.
import pandas
import pyspark
from pyspark.ml.feature import SQLTransformer

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1])))

# after the step below, the SparkUI Storage tab shows 1 cached RDD
df.cache()
df.count()

# after the step below, the cached RDD disappears from the SparkUI Storage tab
new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df)