Description
Calling dropna with a subset that references a column inside a struct drops the entire DataFrame.
import pyspark.sql.functions as F

df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)],
                           schema=['age', 'height', 'name'])
df.show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|  null|  Bob|
| 15|    80| null|
+---+------+-----+

# this works just fine
df.dropna(subset=['name']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|  null|  Bob|
+---+------+-----+

# now add a struct column
df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
df_with_struct.show(truncate=False)
+---+------+-----+--------------+
|age|height|name |struct_col    |
+---+------+-----+--------------+
|5  |80    |Alice|[5, 80, Alice]|
|10 |null  |Bob  |[10,, Bob]    |
|15 |80    |null |[15, 80,]     |
+---+------+-----+--------------+

# now dropna drops the whole dataframe when you use struct_col
df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
+---+------+----+----------+
|age|height|name|struct_col|
+---+------+----+----------+
+---+------+----+----------+
I've tested the above code on Spark 2.4.4 with Python 3.7.4 and on Spark 2.3.1 with Python 3.6.8, and in both cases the result looks like this:
df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
+---+------+-----+--------------+
|age|height|name |struct_col    |
+---+------+-----+--------------+
|5  |80    |Alice|[5, 80, Alice]|
|10 |null  |Bob  |[10,, Bob]    |
+---+------+-----+--------------+
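A possible workaround, not part of the original report and assuming only the standard Column API, is to filter on the nested field directly instead of going through dropna, since Column.isNotNull() resolves 'struct_col.name' as expected. A minimal sketch:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the DataFrame from the report above.
df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)],
                           schema=['age', 'height', 'name'])
df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))

# Filter on the nested field instead of dropna(subset=['struct_col.name']).
# Only the row whose nested 'name' is null should be dropped.
df_with_struct.filter(F.col('struct_col.name').isNotNull()).show(truncate=False)

This should keep the two rows with a non-null nested 'name', matching the output shown above for Spark 2.4.4 and 2.3.1.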