Description
Calling dropna with a subset that references a column inside a struct drops the entire DataFrame.
import pyspark.sql.functions as F

df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)],
                           schema=['age', 'height', 'name'])
df.show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|  null|  Bob|
| 15|    80| null|
+---+------+-----+

# this works just fine
df.dropna(subset=['name']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|  null|  Bob|
+---+------+-----+

# now add a struct column
df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
df_with_struct.show(truncate=False)
+---+------+-----+--------------+
|age|height|name |struct_col    |
+---+------+-----+--------------+
|5  |80    |Alice|[5, 80, Alice]|
|10 |null  |Bob  |[10,, Bob]    |
|15 |80    |null |[15, 80,]     |
+---+------+-----+--------------+

# now dropna drops the whole dataframe when you use struct_col
df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
+---+------+----+----------+
|age|height|name|struct_col|
+---+------+----+----------+
+---+------+----+----------+
I've tested the above code on Spark 2.4.4 with Python 3.7.4 and on Spark 2.3.1 with Python 3.6.8, and in both cases the result looks like this:
df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
+---+------+-----+--------------+
|age|height|name |struct_col    |
+---+------+-----+--------------+
|5  |80    |Alice|[5, 80, Alice]|
|10 |null  |Bob  |[10,, Bob]    |
+---+------+-----+--------------+
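A possible workaround, not part of the original report and assuming only the standard Column API, is to filter on the nested field directly instead of going through dropna, since Column.isNotNull() resolves 'struct_col.name' as expected. A minimal sketch:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the DataFrame from the report above.
df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)],
                           schema=['age', 'height', 'name'])
df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))

# Filter on the nested field instead of dropna(subset=['struct_col.name']).
# Only the row whose nested 'name' is null should be dropped.
df_with_struct.filter(F.col('struct_col.name').isNotNull()).show(truncate=False)

This should keep the two rows with a non-null nested 'name', matching the output shown above for Spark 2.4.4 and 2.3.1.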