Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Resolved
Affects Version/s: 1.5.2, 1.6.2, 1.6.3
Fix Version/s: None
Environment: spark standalone and spark yarn
Description
Steps to reproduce:
================
1. Launch spark-shell
2. Run the following Scala code via spark-shell
scala> val hivesampletabledf = sqlContext.table("hivesampletable")
scala> import org.apache.spark.sql.DataFrameWriter
scala> val dfw : DataFrameWriter = hivesampletabledf.write
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
scala> dfw.insertInto("hivesampletablecopypy")
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
scala> hivesampletablecopypydfdf.show
3. In HDFS (in our case, WASB), we can see the following folders:
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
The issue is that these .hive-staging folders are never cleaned up and keep accumulating.
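For reference, here is a minimal manual-cleanup sketch for spark-shell (my own workaround idea, not something the customer has run): it lists any leftover .hive-staging_* directories under the table folder shown above and deletes them recursively. It assumes the table location above is an absolute path in the default filesystem (WASB here) and that no insert is currently running against the table.

import org.apache.hadoop.fs.Path

// Table folder from the listing above (assumed path).
val tableDir = new Path("/hive/warehouse/hivesampletablecopypy")
val fs = tableDir.getFileSystem(sc.hadoopConfiguration)

// Delete any leftover .hive-staging_* directories that were not cleaned up.
fs.listStatus(tableDir)
  .filter(status => status.isDirectory && status.getPath.getName.startsWith(".hive-staging"))
  .foreach { status =>
    println(s"Deleting leftover staging dir: ${status.getPath}")
    fs.delete(status.getPath, true) // recursive delete
  }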
=====
With the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml; it did not make any difference.
The .hive-staging folders are created under the <TableName> folder, i.e. hive/warehouse/hivesampletablecopypy/.
We have also tried adding the following property to hive-site.xml and restarting the components:
<property>
  <name>hive.exec.stagingdir</name>
  <value>$/${user.name}/.staging</value>
</property>
A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
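One more thing that might be worth trying (an assumption on my part, not something we have verified with the customer): setting the property directly on the Spark SQL context before the insert, in case the override in hive-site.xml is not picked up by Spark's embedded Hive client. It is unclear whether the Spark 1.5/1.6 insert path honors this setting; the path below is only an example.

scala> sqlContext.setConf("hive.exec.stagingdir", "/tmp/hive/.hive-staging")
scala> dfw.insertInto("hivesampletablecopypy")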
Moreover, if we run the same Hive query in pure Hive via the Hive CLI on the same Spark cluster, we do not see this behavior,
so it does not appear to be a Hive issue in this case; this is Spark behavior.
I checked in Ambari; spark.yarn.preserve.staging.files is already set to false in the Spark configuration.
The issue happens via spark-submit as well; the customer used the following command to reproduce it:
spark-submit test-hive-staging-cleanup.py