Status: Resolved
Resolution: Resolved
1.5.2, 1.6.2, 1.6.3
Spark standalone and Spark on YARN
Steps to reproduce:
1. Launch spark-shell
2. Run the following Scala code in spark-shell:
scala> val hivesampletabledf = sqlContext.table("hivesampletable")
scala> import org.apache.spark.sql.DataFrameWriter
scala> val dfw : DataFrameWriter = hivesampletabledf.write
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
scala> dfw.insertInto("hivesampletablecopypy")
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
3. In HDFS (in our case, WASB), we can see the following folders
The issue is that these folders are never cleaned up, so they accumulate.
Working with the customer, we tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml. It made no difference.
The .hive-staging folders are created under the <TableName> folder: hive/warehouse/hivesampletablecopypy/
We also tried adding this property to hive-site.xml and restarting the components -
A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
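To make the accumulation easier to confirm, the leftover folders under a table directory can be listed from spark-shell. The following is only a minimal sketch, not part of the original reproduction: the helper names isHiveStagingDir and listHiveStagingDirs are hypothetical, the ".hive-staging" prefix is taken from the folder names reported above, and the table path in the usage comment is an assumption that must be adjusted to the actual warehouse location. It uses the standard Hadoop FileSystem API and only reads, it does not delete anything.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical helper: staging folders reported in this issue start with ".hive-staging".
def isHiveStagingDir(name: String): Boolean = name.startsWith(".hive-staging")

// Hypothetical helper: list staging directories directly under a table directory.
// The file system (HDFS, WASB, ...) is resolved from the path and the given configuration.
def listHiveStagingDirs(tableDir: String, conf: Configuration): Array[FileStatus] = {
  val path = new Path(tableDir)
  val fs = path.getFileSystem(conf)
  fs.listStatus(path).filter(s => s.isDirectory && isHiveStagingDir(s.getPath.getName))
}

// Usage in spark-shell, where `sc` is predefined (the path is an ASSUMPTION):
//   listHiveStagingDirs("hive/warehouse/hivesampletablecopypy", sc.hadoopConfiguration)
//     .foreach(s => println(s"${s.getPath}  modified=${new java.util.Date(s.getModificationTime)}"))
```

Running the listing before and after dfw.insertInto(...) should show whether each insert leaves one more staging folder behind.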
Moreover, if we run the Hive query in pure Hive via the Hive CLI on the same Spark cluster, we don't see this behavior.
So it doesn't appear to be a Hive issue in this case; it is Spark behavior.
I checked in Ambari: spark.yarn.preserve.staging.files=false is already set in the Spark configuration.
The issue also happens via spark-submit. The customer used the following command to reproduce it -