SPARK-18458: core dumped running Spark SQL on large data volume (100TB)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 2.1.0
    • Fix Version: 2.1.0
    • Component: SQL

    Description

      Running a query against a 100TB parquet database using a Spark master build dated 11/04 dumps core on the Spark executors.

      The query is TPC-DS query 82 (it is not the only query that can produce this core dump, just the easiest one with which to re-create the error).
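      For reference, a minimal sketch of how such a run can be set up from spark-shell; the parquet location and table registration below are placeholders rather than the actual environment used here:

      // Hypothetical reproduction sketch (the HDFS path is a placeholder, not the real cluster layout).
      // TPC-DS query 82 joins item, inventory, date_dim and store_sales and sorts the result.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("tpcds-q82-100tb").getOrCreate()

      // Register the 100TB parquet tables as temp views so the query can refer to them by name.
      Seq("item", "inventory", "date_dim", "store_sales").foreach { t =>
        spark.read.parquet(s"hdfs:///tpcds100tb/$t").createOrReplaceTempView(t)
      }

      // query82 would hold the standard TPC-DS query 82 text.
      // spark.sql(query82).collect()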

      Spark output that showed the exception:

      16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_000074 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch.
      Container id: container_e68_1478924651089_0018_01_000074
      Exit code: 134
      Exception message: /bin/bash: line 1: 4031216 Aborted                 (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
      
      Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted                 (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
      
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
              at org.apache.hadoop.util.Shell.run(Shell.java:456)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
              at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      
      
      Container exited with a non-zero exit code 134
      
      

      According to the source code, exit code 134 is 128 + 6, and signal 6 is SIGABRT, the abort signal raised by abort(3); the executors were killed by this signal rather than exiting on their own.
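      As a quick sanity check on that arithmetic, a small snippet (not from the report) that decodes a shell-style exit status the same way:

      // Exit statuses above 128 mean the process was killed by signal (status - 128);
      // 134 - 128 = 6, and signal 6 is SIGABRT.
      def describeExitStatus(status: Int): String =
        if (status > 128) s"killed by signal ${status - 128}"
        else s"exited with code $status"

      // describeExitStatus(134) returns "killed by signal 6"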

      On the YARN side, the log is clearer:

      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00007fffe29e6bac, pid=3694385, tid=140735430203136
      #
      # JRE version: OpenJDK Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
      # Java VM: OpenJDK 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # J 10342% C2 org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V (386 bytes) @ 0x00007fffe29e6bac [0x00007fffe29e43c0+0x27ec]
      #
      # Core dump written. Default location: /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/core or core.3694385
      #
      # An error report file with more information is saved as:
      # /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
      #
      # If you would like to submit a bug report, please visit:
      #   http://bugreport.java.com/bugreport/crash.jsp
      #
      

      And the hs_err_pid3694385.log shows the stack:

      Stack: [0x00007fff85432000,0x00007fff85533000],  sp=0x00007fff85530ce0,  free space=1019k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      J 3896 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x00007fffe1d3cdec [0x00007fffe1d3cde0+0xc]
      j  org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V+138
      j  org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArray(Lorg/apache/spark/unsafe/array/LongArray;IIIIZZ)I+209
      j  org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+56
      j  org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+62
      j  org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort()Lscala/collection/Iterator;+4
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+24
      j  org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
      j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext()Z+4
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;Lscala/collection/Iterator;Lscala/collection/Iterator;)Z+147
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V+552
      j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+17
      J 3849 C1 org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z (30 bytes) @ 0x00007fffe1d5679c [0x00007fffe1d56520+0x27c]
      j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext()Z+4
      j  scala.collection.Iterator$$anon$11.hasNext()Z+4
      j  org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V+3
      j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+222
      j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
      j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+152
      j  org.apache.spark.executor.Executor$TaskRunner.run()V+423
      j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
      j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
      j  java.lang.Thread.run()V+11
      v  ~StubRoutines::call_stub
      V  [libjvm.so+0x63d6ba]
      V  [libjvm.so+0x63ab74]
      V  [libjvm.so+0x63b189]
      V  [libjvm.so+0x67e6a1]
      V  [libjvm.so+0x9b3f5a]
      V  [libjvm.so+0x869722]
      C  [libpthread.so.0+0x7dc5]  start_thread+0xc5
      

      This is not easily reproducible at smaller data volumes, e.g., 1TB or 10TB, but is easily reproducible at 100TB+, so look into data types that may not be big enough to handle hundreds of billions of records.
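      One way a crash that only appears at this scale can arise (a hedged sketch of the suspected overflow, not a confirmed description of the eventual fix): if a byte offset inside the radix sort is computed in 32-bit int arithmetic, it wraps once a task's in-memory sort buffer grows past 2^31 bytes, and the wrapped offset handed to Platform.putLong becomes an out-of-bounds native write, which the JVM reports as the SIGSEGV above. For example:

      // Illustration only: how a 32-bit byte-offset computation wraps for large sort buffers
      // (the record count here is illustrative, not taken from the failing job).
      object OffsetOverflowSketch {
        def main(args: Array[String]): Unit = {
          val numRecords     = 150000000L  // 150 million key-prefix pairs
          val bytesPerRecord = 16L         // record pointer + key prefix = 2 longs

          // 64-bit arithmetic keeps the offset correct.
          val longOffset: Long = numRecords * bytesPerRecord        // 2400000000

          // The same computation forced through Int wraps negative.
          val intOffset: Int = (numRecords * bytesPerRecord).toInt  // -1894967296

          // A wrapped offset passed on to Platform.putLong (sun.misc.Unsafe underneath)
          // points outside the allocated LongArray, i.e. a segfault like the one logged above.
          println(s"64-bit offset = $longOffset, wrapped 32-bit offset = $intOffset")
        }
      }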

People

    Assignee: Kazuaki Ishizaki (kiszk)
    Reporter: Jesse Chen (jfchen@us.ibm.com)
    Votes: 0
    Watchers: 8
