Description
Running a query on 100TB parquet database using the Spark master dated 11/04 dump cores on Spark executors.
The query is TPCDS query 82 (though this query is not the only one can produce this core dump, just the easiest one to re-create the error).
Spark output that showed the exception:
16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_000074 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch. Container id: container_e68_1478924651089_0018_01_000074 Exit code: 134 Exception message: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 134
According to the source code, exit code 134 is 128+6, and 6 is SIGABRT 6 Core Abort signal from abort(3). The external signal killed executors.
On the YARN side, the log is more clear:
# # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007fffe29e6bac, pid=3694385, tid=140735430203136 # # JRE version: OpenJDK Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03) # Java VM: OpenJDK 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 10342% C2 org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V (386 bytes) @ 0x00007fffe29e6bac [0x00007fffe29e43c0+0x27ec] # # Core dump written. Default location: /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/core or core.3694385 # # An error report file with more information is saved as: # /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp #
And the hs_err_pid3694385.log shows the stack:
Stack: [0x00007fff85432000,0x00007fff85533000], sp=0x00007fff85530ce0, free space=1019k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) J 3896 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x00007fffe1d3cdec [0x00007fffe1d3cde0+0xc] j org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V+138 j org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArray(Lorg/apache/spark/unsafe/array/LongArray;IIIIZZ)I+209 j org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+56 j org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+62 j org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort()Lscala/collection/Iterator;+4 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+24 j org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11 j org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext()Z+4 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;Lscala/collection/Iterator;Lscala/collection/Iterator;)Z+147 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V+552 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+17 J 3849 C1 org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z (30 bytes) @ 0x00007fffe1d5679c [0x00007fffe1d56520+0x27c] j org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext()Z+4 j scala.collection.Iterator$$anon$11.hasNext()Z+4 j org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V+3 j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+222 j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2 j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+152 j org.apache.spark.executor.Executor$TaskRunner.run()V+423 j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub V [libjvm.so+0x63d6ba] V [libjvm.so+0x63ab74] V [libjvm.so+0x63b189] V [libjvm.so+0x67e6a1] V [libjvm.so+0x9b3f5a] V [libjvm.so+0x869722] C [libpthread.so.0+0x7dc5] start_thread+0xc5
This is not easily reproducible on smaller data volumes, e.g., 1TB or 10TB, but easily reproducible on 100TB+...so look into data types that may not be big enough to handle hundreds of billion.
Attachments
Issue Links
- is cloned by
-
SPARK-18745 java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
- Resolved
- links to