Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.10.2
-
None
-
None
Description
We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in debugging it.
We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle handler.
2022-11-30 13:19:26627 [INFO] Fetcher_O {Map_4 -> Reducer_5} #5 |orderedgrouped.FetcherOrderedGrouped|:
345 2022-11-30 13:19:26628 [INFO] Fetcher_O {Map_4 -> Reducer_5} #5 |orderedgrouped.FetcherOrderedGrouped|: Read timed out
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
java.io.BufferedInputStream.read(BufferedInputStream.java:345)
java.io.FilterInputStream.read(FilterInputStream.java:133)
sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
java.io.DataInputStream.readByte(DataInputStream.java:265)
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)
org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)]
2022-11-30 13:19:26628 [WARN] Fetcher_O {Map_4 -> Reducer_5} #5 |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed retry it.
In the node manager logs, i can see and other logs
2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request verfied. enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....
There could be some issue in closing the connections while upgrading netty version.
The case is reproducible every time, let me know if you need more logs or want me to check something.
cc: abstractdog
Attachments
Attachments
Issue Links
- Is contained by
-
TEZ-4481 Prepare for Tez 0.10.3 release
- Open
- is related to
-
HADOOP-15327 Upgrade MR ShuffleHandler to use Netty4
- Resolved
- links to