Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
Of late, we are incurring timeouts for job module 4 in azure for UT module.
Its very strange from the logs. There are only 15 tests triggered and all 15 of them are complete. And the test is just stuck in there. after 3 hour 40 mins or so, entire job run is cancelled and we don't see any logs in between.
Excerpt from logs:
2023-08-23T08:16:30.2299637Z [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 s - in org.apache.hudi.hadoop.TestAnnotation 2023-08-23T08:16:30.2319408Z [INFO] Running org.apache.hudi.hadoop.hive.TestHoodieCombineHiveInputFormat 2023-08-23T08:16:30.2421972Z Formatting using clusterid: testClusterID 2023-08-23T08:16:30.2570394Z 31120 [Listener at localhost/41109] WARN org.apache.hadoop.conf.Configuration [] - No unit for dfs.heartbeat.interval(3) assuming SECONDS 2023-08-23T08:16:30.2574429Z 31121 [Listener at localhost/41109] WARN org.apache.hadoop.conf.Configuration [] - No unit for dfs.namenode.safemode.extension(0) assuming MILLISECONDS 2023-08-23T08:16:30.5588638Z 31422 [Listener at localhost/41109] WARN org.apache.hadoop.conf.Configuration [] - No unit for dfs.heartbeat.interval(3) assuming SECONDS 2023-08-23T08:16:30.5589693Z 31422 [Listener at localhost/41109] WARN org.apache.hadoop.conf.Configuration [] - No unit for dfs.namenode.safemode.extension(0) assuming MILLISECONDS 2023-08-23T08:16:30.6633485Z 31526 [Listener at localhost/33189] WARN org.apache.hadoop.conf.Configuration [] - No unit for dfs.datanode.outliers.report.interval(1800000) assuming MILLISECONDS 2023-08-23T08:16:34.4630382Z 35326 [Listener at localhost/41177] WARN org.apache.hadoop.hdfs.server.datanode.DirectoryScanner [] - DirectoryScanner: shutdown has been called 2023-08-23T08:16:34.5845964Z 35447 [BP-118616970-10.1.148.0-1692778590260 heartbeating to localhost/127.0.0.1:33189] WARN org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager [] - IncrementalBlockReportManager interrupted 2023-08-23T08:16:34.5895718Z 35447 [BP-118616970-10.1.148.0-1692778590260 heartbeating to localhost/127.0.0.1:33189] WARN org.apache.hadoop.hdfs.server.datanode.DataNode [] - Ending block pool service for: Block pool BP-118616970-10.1.148.0-1692778590260 (Datanode Uuid eaf09d45-eb56-49ca-bec3-2da8a8f7bd14) service to localhost/127.0.0.1:33189 2023-08-23T08:16:34.5952216Z 35458 [refreshUsed-/tmp/hdfs-test-service16927785902315510072575960611105/data/data2/current/BP-118616970-10.1.148.0-1692778590260] WARN org.apache.hadoop.fs.CachingGetSpaceUsed [] - Thread Interrupted waiting to refresh disk information: sleep interrupted 2023-08-23T08:16:34.5971425Z 35458 [refreshUsed-/tmp/hdfs-test-service16927785902315510072575960611105/data/data1/current/BP-118616970-10.1.148.0-1692778590260] WARN org.apache.hadoop.fs.CachingGetSpaceUsed [] - Thread Interrupted waiting to refresh disk information: sleep interrupted 2023-08-23T08:16:34.6255412Z 35488 [2085456234@qtp-936129838-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@localhost:45383] WARN org.apache.hadoop.http.HttpServer2 [] - HttpServer Acceptor: isRunning is false. Rechecking. 2023-08-23T08:16:34.6289510Z 35492 [2085456234@qtp-936129838-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@localhost:45383] WARN org.apache.hadoop.http.HttpServer2 [] - HttpServer Acceptor: isRunning is false 2023-08-23T08:16:34.7378002Z [WARNING] Tests run: 4, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 4.503 s - in org.apache.hudi.hadoop.hive.TestHoodieCombineHiveInputFormat 2023-08-23T12:00:25.4660529Z ##[error]The operation was canceled. 2023-08-23T12:00:25.4674798Z ##[section]Finishing: UT other modules
Just check for log timing for last 3 lines.