Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25692

Remove static initialization of worker eventLoop handling chunk fetch requests within TransportContext

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 3.0.0
    • 3.0.0
    • Spark Core
    • None

    Description

      Looks like the whole test suite is pretty flaky. See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/

      This may be a regression in 3.0 as this didn't happen in 2.4 branch.

      Attachments

        1. Screen Shot 2018-11-01 at 10.17.16 AM.png
          536 kB
          Shixiong Zhu
        2. Screen Shot 2018-10-22 at 4.12.41 PM.png
          78 kB
          Sanket Reddy

        Issue Links

          Activity

            zsxwing Shixiong Zhu added a comment - It may be caused by https://github.com/apache/spark/pull/22173
            tgraves Thomas Graves added a comment -

            redsanket can you please take a look at this

            tgraves Thomas Graves added a comment - redsanket can you please take a look at this
            sanket991 Sanket Reddy added a comment -

            Sure tgraves zsxwing i am taking a look. Thanks for reporting.

            sanket991 Sanket Reddy added a comment - Sure tgraves zsxwing i am taking a look. Thanks for reporting.
            sanket991 Sanket Reddy added a comment -

            zsxwing I haven't been able to reproduce this, but the number of threads used for this tests are 2* number of cores or spark.shuffle.io.serverThreads. I think the test server seems to be down. Would appreciate any help to reproduce this.

            sanket991 Sanket Reddy added a comment - zsxwing  I haven't been able to reproduce this, but the number of threads used for this tests are 2* number of cores or spark.shuffle.io.serverThreads. I think the test server seems to be down. Would appreciate any help to reproduce this.
            sanket991 Sanket Reddy added a comment -

            https://github.com/apache/spark/pull/22628/files went it after https://github.com/apache/spark/pull/22173 not sure you are still seeing issues... I ran the tests in a loop for 100 times. Could not reproduce this.

            sanket991 Sanket Reddy added a comment - https://github.com/apache/spark/pull/22628/files  went it after  https://github.com/apache/spark/pull/22173  not sure you are still seeing issues... I ran the tests in a loop for 100 times. Could not reproduce this.
            zsxwing Shixiong Zhu added a comment -

            It's still flaky on Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/5554/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/fetchBothChunks/history/

            You may need to run the whole tests together. https://github.com/apache/spark/pull/22173 added a global thread pool, so other tests may also impact this test suite.

            zsxwing Shixiong Zhu added a comment - It's still flaky on Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/5554/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/fetchBothChunks/history/ You may need to run the whole tests together.  https://github.com/apache/spark/pull/22173  added a global thread pool, so other tests may also impact this test suite.
            zsxwing Shixiong Zhu added a comment -

            sanket991 You can download the unit test logs from https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/5554/artifact/common/network-common/target/ (It will be kept on Jenkins for several days)

            zsxwing Shixiong Zhu added a comment - sanket991 You can download the unit test logs from https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/5554/artifact/common/network-common/target/ (It will be kept on Jenkins for several days)
            dongjoon Dongjoon Hyun added a comment - - edited

            Hi, zsxwing and tgraves.

            While looking other failures, I notice that this failure still happens frequently.

            The failure is always `fetchBothChunks`. `amp-jenkins-worker-05` machine might be related.

            dongjoon Dongjoon Hyun added a comment - - edited Hi, zsxwing and tgraves . While looking other failures, I notice that this failure still happens frequently. The failure is always `fetchBothChunks`. `amp-jenkins-worker-05` machine might be related. master 5856 (amp-jenkins-worker-05) master 5837 (amp-jenkins-worker-05) master 5835 (amp-jenkins-worker-05) master 5829 (amp-jenkins-worker-05) master 5828 (amp-jenkins-worker-05) master 5822 (amp-jenkins-worker-05) master 5814 (amp-jenkins-worker-05) SparkPullRequestBuilder 100784 (amp-jenkins-worker-05) SparkPullRequestBuilder 100785 (amp-jenkins-worker-05) SparkPullRequestBuilder 100787 (amp-jenkins-worker-05) SparkPullRequestBuilder 100788 (amp-jenkins-worker-05)
            dongjoon Dongjoon Hyun added a comment - This is resolved via https://github.com/apache/spark/pull/23522 .
            dongjoon Dongjoon Hyun added a comment - - edited

            Sorry, guys. I'm reopening this since it's observed again.

            • https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5888/consoleFull (amp-jenkins-worker-05)
              [INFO] Running org.apache.spark.network.ChunkFetchIntegrationSuite
              [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 60.109 s <<< FAILURE! - in org.apache.spark.network.ChunkFetchIntegrationSuite
              [ERROR] fetchBothChunks(org.apache.spark.network.ChunkFetchIntegrationSuite)  Time elapsed: 60.017 s  <<< FAILURE!
              java.lang.AssertionError: Timeout getting response from the server
              	at org.apache.spark.network.ChunkFetchIntegrationSuite.fetchChunks(ChunkFetchIntegrationSuite.java:176)
              	at org.apache.spark.network.ChunkFetchIntegrationSuite.fetchBothChunks(ChunkFetchIntegrationSuite.java:210)
              
            dongjoon Dongjoon Hyun added a comment - - edited Sorry, guys. I'm reopening this since it's observed again. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5888/consoleFull (amp-jenkins-worker-05) [INFO] Running org.apache.spark.network.ChunkFetchIntegrationSuite [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 60.109 s <<< FAILURE! - in org.apache.spark.network.ChunkFetchIntegrationSuite [ERROR] fetchBothChunks(org.apache.spark.network.ChunkFetchIntegrationSuite) Time elapsed: 60.017 s <<< FAILURE! java.lang.AssertionError: Timeout getting response from the server at org.apache.spark.network.ChunkFetchIntegrationSuite.fetchChunks(ChunkFetchIntegrationSuite.java:176) at org.apache.spark.network.ChunkFetchIntegrationSuite.fetchBothChunks(ChunkFetchIntegrationSuite.java:210)
            sanket991 Sanket Reddy added a comment -

            I had a few observations regarding this test suite...

            When i run it on mac

            $ sysctl hw.physicalcpu hw.logicalcpu
            hw.physicalcpu: 4
            hw.logicalcpu: 8 

            I dont see the issue. This i think has got to do with io.serverThreads and the number of threads being used for handling chunk blocks and since there are multiple tests in the suite handling chunked blocks might require sufficient threads to handle the requests.

             

            On a vm i was able to reproduce this consistently

            -bash-4.1$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU('
            CPU(s): 4
            Thread(s) per core: 1
            Core(s) per socket: 1
            Socket(s): 4

            The root cause might be why it is actually failing is due to like zsxwing pointed out is due to https://github.com/apache/spark/blob/c00186f90cfcc33492d760f874ead34f0e3da6ed/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java#L88 sharing of worker threads.

            When I remove the static I no longer see the test failure.

             

            So do we really need it to be static?

            I dont think this requires a global declaration as these threads are only required on the shuffle server end and on the client TransportContext initialization i.e the Client don't initialize these threads. I assume for Shuffle Server there would be only one TransportContext object. So, I think this is fine to be an instance variable and I see no harm. Will do some testing again and if everything is fine will put up the pr...

             

            sanket991 Sanket Reddy added a comment - I had a few observations regarding this test suite... When i run it on mac $ sysctl hw.physicalcpu hw.logicalcpu hw.physicalcpu: 4 hw.logicalcpu: 8  I dont see the issue. This i think has got to do with io.serverThreads and the number of threads being used for handling chunk blocks and since there are multiple tests in the suite handling chunked blocks might require sufficient threads to handle the requests.   On a vm i was able to reproduce this consistently -bash-4.1$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU(' CPU(s): 4 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 4 The root cause might be why it is actually failing is due to like zsxwing pointed out is due to https://github.com/apache/spark/blob/c00186f90cfcc33492d760f874ead34f0e3da6ed/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java#L88  sharing of worker threads. When I remove the static I no longer see the test failure.   So do we really need it to be static? I dont think this requires a global declaration as these threads are only required on the shuffle server end and on the client TransportContext initialization i.e the Client don't initialize these threads. I assume for Shuffle Server there would be only one TransportContext object. So, I think this is fine to be an instance variable and I see no harm. Will do some testing again and if everything is fine will put up the pr...  
            sanket991 Sanket Reddy added a comment -

            Did some further digging

            How to reproduce
            ./build/mvn test -Dtest=org.apache.spark.network.RequestTimeoutIntegrationSuite,org.apache.spark.network.ChunkFetchIntegrationSuite -DwildcardSuites=None test
            furtherRequestsDelay Test within RequestTimeoutIntegrationSuite was holding onto worker references. The test does close the server context but since the threads are global and there is sleep of 60 secs to fetch a specific chunk within this test, it grabs on it and waits for the client to consume but however the test is testing for a request timeout and it times out after 10 secs, so the workers are just waiting there for the buffer to be consumed as per my understanding. I think we dont need this to be static as the server just initializes the TransportContext object once. I did some manual tests and it looks good

            sanket991 Sanket Reddy added a comment - Did some further digging How to reproduce ./build/mvn test -Dtest=org.apache.spark.network.RequestTimeoutIntegrationSuite,org.apache.spark.network.ChunkFetchIntegrationSuite -DwildcardSuites=None test furtherRequestsDelay Test within RequestTimeoutIntegrationSuite was holding onto worker references. The test does close the server context but since the threads are global and there is sleep of 60 secs to fetch a specific chunk within this test, it grabs on it and waits for the client to consume but however the test is testing for a request timeout and it times out after 10 secs, so the workers are just waiting there for the buffer to be consumed as per my understanding. I think we dont need this to be static as the server just initializes the TransportContext object once. I did some manual tests and it looks good
            sanket991 Sanket Reddy added a comment -

            Created a pr https://github.com/apache/spark/pull/23700 plz take a look thanks and let me know your thoughts

            sanket991 Sanket Reddy added a comment - Created a pr https://github.com/apache/spark/pull/23700  plz take a look thanks and let me know your thoughts
            srowen Sean R. Owen added a comment -

            Issue resolved by pull request 23700
            https://github.com/apache/spark/pull/23700

            srowen Sean R. Owen added a comment - Issue resolved by pull request 23700 https://github.com/apache/spark/pull/23700
            dongjoon Dongjoon Hyun added a comment -

            I update the JIRA title according to the last patch which is the real fix of the underlying issue.

            dongjoon Dongjoon Hyun added a comment - I update the JIRA title according to the last patch which is the real fix of the underlying issue.

            People

              sanket991 Sanket Reddy
              zsxwing Shixiong Zhu
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: