Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-4567

GetFileBlockLocations should return the NetworkTopology information of the machines that hosts those blocks

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.20.0
    • None
    • None
    • Incompatible change, Reviewed
    • Changed GetFileBlockLocations to return topology information for nodes that host the block replicas.

    Description

      MultiFileInputFormat and FileInputFormat should use block locality information to construct splits.

      Attachments

        1. dfsRackLocation3.patch
          6 kB
          Dhruba Borthakur
        2. dfsRackLocation2.patch
          6 kB
          Dhruba Borthakur
        3. dfsRackLocation.patch
          5 kB
          Dhruba Borthakur

        Issue Links

          Activity

            FileSystem.getFileBlockLocation also returns the rack location of the blocks that are returned.

            dhruba Dhruba Borthakur added a comment - FileSystem.getFileBlockLocation also returns the rack location of the blocks that are returned.

            Review comments welcome.

            dhruba Dhruba Borthakur added a comment - Review comments welcome.
            hadoopqa Hadoop QA added a comment -

            +1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12393204/dfsRackLocation.patch
            against trunk revision 709609.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 3 new or modified tests.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 findbugs. The patch does not introduce any new Findbugs warnings.

            +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

            +1 core tests. The patch passed core unit tests.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/testReport/
            Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
            Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/checkstyle-errors.html
            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12393204/dfsRackLocation.patch against trunk revision 709609. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3519/console This message is automatically generated.

            This looks good. The only debatable point is whether to introduce a new 'racks' variable in BlockLocations or to just prefix the network topology in the hosts variable itself. Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case). However, having a separate racks variable does makes it easier to handle cases where there is no topology information available, just a simple check on the racks variable would do instead of adding logic during the parsing of host names. I am fine with either approach.

            jothipn Jothi Padmanabhan added a comment - This looks good. The only debatable point is whether to introduce a new 'racks' variable in BlockLocations or to just prefix the network topology in the hosts variable itself. Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case). However, having a separate racks variable does makes it easier to handle cases where there is no topology information available, just a simple check on the racks variable would do instead of adding logic during the parsing of host names. I am fine with either approach.

            Thanks for the review comments. I made it a separate variable so that the API clearly identifies the hostname and the network topology.

            >Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case

            I missed your point. Can you pl explain it in greater detal? Thanks.

            dhruba Dhruba Borthakur added a comment - Thanks for the review comments. I made it a separate variable so that the API clearly identifies the hostname and the network topology. >Having a separate racks variable implies that there is an implicit requirement that the the ordering of hosts and the racks are identical (which is true in this case I missed your point. Can you pl explain it in greater detal? Thanks.

            Sorry, let me try to explain it better. Let us assume that the replication factor is 3 and we have one block. GetFileBlockLocation would return in BlockLocations[0].hosts

            {h1, h2, h3}

            and in BlockLocations[0].racks

            {r1,r1,r2}

            where h1 and h2 are in r1 and h3 is in r2. The topologyPaths for the different hosts are determined by their relative positions in their arrays. The topologyPath for h1, which is in index 0 of hosts array, is r1 (index 0 in the racks array). An alternative could have just been to return /r1/h1, /r1/h2 and /r2/h3 in the hosts. That way, if somebody wants both the host and rack information, they do not need to construct it by reading the same index in two arrays. Makes sense?

            jothipn Jothi Padmanabhan added a comment - Sorry, let me try to explain it better. Let us assume that the replication factor is 3 and we have one block. GetFileBlockLocation would return in BlockLocations [0] .hosts {h1, h2, h3} and in BlockLocations [0] .racks {r1,r1,r2} where h1 and h2 are in r1 and h3 is in r2. The topologyPaths for the different hosts are determined by their relative positions in their arrays. The topologyPath for h1, which is in index 0 of hosts array, is r1 (index 0 in the racks array). An alternative could have just been to return /r1/h1, /r1/h2 and /r2/h3 in the hosts. That way, if somebody wants both the host and rack information, they do not need to construct it by reading the same index in two arrays. Makes sense?

            Got it. Actually the queston you are raising is whether BlockLocation.getTopologyPaths() should return /r1 or should it return /r1/h1. I would prefer that this new method return /r1/h1. If somebody wants only the hostname, they can continue to use the existing method BlockLocation.getHostName().

            dhruba Dhruba Borthakur added a comment - Got it. Actually the queston you are raising is whether BlockLocation.getTopologyPaths() should return /r1 or should it return /r1/h1. I would prefer that this new method return /r1/h1. If somebody wants only the hostname, they can continue to use the existing method BlockLocation.getHostName().

            BlockLocation.getTopologyPath() returns the full network location of each of the hosts. Th last component of each path is the host:port.

            dhruba Dhruba Borthakur added a comment - BlockLocation.getTopologyPath() returns the full network location of each of the hosts. Th last component of each path is the host:port.

            GetTopologyPaths returning /r1/h1 looks good to me as well.

            jothipn Jothi Padmanabhan added a comment - GetTopologyPaths returning /r1/h1 looks good to me as well.

            Patch looks good. Is the

             System.out.println("XXX" +  topologyPaths[j]); 

            in TestReplication.java intentional or did it just slip in

            jothipn Jothi Padmanabhan added a comment - Patch looks good. Is the System .out.println( "XXX" + topologyPaths[j]); in TestReplication.java intentional or did it just slip in

            Thanks for catching the debug print. I have been making this mistake too often in the last few days

            dhruba Dhruba Borthakur added a comment - Thanks for catching the debug print. I have been making this mistake too often in the last few days
            hadoopqa Hadoop QA added a comment -

            +1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12393321/dfsRackLocation3.patch
            against trunk revision 711350.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 3 new or modified tests.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 findbugs. The patch does not introduce any new Findbugs warnings.

            +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

            +1 core tests. The patch passed core unit tests.

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/testReport/
            Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
            Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/checkstyle-errors.html
            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12393321/dfsRackLocation3.patch against trunk revision 711350. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3528/console This message is automatically generated.

            +1. Looks good.

            jothipn Jothi Padmanabhan added a comment - +1. Looks good.

            GetFileBlockLocations returns the NetworkTopology information of the machines on where the blocks reside.

            dhruba Dhruba Borthakur added a comment - GetFileBlockLocations returns the NetworkTopology information of the machines on where the blocks reside.

            I just committed this.

            dhruba Dhruba Borthakur added a comment - I just committed this.

            Edit release note for publication.

            chansler Robert Chansler added a comment - Edit release note for publication.

            People

              dhruba Dhruba Borthakur
              dhruba Dhruba Borthakur
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: