Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 0.20.2
- Fix Version/s: None
- Component/s: None
- Environment: Cloudera VM CDH3u4, VMWare, Linux, Java SE 1.6.0_31-b04
Description
I'm trying to simulate running Hadoop on Lustre by configuring it to use the local file system on a single Cloudera VM (CDH3u4).
I can generate the data just fine, but when running the sorting portion of the program I get an error about not being able to find the _partition.lst file, even though it exists in the generated data directory.
Perusing the TeraSort code, I see that the run method creates a Path reference to _partition.lst, qualified with the parent (input) directory:
public int run(String[] args) throws Exception {
  LOG.info("starting");
  JobConf job = (JobConf) getConf();
  Path inputDir = new Path(args[0]);
  inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
  Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
  URI partitionUri = new URI(partitionFile.toString() + "#" +
                             TeraInputFormat.PARTITION_FILENAME);
  TeraInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.setJobName("TeraSort");
  job.setJarByClass(TeraSort.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  job.setInputFormat(TeraInputFormat.class);
  job.setOutputFormat(TeraOutputFormat.class);
  job.setPartitionerClass(TotalOrderPartitioner.class);
  TeraInputFormat.writePartitionFile(job, partitionFile);
  DistributedCache.addCacheFile(partitionUri, job);
  DistributedCache.createSymlink(job);
  job.setInt("dfs.replication", 1);
  TeraOutputFormat.setFinalSync(job, true);
  JobClient.runJob(job);
  LOG.info("done");
  return 0;
}
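For context on why a bare file name can work at all downstream: the "#" fragment on partitionUri names the symlink that the DistributedCache creates in the task's working directory once createSymlink(job) is called; on the local file system that task-side symlink does not appear to be materialized, which is consistent with the error I'm seeing. A self-contained sketch of the fragment convention (all paths are hypothetical):

```java
import java.net.URI;

public class CacheSymlinkDemo {
    public static void main(String[] args) {
        // Hypothetical qualified partition file path, as built in run()
        String partitionFile = "file:/data/terasort-in/_partition.lst";
        // The "#name" fragment tells the DistributedCache what to call the
        // symlink it creates in the task's working directory
        URI partitionUri = URI.create(partitionFile + "#_partition.lst");
        System.out.println(partitionUri.getPath());     // /data/terasort-in/_partition.lst
        System.out.println(partitionUri.getFragment()); // _partition.lst
    }
}
```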
But in the configure method, the Path is created without the parent directory reference:
public void configure(JobConf job) {
  try {
    FileSystem fs = FileSystem.getLocal(job);
    Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
    splitPoints = readPartitions(fs, partFile, job);
    trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
  } catch (IOException ie) {
    throw new IllegalArgumentException("can't read paritions file", ie);
  }
}
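The distinction matters because a relative Path resolves against the task's working directory, not the input directory; on a cluster the DistributedCache symlink created by run() makes _partition.lst appear there, but with the local file system nothing does. A minimal, self-contained sketch of the resolution semantics using java.nio (the /data/terasort-in directory is hypothetical):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PartitionPathDemo {
    public static void main(String[] args) {
        // Hypothetical input directory standing in for the TeraSort data dir
        Path inputDir = Paths.get("/data/terasort-in");
        // run() style: partition file qualified with the parent directory
        Path qualified = inputDir.resolve("_partition.lst");
        // configure() style: a bare relative path, which resolves against
        // the process working directory instead of the input directory
        Path bare = Paths.get("_partition.lst");
        System.out.println(qualified);         // /data/terasort-in/_partition.lst
        System.out.println(bare.isAbsolute()); // false
    }
}
```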
I modified the code as follows, and now the sorting portion of the TeraSort test works using the local file system. I believe the code above is a bug.
public void configure(JobConf job) {
  try {
    FileSystem fs = FileSystem.getLocal(job);
    Path[] inputPaths = TeraInputFormat.getInputPaths(job);
    Path partFile = new Path(inputPaths[0], TeraInputFormat.PARTITION_FILENAME);
    splitPoints = readPartitions(fs, partFile, job);
    trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
  } catch (IOException ie) {
    throw new IllegalArgumentException("can't read paritions file", ie);
  }
}
Attachments
Issue Links
- duplicates: MAPREDUCE-5528 TeraSort fails with "can't read paritions file" - does not read partition file from distributed cache (Status: Open)