  1. Hadoop Common
  2. HADOOP-3341

make key-value separators in hadoop streaming fully configurable

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      By default, hadoop streaming uses TAB as the separator in all places. However, in some environments users may want to use customized separators (e.g., ^A = \u0001).

      The separator logic in hadoop streaming is very convoluted. Here is a brief summary:

      InputFormat {
      KeyValueLineRecordReader.java:59:
      S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
      }

      Mapper {
      PipeMapper.java:88:
      S2: clientOut_.write('\t');

      Call mapper process

      PipeMapRed.java:124:
      S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
      PipeMapRed.java:128:
      this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
      }

      Reducer {
      PipeReducer.java:78:
      S4: clientOut_.write('\t');

      Call reducer process

      PipeMapRed.java:125:
      S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
      PipeMapRed.java:129:
      this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
      }

      OutputFormat {
      TextOutputFormat.java:112:
      S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
      }

      Short-cuts:
      1. When "TextInputFormat" is used, S1 and S2 are not used at all. Lines are fed directly to the mapper as the value part of the key-value pair; the keys, which are offsets, are ignored.
      2. For jobs with no reducers, the "Reducer" step is skipped.
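
      The effect of S3 and S5 together with the stream.num.*.output.key.fields settings can be sketched as follows. This is a simplified illustration, not the actual PipeMapRed code: the class and method names are hypothetical, and it assumes that a line with fewer separators than requested is treated entirely as the key with an empty value.

```java
public class KeyValueSplitSketch {

    /**
     * Splits an output line into {key, value} at the numKeyFields-th
     * occurrence of sep. Assumption: if the line contains fewer separators
     * than that, the whole line becomes the key and the value is empty.
     */
    static String[] splitKeyValue(String line, String sep, int numKeyFields) {
        if (numKeyFields < 1 || sep.isEmpty()) {
            throw new IllegalArgumentException("need numKeyFields >= 1 and a non-empty separator");
        }
        int pos = -1;   // start of the most recently found separator
        int from = 0;   // where the next search begins
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(sep, from);
            if (pos < 0) {
                return new String[] { line, "" };
            }
            from = pos + sep.length();
        }
        return new String[] { line.substring(0, pos), line.substring(from) };
    }

    public static void main(String[] args) {
        // Two key fields, ^A as the separator.
        String[] kv = splitKeyValue("a\u0001b\u0001c", "\u0001", 2);
        System.out.println(kv[0].length() + " " + kv[1]);
    }
}
```

      With a separator of "\t" and one key field, this reproduces the default TAB behaviour described above.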

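      S3 and S5 are already configurable, but S2 and S4 are hard-coded TAB writes in PipeMapper and PipeReducer. We need to make S2 and S4 configurable, possibly under the following names for consistency: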
      stream.map.input.field.separator
      stream.reduce.input.field.separator
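
      A minimal sketch of what making such a write configurable could look like. It uses a plain Map as a stand-in for the job configuration so that it runs without Hadoop on the classpath; the class and helper names are hypothetical, and the property name is the one proposed above rather than an existing setting.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class InputSeparatorSketch {

    // Proposed property name from this issue (an assumption, not yet an existing key).
    static final String MAP_INPUT_SEP_KEY = "stream.map.input.field.separator";

    /** Looks up the separator, falling back to TAB as the current code does. */
    static byte[] separatorBytes(Map<String, String> conf, String key) {
        return conf.getOrDefault(key, "\t").getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes one record as key + separator + value + newline. */
    static byte[] encodeRecord(String key, String value, byte[] sep) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        buf.write(k, 0, k.length);
        buf.write(sep, 0, sep.length);   // replaces the hard-coded write('\t')
        buf.write(v, 0, v.length);
        buf.write('\n');
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(MAP_INPUT_SEP_KEY, "\u0001");
        byte[] record = encodeRecord("k", "v", separatorBytes(conf, MAP_INPUT_SEP_KEY));
        System.out.println(record.length + " bytes written");
    }
}
```

      Reading the separator once and reusing the byte array avoids re-encoding it for every record.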

      Then, by specifying: -jobconf key.value.separator.in.input.line=^A -jobconf stream.map.input.field.separator=^A -jobconf stream.map.output.field.separator=^A -jobconf stream.reduce.input.field.separator=^A -jobconf stream.reduce.output.field.separator=^A -jobconf mapred.textoutputformat.separator=^A, we will be able to use ^A instead of TAB in every place.

      Maybe hadoop streaming can also provide a single option to override these 6 options.
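
      One way such a single option could work is to expand one separator value into all six settings. The sketch below is only an illustration of that idea: the class and method names are hypothetical, and the two *.input.field.separator keys are the names proposed in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeparatorOptionsSketch {

    // The six separator settings discussed in this issue, in pipeline order.
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",      // proposed name
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",   // proposed name
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator"
    };

    /** Expands a single separator into a setting for every key. */
    static Map<String, String> expand(String separator) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String key : SEPARATOR_KEYS) {
            settings.put(key, separator);
        }
        return settings;
    }

    /** Renders the settings as a string of -jobconf arguments. */
    static String toJobconfArgs(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-jobconf ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ^A is the control character \u0001.
        System.out.println(toJobconfArgs(expand("\u0001")));
    }
}
```

      Individual -jobconf settings could still be applied afterwards to override the separator at a single stage.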

      Attachments

        1. 3341-1.patch
          3 kB
          Zheng Shao
        2. 3341-2.patch
          6 kB
          Zheng Shao
        3. 3341-3.patch
          10 kB
          Zheng Shao
        4. 3341-4.patch
          15 kB
          Zheng Shao
        5. 3341-5.patch
          22 kB
          Zheng Shao

            People

              zshao Zheng Shao
              zshao Zheng Shao