Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
In streaming, the map command usually expects to receive its input uninterpreted, just as it is stored in DFS.
However, the split (the beginning and the end of the portion of data that goes to a single map task) is often important and cannot be "any line break".
Often the input consists of multi-line documents, e.g. in XML.
There should be a way to specify a pattern that separates logical records.
The existing "Streaming XML record reader" partly provides this functionality. However, it is accepted that "Streaming XML" is a hack and needs to be replaced.
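To make the request concrete, here is a minimal sketch in Python of what pattern-delimited records with split boundaries could look like. It is not Hadoop code: `record_spans` and `records_in_split` are hypothetical names, and the ownership rule (a record belongs to the split in which it starts) is an assumption borrowed from how line-oriented readers usually treat split edges. For clarity it scans the whole buffer; a real reader would seek to the split start instead.

```python
import re
from typing import Iterator, List, Tuple


def record_spans(data: bytes, separator: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of every logical record in data.

    `separator` is a regular expression (as bytes) that is assumed to
    match only non-empty text between two records.
    """
    spans = []
    pos = 0
    for match in re.finditer(separator, data):
        spans.append((pos, match.start()))
        pos = match.end()
    if pos < len(data):
        # Trailing record that is not followed by a separator.
        spans.append((pos, len(data)))
    return spans


def records_in_split(
    data: bytes, split_start: int, split_end: int, separator: bytes
) -> Iterator[bytes]:
    """Yield the records owned by the split [split_start, split_end).

    Assumed rule: a record belongs to the split that contains its first
    byte, even if the record itself runs past split_end.
    """
    for rec_start, rec_end in record_spans(data, separator):
        if split_start <= rec_start < split_end:
            yield data[rec_start:rec_end]


# Hypothetical input: two multi-line XML documents, one after the other.
data = b"<doc>\n<a/>\n</doc>\n<doc>\n<b/>\n</doc>\n"
# Only a newline that directly follows "</doc>" separates records.
separator = rb"(?<=</doc>)\n"

first = list(records_in_split(data, 0, 10, separator))
second = list(records_in_split(data, 10, len(data), separator))
```

The split boundary at byte 10 falls in the middle of the first document, yet each document is delivered whole to exactly one split, which is the behaviour that plain line-break splitting cannot guarantee.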
Issue Links
- is duplicated by
- MAPREDUCE-606 Implement a binary input/output format for Streaming (Resolved)
- MAPREDUCE-5018 Support raw binary data with Hadoop streaming (Patch Available)
- HADOOP-3341 make key-value separators in hadoop streaming fully configurable (Closed)