Details
- Type: New Feature
- Status: Resolved
- Priority: Trivial
- Resolution: Duplicate
- Affects Version: 0.9.1u1
- Fix Version: None
Description
I have written code that writes to HDFS in the style of:
roll(1000) [ escapedCustomDfs("hdfs://namenode/flume/file-%{rolltag}") ]
but writing each event body as the value in a SequenceFile (with NullWritable as the key).
As a possible scenario, there would be a client sending Flume events to an rpcSource, where the body of each event is a Thrift object. I have a sink that opens/closes a SequenceFile every X minutes, writes the body of each event (a Thrift object converted to bytes) as a BytesWritable value in the SequenceFile, and writes NullWritable as the key.
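For illustration, here is a minimal sketch (not the attached plugin code) of that write path using Hadoop's SequenceFile API: one file per roll period, NullWritable keys, and each event body wrapped in a BytesWritable value. The output path is an assumption; the real sink derives it from the escaped pattern and rolltag.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class SeqfileWriteSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical rolled output path; the sink builds this from the pattern.
        Path path = new Path("hdfs://namenode/flume/file-00000000");
        FileSystem fs = path.getFileSystem(conf);

        // Opened when a roll starts, closed when the roll period elapses.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, NullWritable.class, BytesWritable.class);
        try {
            // In the sink this would be event.getBody(): the serialized Thrift object.
            byte[] body = new byte[0];
            writer.append(NullWritable.get(), new BytesWritable(body));
        } finally {
            writer.close();
        }
    }
}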
The purpose of this is that these SequenceFiles can be used as input to a MapReduce job. Each BytesWritable value can easily be deserialized in the map phase into a Thrift object (the one initially generated by the client and set as the event body).
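As an illustration of that map phase, the following is a minimal sketch, assuming the Thrift-generated TObject class from the attached tests and a binary-protocol TDeserializer; the output key (the object's string form) is just a placeholder.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;

public class TObjectMapper extends Mapper<NullWritable, BytesWritable, Text, NullWritable> {
    private final TDeserializer deserializer = new TDeserializer();

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array may be padded, so copy only getLength() bytes.
        byte[] bytes = Arrays.copyOf(value.getBytes(), value.getLength());
        TObject obj = new TObject(); // Thrift-generated class from the test zip
        try {
            deserializer.deserialize(obj, bytes);
        } catch (TException e) {
            throw new IOException("Could not deserialize event body", e);
        }
        context.write(new Text(obj.toString()), NullWritable.get());
    }
}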
The plugins (placed in the plugins folder of the zip) are:
com.cloudera.flume.handlers.hdfs.EscapedThriftSeqfileDfsSink
com.cloudera.flume.handlers.hdfs.ThriftSeqfileDfsSink
An example of node configuration:
node.name : tSource(38575) | roll(60000) { escapedThriftSeqfileDfsSink("hdfs://host.local/user/flume/file-%{rolltag}") }
The zip also contains a couple of tests:
1. A PHP client that generates Thrift objects and sends them via RPC to the source (the example objects are instances of the TObject class, generated with Thrift). This can be found in /test/php_rpc_writer.
2. A Java app that reads the SequenceFile generated by Flume. The file has to be downloaded manually from HDFS, and the path where it is placed must be specified in the app. This app has the same TObject class as the PHP client, so it can deserialize each BytesWritable from the SequenceFile into the proper Thrift object (a rough sketch of such a reader follows the note below).
*The examples contain hardcoded paths, which should be set appropriately.
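For reference, a minimal sketch of what such a reader might look like, again assuming the TObject class from the zip; the local path is a hypothetical stand-in for the manually downloaded file.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.thrift.TDeserializer;

public class SeqfileReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        // Hypothetical local copy of the rolled file downloaded from HDFS.
        Path path = new Path("/tmp/file-00000000");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        TDeserializer deserializer = new TDeserializer();
        NullWritable key = NullWritable.get();
        BytesWritable value = new BytesWritable();
        try {
            while (reader.next(key, value)) {
                TObject obj = new TObject();
                // Copy only the valid portion of the BytesWritable buffer.
                deserializer.deserialize(obj, Arrays.copyOf(value.getBytes(), value.getLength()));
                System.out.println(obj);
            }
        } finally {
            reader.close();
        }
    }
}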
Attachments
Issue Links
- duplicates FLUME-635: Allow the compression codec to be specified in SequenceFiles (Closed)