Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Is there a way to configure the job so that messages from different Kafka topics get pulled and merged into one single directory for all topics, instead of one directory per topic?
For example, I now have a job that pulls data from Kafka, partitions it hourly, and writes it to HDFS. The directories the job creates in HDFS look like:
```
/gobblin/output-data/topics1/2016/02/01/...
/gobblin/output-data/topics2/2016/02/01/...
/gobblin/output-data/topics3/2016/02/01/...
```
As you can see, the data is partitioned by topic first and then by date, but what I really want is:
```
/gobblin/output-data/2016/02/01/...
```
where data from different topics is merged together and partitioned by date.
Is there an easy way to do this? Thanks!
Github Url : https://github.com/linkedin/gobblin/issues/769
Github Reporter : hegu8
Github Created At : 2016-02-26T20:21:16Z
Github Updated At : 2016-03-04T17:22:49Z
Comments
stakiar wrote on 2016-03-02T04:59:01Z : hey @hegu8 so is your desired format as follows:
```
/gobblin/output-data/2016/02/01/topic1
/gobblin/output-data/2016/02/01/topic2
/gobblin/output-data/2016/02/01/topic3
```
I'm not sure this is currently possible, I will let @zliu41 confirm though.
Github Url : https://github.com/linkedin/gobblin/issues/769#issuecomment-191062237
zliu41 wrote on 2016-03-02T06:07:04Z : @hegu8 currently it's not possible to put data from all topics together. To do so you'll need to modify `WriterUtils.getWriterFilePath` and add a 3rd option besides `TABLENAME` and `default`, for example `case EMPTY: return new Path(.);`. Then in your job config, set `writer.file.path.type=empty`. That will work.
Github Url : https://github.com/linkedin/gobblin/issues/769#issuecomment-191081515
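To illustrate the idea in the last comment, here is a minimal, self-contained sketch of how an extra `EMPTY` path type could collapse the per-topic directory. It is not Gobblin's actual code: the class `WriterPathSketch`, the enum `PathType`, the helper methods, the plain `Map` config, and the `default/` layout are all hypothetical stand-ins, and strings are used in place of Hadoop `Path` objects. Only the property name `writer.file.path.type` and the `empty` value come from the comment above.

```java
import java.util.Locale;
import java.util.Map;

final class WriterPathSketch {
    // Hypothetical stand-in for the writer path-type options; EMPTY is the proposed third option.
    enum PathType { TABLENAME, EMPTY, DEFAULT }

    // Reads writer.file.path.type from the job config, case-insensitively.
    static PathType pathType(Map<String, String> config) {
        String raw = config.getOrDefault("writer.file.path.type", "default");
        return PathType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    // Returns the segment placed between the output root and the date partition.
    static String writerFilePath(Map<String, String> config, String topic) {
        switch (pathType(config)) {
            case TABLENAME:
                return topic;              // one directory per topic
            case EMPTY:
                return "";                 // no per-topic segment: all topics share a directory
            default:
                return "default/" + topic; // placeholder layout, not Gobblin's real default
        }
    }

    // Joins root, optional writer segment, and date partition into one directory path.
    static String outputDir(String root, Map<String, String> config, String topic, String date) {
        String segment = writerFilePath(config, topic);
        return segment.isEmpty() ? root + "/" + date : root + "/" + segment + "/" + date;
    }

    public static void main(String[] args) {
        Map<String, String> config = Map.of("writer.file.path.type", "empty");
        for (String topic : new String[] {"topics1", "topics2", "topics3"}) {
            // Every topic resolves to the same date-partitioned directory.
            System.out.println(outputDir("/gobblin/output-data", config, topic, "2016/02/01"));
        }
    }
}
```

With `writer.file.path.type=tablename` the same sketch yields the per-topic layout from the description, so the new option only changes behavior for jobs that opt in.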