Apache Gobblin / GOBBLIN-131

Pulling messages from different Kafka topics and merging them into one directory in HDFS


Details

    Description

      Is there a way to configure the job so that messages from different Kafka topics get pulled and merged into a single directory for all topics, instead of one directory per topic?

      For example, I currently have a job that pulls data from Kafka, partitions it hourly, and writes it to HDFS. The directories the job creates in HDFS look like:

      ```
      /gobblin/output-data/topics1/2016/02/01/...
      /gobblin/output-data/topics2/2016/02/01/...
      /gobblin/output-data/topics3/2016/02/01/...
      ```

      As you can see, the data is partitioned by topic first and then by date, but what I really want is:

      ```
      /gobblin/output-data/2016/02/01/...
      ```

      where data from all topics is combined and partitioned by date only.

      Is there an easy way to do this? Thanks!

      Github Url : https://github.com/linkedin/gobblin/issues/769
      Github Reporter : hegu8
      Github Created At : 2016-02-26T20:21:16Z
      Github Updated At : 2016-03-04T17:22:49Z

      Comments


      stakiar wrote on 2016-03-02T04:59:01Z : Hey @hegu8, is your desired format as follows:

      ```
      /gobblin/output-data/2016/02/01/topic1
      /gobblin/output-data/2016/02/01/topic2
      /gobblin/output-data/2016/02/01/topic3
      ```

      I'm not sure this is currently possible; I will let @zliu41 confirm, though.

      Github Url : https://github.com/linkedin/gobblin/issues/769#issuecomment-191062237


      zliu41 wrote on 2016-03-02T06:07:04Z : @hegu8 currently it's not possible to put data from all topics together. To do so you'll need to modify `WriterUtils.getWriterFilePath` and add a 3rd option besides `TABLENAME` and `default`, for example `case EMPTY: return new Path(.);`. Then in your job config, set `writer.file.path.type=empty`. That will work.

      Github Url : https://github.com/linkedin/gobblin/issues/769#issuecomment-191081515
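
To illustrate the idea in zliu41's comment, here is a minimal, self-contained sketch of how a third path type could collapse the per-topic directory. It is not Gobblin's actual code: the class `WriterPathSketch`, the enum `PathType`, and the helpers `parse`, `writerSubDir`, and `outputDir` are hypothetical names made up for this example, and plain strings stand in for Hadoop `Path` objects. Only the option names `TABLENAME`, `default`, and the proposed `EMPTY` come from the thread.

```java
import java.util.Locale;

public class WriterPathSketch {

    // Hypothetical stand-in for the writer path type setting.
    // TABLENAME and DEFAULT mirror the options named in the thread;
    // EMPTY is the proposed third option.
    enum PathType { TABLENAME, DEFAULT, EMPTY }

    // Parses a config value such as "empty" (as in writer.file.path.type=empty).
    // A missing or blank value falls back to DEFAULT.
    static PathType parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return PathType.DEFAULT;
        }
        return PathType.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Returns the sub-directory placed between the output root and the date partition.
    static String writerSubDir(PathType type, String topic, String defaultSubDir) {
        switch (type) {
            case TABLENAME:
                return topic;          // one directory per topic
            case EMPTY:
                return "";             // no topic directory: all topics share one tree
            default:
                return defaultSubDir;  // whatever the job uses by default (assumed here)
        }
    }

    // Joins root, optional sub-directory, and date partition into one path string.
    static String outputDir(String root, String subDir, String datePartition) {
        StringBuilder sb = new StringBuilder(root);
        if (!subDir.isEmpty()) {
            sb.append('/').append(subDir);
        }
        return sb.append('/').append(datePartition).toString();
    }

    public static void main(String[] args) {
        String root = "/gobblin/output-data";
        String date = "2016/02/01";

        // Per-topic layout, as in the original question.
        System.out.println(outputDir(root, writerSubDir(PathType.TABLENAME, "topics1", "part"), date));

        // Merged layout with the proposed EMPTY option.
        System.out.println(outputDir(root, writerSubDir(parse("empty"), "topics1", "part"), date));
    }
}
```

With `TABLENAME` the sketch yields `/gobblin/output-data/topics1/2016/02/01`, and with `EMPTY` it yields `/gobblin/output-data/2016/02/01`, which is the merged layout asked for in the description. In a real change the new case would go into `WriterUtils.getWriterFilePath` as zliu41 describes, and the exact return value there should be checked against the Gobblin source.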


          People

            shirshanka Shirshanka Das
            abti Abhishek Tiwari