
GOBBLIN-98: HiveSerDeConverter: duplicated records when writing ORC, even with queue.capacity=1


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None

    Description

      I'm using HiveSerDeConverter in my project to convert Avro to ORC, just as described here: [gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data](http://gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data/).
      The problem is duplicated records. I have that fancy `fork.record.queue.capacity=1` in my job file, but the parameter only seems to work partially. Let me explain in a bit more depth.
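      For context, the relevant lines in my job file are roughly these (a sketch based on the case study linked above, not the full pull file; the converter/writer class paths may differ between Gobblin versions):

      ```
      # Sketch of the ORC-related job settings (per the case study linked above)
      converter.classes=gobblin.converter.serde.HiveSerDeConverter
      writer.builder.class=gobblin.writer.HiveWritableHdfsDataWriterBuilder
      writer.writable.class=org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
      writer.output.format.class=org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
      fork.record.queue.capacity=1
      ```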

      What my project does: connect to SFTP, read the file list, extract all records from a file inside a ZIP archive, convert them to Avro and then to ORC just as in the link above, and write the output.
      I have an SFTP server on my localhost for testing. The file to be read contains 7 data records. When I look at the output ORC file, it contains:

      • record 3
      • record 3
      • record 3
      • record 4
      • record 5
      • record 6
      • record 7

      So the first three records are replaced by the THIRD record, and after that the output is normal. I tried hard to understand this behavior, so I reimplemented the classes `HiveWritableHdfsDataWriterBuilder` and `HiveWritableHdfsDataWriter` in my package and added some dirty debug info to the write method.

      ```
      public void write(Writable record) throws IOException {
          Preconditions.checkNotNull(record);
          log.info("MY WRITABLE: " + record.toString());
          this.writer.write(record);
          this.count.incrementAndGet();
      }
      ```

      I also added some debug info to my converter class and to the DataWriterBuilder.

      ```
      public DataWriter<Writable> build() throws IOException {
          log.info("Data writer Builder start");
          ...
      ```

      Let's look at the logs. The `Debug message::` lines come from my converter class; `MY WRITABLE` comes from the DataWriter's write method. What I see here: 3 debug messages appear before the DataWriterBuilder class is even called. That means 3 records were processed before the writer started working, so queue.capacity=1 doesn't really take effect here, and that's why we get 3 identical records. I may still be ridiculously wrong, so I need help :)

      ```
      gobblin.source.extractor.filebased.FileBasedExtractor: Will start downloading file: /home/remoteuser/Files/01.zip:::file.csv
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
      org.apache.hadoop.hive.serde2.avro.AvroDeserializer: Adding new valid RRID :7c5adecd:155b9394a0f:-8000
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
      TSStatsConverter: ------------------------
      TSStatsConverter: Data writer Builder start
      TSStatsConverter: Preconditions ran succesfully
      TSStatsConverter: WRITER.WRITABLE.CLASS=org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
      TSStatsConverter: WRITER.OUTPUT.FORMAT.CLASS=org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012351;123456789012351;366.00;244.00;122.00
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012345;123456789012345;363.00;242.00;121.00
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012346;123456789012346;363.00;242.00;121.00
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012348;123456789012348;363.00;242.00;121.00
      TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
      gobblin.source.extractor.extract.sftp.SftpFsHelper: Transfer finished 1 src: /home/remoteuser/Files/01.zip dest: ?? in 0 at 0.00
      gobblin.source.extractor.filebased.FileBasedExtractor: Unable to get file size. Will skip increment to bytes counter Failed to get size for file at path /home/remoteuser/Files/01.zip:::file.csv due to error No such file
      gobblin.source.extractor.filebased.FileBasedExtractor: Finished reading records from all files
      gobblin.runtime.Task: Extracted 7 data records
      ```
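      One more thing I notice in the log above: every `MY WRITABLE` line shows the same object, `OrcSerde$OrcSerdeRow@22012154`, so the writer appears to receive a single reused OrcSerdeRow instance. If records are buffered anywhere as references to that one instance, the buffered entries would all end up showing whatever was written into it last. A tiny stand-alone illustration of that pitfall (plain Java, not Gobblin code):

      ```
      import java.util.ArrayList;
      import java.util.List;

      // Not Gobblin code: shows what happens when a producer reuses one mutable
      // object while the consumer buffers references instead of copies.
      public class ReusedRecordDemo {
          static final class MutableRecord {
              String value;
          }

          public static void main(String[] args) {
              MutableRecord reused = new MutableRecord();      // single instance, reused per record
              List<MutableRecord> buffer = new ArrayList<>();  // stand-in for the fork's record queue

              for (String v : new String[] {"record 1", "record 2", "record 3"}) {
                  reused.value = v;    // the producer overwrites the same instance
                  buffer.add(reused);  // the consumer only keeps a reference to it
              }

              // Prints "record 3" three times, just like the three identical rows above.
              for (MutableRecord r : buffer) {
                  System.out.println(r.value);
              }
          }
      }
      ```

      If that is what is happening, it would explain why the first three rows of my output all contain the third record: three rows were buffered before the writer drained them.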

      Github Url : https://github.com/linkedin/gobblin/issues/1086
      Github Reporter : skyshineb
      Github Created At : 2016-07-05T07:34:05Z
      Github Updated At : 2016-10-29T06:39:15Z

      Comments


      stakiar wrote on 2016-07-09T03:12:27Z : In which file did you set `fork.record.queue.capacity=1`? Can you try setting it in both the `.job` file and `.properties` file?

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-231511677


      skyshineb wrote on 2016-07-11T04:11:45Z : I set it in a `.pull` file, which is where my job config lives. Putting it into the `.properties` file didn't help. I tested this earlier by logging the property in different classes, and it looked like it was picked up correctly.
      @sahilTakiar, I found some new info. I created a new converter similar to the original one, but it uses a new `SerDe` object for every record. With that, there are no duplicated record objects anymore, but the problem is still there: the first three records are still the third record. Maybe this is a problem with the writer? Logs:

      ```
      FileBasedExtractor: Will start downloading file: /home/remoteuser/Files/Terminal_Tethering01.zip:::Export Tethering Subscribers.csv
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
      AvroDeserializer: Adding new valid RRID :-4b494a8a:155c4244e69:-8000
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@61bbb8a4
      InstrumentedConverter: CONVERTED HASH:::1639692452
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@8e53cf0
      InstrumentedConverter: CONVERTED HASH:::149241072
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@1169d3a
      InstrumentedConverter: CONVERTED HASH:::18259258
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012351;123456789012351;366.00;244.00;122.00
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@35b0050
      InstrumentedConverter: CONVERTED HASH:::56295504
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012345;123456789012345;363.00;242.00;121.00
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@75b734cb
      InstrumentedConverter: CONVERTED HASH:::1974940875
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012346;123456789012346;363.00;242.00;121.00
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@212341d5
      InstrumentedConverter: CONVERTED HASH:::555958741
      TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012348;123456789012348;363.00;242.00;121.00
      InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@7f7305e8
      InstrumentedConverter: CONVERTED HASH:::2138244584
      SftpFsHelper: Transfer finished 1 src: /home/remoteuser/Files/Terminal_Tethering01.zip dest: ?? in 0 at 0.00
      FsDataWriterBuilder: ------------------------
      FsDataWriterBuilder: Data writer Builder start
      FileBasedExtractor: Finished reading records from all files
      gobblin.runtime.Task: Extracted 7 data records
      gobblin.runtime.Task: Row quality checker finished with results:
      FsDataWriter: MY WRITABLE HASH: 1639692452
      FsDataWriter: MY WRITABLE HASH: 149241072
      FsDataWriter: MY WRITABLE HASH: 18259258
      FsDataWriter: MY WRITABLE HASH: 56295504
      FsDataWriter: MY WRITABLE HASH: 1974940875
      FsDataWriter: MY WRITABLE HASH: 555958741
      FsDataWriter: MY WRITABLE HASH: 2138244584
      gobblin.publisher.TaskPublisher: All components finished successfully, checking quality tests
      gobblin.publisher.TaskPublisher: All required test passed for this task passed.
      gobblin.publisher.TaskPublisher: Cleanup for task publisher executed successfully.
      gobblin.runtime.Fork-0: Committing data for fork 0 of task task_cmd_pre_post_1467874476697_0
      ```

      More debug info now :) In the destination file, these source rows:

      ```
      2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
      2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
      2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
      ```

      come out as:

      ```
      2016-05-19 00:00:00 123456789012349 123456789012349 366.00 244.00 122.00
      2016-05-19 00:00:00 123456789012349 123456789012349 366.00 244.00 122.00
      2016-05-19 00:00:00 123456789012349 123456789012349 366.00 244.00 122.00
      ```

      I tested it with files of different sizes and contents, but those first three records are still messed up. I will make a new Gobblin build this week and create clean job files and so on. Maybe something will change.
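      For reference, the per-record SerDe idea looks roughly like this (a simplified sketch of the approach, not my exact converter code; creating and initializing an OrcSerde for every record is wasteful, but it guarantees each serialized row is a distinct object):

      ```
      import java.util.Properties;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
      import org.apache.hadoop.hive.serde2.SerDeException;
      import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
      import org.apache.hadoop.io.Writable;

      // Sketch only: serialize each row with a brand-new OrcSerde so the returned
      // Writable is never a shared, reused instance.
      public class PerRecordOrcSerializer {
          private final Configuration conf;
          private final Properties tableProperties;  // must contain "columns" and "columns.types"

          public PerRecordOrcSerializer(Configuration conf, Properties tableProperties) {
              this.conf = conf;
              this.tableProperties = tableProperties;
          }

          public Writable serialize(Object row, ObjectInspector inspector) throws SerDeException {
              OrcSerde serde = new OrcSerde();          // fresh SerDe per record
              serde.initialize(conf, tableProperties);  // expensive, but isolates every record
              return serde.serialize(row, inspector);   // distinct OrcSerdeRow each call
          }
      }
      ```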

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-231638514


      abti wrote on 2016-07-13T11:44:24Z : @kSky7000 Can you please share your pull file, the changes you made to any classes, and some sample data? I will try to reproduce it locally and debug.

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-232331431


      skyshineb wrote on 2016-07-14T16:01:09Z : @abti Thank you very much.
      https://github.com/kSky7000/cuddly-pancake
      Pull file, ZIP with a data sample, and sources.

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-232710146


      skyshineb wrote on 2016-08-02T10:54:23Z : @abti Really sorry to bother you, but I need help :) Any news about this bug? Did you try to reproduce it?

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-236870993


      abti wrote on 2016-08-03T05:14:44Z : Hi Nikolay,

      Apologies for dropping the ball on this. I haven't had a chance to revisit it, and I am
      traveling right now. I will try to dig into this early next week.

      Regards
      Abhishek


      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-237141426


      skyshineb wrote on 2016-08-08T11:10:34Z : @abti I found another interesting thing. Today I used the example JSON project with the ORC addition to it and got the same result, with the first three records being duplicated. What do you think about that? For Hadoop I'm using Hortonworks HDP 2.2.

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-238206554


      jeffwang66 wrote on 2016-08-18T00:32:23Z : I hit the same issue. I used HiveSerDeConverter and HiveWritableHdfsDataWriterBuilder to write records to ORC files. I noticed that some records are duplicated and some others are missing even though I set fork.record.queue.capacity=1. Is there any plan to fix this issue? @abti

      @kSky7000 Did you find the solution for it?

      Thanks

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-240590679


      skyshineb wrote on 2016-08-24T15:24:50Z : @jeffwang66 Unfortunately I did not resolve this issue.

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-242104816


      ipostanogov wrote on 2016-10-29T06:39:15Z : I have the same issue. `fork.record.queue.capacity=1` does not help.

      Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-257074751


          People

            Assignee: Unassigned
            Reporter: Abhishek Tiwari (abti)
            Votes: 0
            Watchers: 3
