Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- None
- None
Description
I'm using `HiveSerDeConverter` in my project to convert Avro to ORC, as described here: [gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data](http://gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data/).
The problem is duplicated records. I have that fancy `fork.record.queue.capacity=1` in my job file, but it seems like this parameter only partially works. Let me explain in a bit more depth.
What my project does: connect to SFTP, read the file list, extract all records from a file inside a ZIP archive, convert them to Avro and then to ORC exactly as in the link above, and write to the output.
I have an SFTP server on my localhost for testing purposes. The file to be read contains 7 data records. When I look at the output ORC file, it contains:
- record 3
- record 3
- record 3
- record 4
- record 5
- record 6
- record 7
So the first three records are all replaced by the THIRD record, and after that the output is normal. I tried hard to understand this behavior, so I reimplemented the classes `HiveWritableHdfsDataWriterBuilder` and `HiveWritableHdfsDataWriter` in my own package and added some quick-and-dirty debug logging to the write method:
```
public void write(Writable record) throws IOException
```
I also added some debug logging in my converter class and in the DataWriterBuilder:
```
public DataWriter<Writable> build() throws IOException {
    log.info("Data writer Builder start");
    ...
```
Let's look at the logs. `Debug message:` lines are from my converter class; `MY WRITABLE` lines are from the DataWriter's `write` method. What I see here: 3 debug messages appear before the DataWriterBuilder class is even called. That means 3 records were processed before the writer started working, so `fork.record.queue.capacity=1` doesn't really take effect here, and that's why we get 3 identical records (a toy sketch of this suspicion follows the log excerpt). I may still be ridiculously wrong, so I need help :)
```
gobblin.source.extractor.filebased.FileBasedExtractor: Will start downloading file: /home/remoteuser/Files/01.zip:::file.csv
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
org.apache.hadoop.hive.serde2.avro.AvroDeserializer: Adding new valid RRID :7c5adecd:155b9394a0f:-8000
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
TSStatsConverter: ------------------------
TSStatsConverter: Data writer Builder start
TSStatsConverter: Preconditions ran succesfully
TSStatsConverter: WRITER.WRITABLE.CLASS=org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
TSStatsConverter: WRITER.OUTPUT.FORMAT.CLASS=org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012351;123456789012351;366.00;244.00;122.00
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012345;123456789012345;363.00;242.00;121.00
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012346;123456789012346;363.00;242.00;121.00
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012348;123456789012348;363.00;242.00;121.00
TSStatsConverter: MY WRITABLE: org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@22012154
gobblin.source.extractor.extract.sftp.SftpFsHelper: Transfer finished 1 src: /home/remoteuser/Files/01.zip dest: ?? in 0 at 0.00
gobblin.source.extractor.filebased.FileBasedExtractor: Unable to get file size. Will skip increment to bytes counter Failed to get size for file at path /home/remoteuser/Files/01.zip:::file.csv due to error No such file
gobblin.source.extractor.filebased.FileBasedExtractor: Finished reading records from all files
gobblin.runtime.Task: Extracted 7 data records
```
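To make the suspicion above concrete, here is a minimal, self-contained toy sketch, not Gobblin code and with all names made up: if the converter keeps mutating one shared record object while references to it are already buffered ahead of the writer, every buffered slot ends up showing the latest record's data.
```
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Toy illustration only (not Gobblin code): buffering references to a single
 * reused mutable record makes earlier records collapse into the latest one.
 */
public class ReusedRecordDemo {
    // Stand-in for a mutable Writable such as OrcSerde$OrcSerdeRow.
    static final class MutableRow {
        String value;
        @Override
        public String toString() { return value; }
    }

    public static void main(String[] args) {
        Deque<MutableRow> queue = new ArrayDeque<>(); // records buffered before the writer is built
        MutableRow reused = new MutableRow();         // converter reuses one object for every record

        // "Converter": three records are converted before the writer starts.
        for (int i = 1; i <= 3; i++) {
            reused.value = "record " + i;             // overwrite the shared object in place
            queue.add(reused);                        // enqueue the same reference again
        }

        // "Writer": drains the queue later and sees only the last value, three times.
        while (!queue.isEmpty()) {
            System.out.println("writing: " + queue.poll()); // prints "record 3" three times
        }
    }
}
```
Note that the sketch deliberately buffers three records even though `fork.record.queue.capacity=1` is set in the job, mirroring the log above where three `Debug message:` lines appear before the writer builder runs.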
Github Url : https://github.com/linkedin/gobblin/issues/1086
Github Reporter : skyshineb
Github Created At : 2016-07-05T07:34:05Z
Github Updated At : 2016-10-29T06:39:15Z
Comments
stakiar wrote on 2016-07-09T03:12:27Z : In which file did you set `fork.record.queue.capacity=1`? Can you try setting it in both the `.job` file and `.properties` file?
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-231511677
skyshineb wrote on 2016-07-11T04:11:45Z : I set it in the `.pull` file where my job config lives. Putting it in the `.properties` file didn't help. I tested it earlier by logging this property in different classes, and it looked like it was being picked up fine.
@sahilTakiar, I found some new info. I created a new converter similar to the original one, but it uses a new `SerDe` object for every record. With that change there are no duplicated Writable objects anymore, but the problem is still here: the first three records are all the third record. Maybe this is a problem with the writer? Logs:
```
FileBasedExtractor: Will start downloading file: /home/remoteuser/Files/Terminal_Tethering01.zip:::Export Tethering Subscribers.csv
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
AvroDeserializer: Adding new valid RRID :-4b494a8a:155c4244e69:-8000
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@61bbb8a4
InstrumentedConverter: CONVERTED HASH:::1639692452
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@8e53cf0
InstrumentedConverter: CONVERTED HASH:::149241072
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@1169d3a
InstrumentedConverter: CONVERTED HASH:::18259258
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012351;123456789012351;366.00;244.00;122.00
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@35b0050
InstrumentedConverter: CONVERTED HASH:::56295504
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012345;123456789012345;363.00;242.00;121.00
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@75b734cb
InstrumentedConverter: CONVERTED HASH:::1974940875
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012346;123456789012346;363.00;242.00;121.00
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@212341d5
InstrumentedConverter: CONVERTED HASH:::555958741
TSStatsConverter: Debug message:: 2016-05-19 00:00:00;123456789012348;123456789012348;363.00;242.00;121.00
InstrumentedConverter: CONVERTED RECORD:::org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow@7f7305e8
InstrumentedConverter: CONVERTED HASH:::2138244584
SftpFsHelper: Transfer finished 1 src: /home/remoteuser/Files/Terminal_Tethering01.zip dest: ?? in 0 at 0.00
FsDataWriterBuilder: ------------------------
FsDataWriterBuilder: Data writer Builder start
FileBasedExtractor: Finished reading records from all files
gobblin.runtime.Task: Extracted 7 data records
gobblin.runtime.Task: Row quality checker finished with results:
FsDataWriter: MY WRITABLE HASH: 1639692452
FsDataWriter: MY WRITABLE HASH: 149241072
FsDataWriter: MY WRITABLE HASH: 18259258
FsDataWriter: MY WRITABLE HASH: 56295504
FsDataWriter: MY WRITABLE HASH: 1974940875
FsDataWriter: MY WRITABLE HASH: 555958741
FsDataWriter: MY WRITABLE HASH: 2138244584
gobblin.publisher.TaskPublisher: All components finished successfully, checking quality tests
gobblin.publisher.TaskPublisher: All required test passed for this task passed.
gobblin.publisher.TaskPublisher: Cleanup for task publisher executed successfully.
gobblin.runtime.Fork-0: Committing data for fork 0 of task task_cmd_pre_post_1467874476697_0
```
More debug info now :) In the destination file, these input rows:
```
2016-05-19 00:00:00;123456789012347;123456789012347;7317.00;4878.00;2439.00
2016-05-19 00:00:00;123456789012352;123456789012352;732.00;488.00;244.00
2016-05-19 00:00:00;123456789012349;123456789012349;366.00;244.00;122.00
```
look like:
```
2016-05-19 00:00:00 123456789012349 123456789012349 366.00 244.00 122.00
2016-05-19 00:00:00 123456789012349 123456789012349 366.00 244.00 122.00
2016-05-19 00:00:00 123456789012349 123456789012349 366.00 244.00 122.00
```
I tested it with files of different sizes and contents, but these first three records are still broken. I will make a new Gobblin build this week and create clean job files and so on. Maybe something will change.
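One plausible reading of this second experiment, offered as an assumption and not verified against the Hive or Gobblin sources: even though each record now gets its own `OrcSerdeRow` wrapper, as the distinct `CONVERTED HASH` values show, the wrappers may all point at one reused underlying row buffer filled in by the deserializer, so the buffered records still collapse into the latest row. A hypothetical, self-contained sketch of that effect:
```
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Toy illustration only (an assumption, not Hive/Gobblin code): fresh wrapper
 * objects per record do not help if they all share one mutable row buffer.
 */
public class SharedRowBufferDemo {
    // Stand-in for OrcSerdeRow: a new wrapper is created for every record...
    static final class RowWrapper {
        final List<Object> row;
        RowWrapper(List<Object> row) { this.row = row; }
        @Override
        public String toString() { return row.toString(); }
    }

    public static void main(String[] args) {
        List<Object> reusedRow = new ArrayList<>();    // ...but the underlying row buffer is shared
        List<RowWrapper> buffered = new ArrayList<>(); // records buffered before the writer runs

        for (String id : Arrays.asList("123456789012347", "123456789012352", "123456789012349")) {
            reusedRow.clear();
            reusedRow.addAll(Arrays.asList("2016-05-19 00:00:00", id)); // deserializer refills the same buffer
            buffered.add(new RowWrapper(reusedRow));                    // distinct wrapper, shared contents
        }

        // Distinct wrapper identities (like the distinct CONVERTED HASH values),
        // yet every wrapper prints the third record's data.
        for (RowWrapper w : buffered) {
            System.out.println("wrapper@" + Integer.toHexString(System.identityHashCode(w)) + " -> " + w);
        }
    }
}
```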
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-231638514
abti wrote on 2016-07-13T11:44:24Z : @kSky7000 Can you please share your pull file, the changes you made to any classes, and some sample data? I will try to reproduce it locally and debug.
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-232331431
skyshineb wrote on 2016-07-14T16:01:09Z : @abti Thank you very much.
https://github.com/kSky7000/cuddly-pancake
It contains the pull file, a ZIP with a data sample, and the sources.
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-232710146
skyshineb wrote on 2016-08-02T10:54:23Z : @abti Really sorry to bother you, but I need help :) Any news about this bug? Did you try to reproduce it?
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-236870993
abti wrote on 2016-08-03T05:14:44Z : Hi Nikolay,
Apologies for dropping the ball on this. I missed revisiting it, and I am traveling right now. I will try to dig into this early next week.
Regards,
Abhishek
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-237141426
skyshineb wrote on 2016-08-08T11:10:34Z : @abti I found another interesting thing. Today I used the example JSON project with the ORC addition to it and got the same result, with the first three records being duplicated. What do you think about that? For Hadoop I'm using Hortonworks HDP 2.2.
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-238206554
jeffwang66 wrote on 2016-08-18T00:32:23Z : I hit the same issue. I used HiveSerDeConverter and HiveWritableHdfsDataWriterBuilder to write records to ORC files. I noticed that some records are duplicated and others are missing, even though I set `fork.record.queue.capacity=1`. Is there any plan to fix this issue? @abti
@kSky7000 Did you find a solution for it?
Thanks
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-240590679
skyshineb wrote on 2016-08-24T15:24:50Z : @jeffwang66 Unfortunately I did not resolve this issue.
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-242104816
ipostanogov wrote on 2016-10-29T06:39:15Z : I have the same issue. `fork.record.queue.capacity=1` does not help.
Github Url : https://github.com/linkedin/gobblin/issues/1086#issuecomment-257074751