Description
We have found an issue with Pig 0.8 and Pig 0.7 when multiquery optimization is used: it produces more part files than required. Note that the GROUP ALL is only a dummy operation in this case.
record002 = LOAD 'samplepig001.in' AS (id:chararray, num:int);
f_records002 = FILTER record002 BY num != 50000;

group01 = GROUP f_records002 ALL PARALLEL 1;
STORE group01 INTO 'pig_out_direc_SET1';

set2 = FILTER f_records002 BY num != 200002;
set2_Group = GROUP set2 ALL PARALLEL 1;
STORE set2 INTO 'pig_out_direc_SET2';

set3 = FILTER f_records002 BY num != 100001;
set3_Group = GROUP set3 BY id PARALLEL 40;
--set3_Rec4 = FILTER set3_Group BY num != 5000000;
STORE set3_Group INTO 'pig_out_direc_SET3';
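For reference, the script assumes samplepig001.in is a tab-delimited file of (id, num) pairs. A few hypothetical input lines (not the actual test data) would look like:

id001	100
id002	50000
id003	200002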
When run with Pig 0.8, this produces the following output. Note that pig_out_direc_SET1, which stores the result of a GROUP ALL with PARALLEL 1 and should therefore contain a single part file, instead contains 40:
$ hadoop fs -ls /user/viraj/pig_out_direc_SET1
Found 40 items
-rw------- 3 viraj users 0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00000
...
-rw------- 3 viraj users 0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00039

$ hadoop fs -ls /user/viraj/pig_out_direc_SET2
Found 1 items
-rw------- 3 viraj users 110 2010-11-13 02:08 /user/viraj/pig_out_direc_SET2/part-m-00000

$ hadoop fs -ls /user/viraj/pig_out_direc_SET3
Found 40 items
-rw------- 3 viraj users 0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET3/part-r-00000
...
-rw------- 3 viraj users 0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET3/part-r-00039
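As a point of comparison (not verified here), multiquery optimization can be turned off with Pig's -M / -no_multiquery flag, which should run each STORE as a separate job and honor the PARALLEL 1 hint on group01. The script name below is hypothetical:

$ pig -no_multiquery samplepig001.pig

With the optimization disabled, pig_out_direc_SET1 would be expected to contain a single part file rather than 40; the 40 files presumably come from the merged multiquery job running with the PARALLEL 40 setting of the SET3 branch.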
Viraj