Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
The LOAD function using HBaseStorage has filter arguments you can use limit the working set for an MR job.
e.g.
blah = LOAD 'hbase://test' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:field1', '-loadKey -gte foo1 -lte foo1');
It would be really great if this could also be applied to filter statements within pig, where a filter statement within pig e.g.
blah2 = FILTER blah by key=foo1; or
blah2 = FILTER blah by key > foo1 and key < foo2;
would actually limit what is retrieved from hbase, so big has a smaller working set to perform MR on.