Apache Tez / TEZ-1260

Allow KeyValueWriter to support writing list of values also

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      TEZ-1228 adds support to IFile for storing K,List<V>. Currently KeyValueWriter only allows writing K,V:

      public void write(Object key, Object value) throws IOException;

      We should add support for

      public void write(Object key, Iterable<Object> values) throws IOException;

      taking advantage of TEZ-1228. In a few cases, Pig unwraps a key and its list of values and writes them as separate K,V pairs; the new method would avoid that overhead. It may even enable us to add something similar to hash based partial aggregation for join, like what we do for groupby.
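
      As a rough illustration of the overhead being described, the sketch below contrasts writing a key once per value with writing it once per group. `SketchWriter` and `CountingWriter` are hypothetical stand-ins that only mirror the two signatures quoted above; they are not the Tez classes, and the counters exist purely to make the difference visible.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class KeyValuesWriterSketch {

  /** Hypothetical stand-in for the two write signatures quoted above; not the Tez class. */
  public interface SketchWriter {
    void write(Object key, Object value) throws IOException;
    void write(Object key, Iterable<Object> values) throws IOException;
  }

  /** Toy writer that only counts how often a key and a value are emitted. */
  public static class CountingWriter implements SketchWriter {
    public int keysWritten = 0;
    public int valuesWritten = 0;

    @Override
    public void write(Object key, Object value) {
      keysWritten++;
      valuesWritten++;
    }

    @Override
    public void write(Object key, Iterable<Object> values) {
      keysWritten++; // the key is emitted once for the whole group
      for (Object ignored : values) {
        valuesWritten++;
      }
    }
  }

  public static void main(String[] args) {
    // Declared as List<Object> on purpose: a List<Integer> is not an
    // Iterable<Object>, so it would bind to write(Object, Object) instead.
    List<Object> values = Arrays.<Object>asList(1, 2, 3);

    CountingWriter unwrapped = new CountingWriter();
    for (Object v : values) {
      unwrapped.write("k", v); // one K,V record per value
    }

    CountingWriter grouped = new CountingWriter();
    grouped.write("k", values); // one K,List<V> record

    System.out.println("unwrapped keys=" + unwrapped.keysWritten + " values=" + unwrapped.valuesWritten);
    System.out.println("grouped keys=" + grouped.keysWritten + " values=" + grouped.valuesWritten);
  }
}
```

      One thing the sketch makes visible: because the overloads differ only in the second parameter, Java resolves the call by the static type of the argument, so callers would need to pass something typed as `Iterable<Object>` to reach the list variant.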

      Attachments

        1. TEZ-1260.2.patch
          10 kB
          Rajesh Balamohan
        2. TEZ-1260.1.patch
          5 kB
          Rajesh Balamohan


          Activity

            bikassaha Bikas Saha added a comment -

            add something similar to hash based partial aggregation for join like what we do for groupby

            Can you please elaborate on that?


            rohini Rohini Palaniswamy added a comment -

            https://wiki.apache.org/pig/PigHashBasedAggInMap - In simpler terms: for groupby, if there is a combine plan, instead of writing K,V to the output collector we keep adding them to a hashmap. If the size hits a limit we aggregate, and if the size still does not shrink we write the contents of the map to the output collector, which does the merge, spill to disk, etc. If we had the option to write K,List<V>, we could collect them in the hashmap as K,List<V> and write them out when we reach memory limits, both for group by (even without a combiner plan) and for join. Since one level of grouping is already done in the hashmap, OnFileSortedOutput would have less sorting to do. If the hashmap could be integrated into OnFileSortedOutput itself, or a wrapper output could do that, it would make things easy for Pig and Hive. But generalizing it might need more thought, as we do a lot of memory calculations based on tuple size (APIs on Tuple) to decide when to spill.
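
            The buffering idea in this comment can be sketched roughly as follows. This is a simplified illustration, not Pig's or Tez's implementation: `HashGroupingBufferSketch` and its `sink` callback are hypothetical names, a plain count of buffered values stands in for the tuple-size memory accounting mentioned above, and the combiner (aggregation) step is left out so that only the K,List<V> grouping and flush are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class HashGroupingBufferSketch {
  // Groups values by key until the (assumed) limit is reached.
  private final Map<Object, List<Object>> groups = new LinkedHashMap<>();
  private final int maxBufferedValues; // stand-in for a real memory limit
  private final BiConsumer<Object, Iterable<Object>> sink; // e.g. a K,List<V> writer
  private int bufferedValues = 0;

  public HashGroupingBufferSketch(int maxBufferedValues,
                                  BiConsumer<Object, Iterable<Object>> sink) {
    this.maxBufferedValues = maxBufferedValues;
    this.sink = sink;
  }

  /** Buffers one K,V pair and flushes everything once the limit is hit. */
  public void add(Object key, Object value) {
    groups.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    bufferedValues++;
    if (bufferedValues >= maxBufferedValues) {
      flush();
    }
  }

  /** Emits each buffered key once with all of its values, then resets the buffer. */
  public void flush() {
    for (Map.Entry<Object, List<Object>> e : groups.entrySet()) {
      sink.accept(e.getKey(), e.getValue());
    }
    groups.clear();
    bufferedValues = 0;
  }

  public static void main(String[] args) {
    List<String> out = new ArrayList<>();
    HashGroupingBufferSketch buf =
        new HashGroupingBufferSketch(4, (k, vs) -> out.add(k + "=" + vs));

    buf.add("a", 1);
    buf.add("b", 2);
    buf.add("a", 3);
    buf.add("a", 4); // fourth value reaches the limit and triggers a flush
    buf.add("b", 5);
    buf.flush();     // final flush for whatever is still buffered

    out.forEach(System.out::println);
    // prints:
    // a=[1, 3, 4]
    // b=[2]
    // b=[5]
  }
}
```

            Note that the same key can still be emitted more than once across flushes (as "b" is here), which is why a downstream sort or merge is still needed; the buffer only reduces how much of that work remains.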

            gopalv Gopal Vijayaraghavan added a comment -

            rohini: the API is something we can work with before the 0.5.0 release, and then iterate towards a better performing sorter in the 0.5.1 release.

            In that context, can you try running your GBY queries with "tez.runtime.sort.threads=2"?

            rohini Rohini Palaniswamy added a comment -

            rajesh.balamohan,
            Can you do this for all existing KeyValueWriter implementations, like UnorderedPartitionedKVWriter, OnFileUnorderedKVOutput's writer, etc.?

            rohini Rohini Palaniswamy added a comment -

            gopalv,
            Sure, will try out the Pipelined sorter. Let me know if you need me to try some other cases with it.

            gopalv Gopal Vijayaraghavan added a comment -

            The Pipelined sorter is the only one that currently enables IFile multi-kv mode dynamically (i.e., from RawComparator equality during sorts).

            It was originally written as a fast path for terasort-like cases, so if this pans out, I will add a custom sort impl for key-grouping without full comparison sorting for group-by cases.

            rajesh.balamohan Rajesh Balamohan added a comment -

            Added changes for BaseUnorderedPartitionedKVWriter, FileBasedKVWriter, OnFileUnorderedKVOutput.

            rajesh.balamohan Rajesh Balamohan added a comment -

            Thanks rohini and gopalv. Committed to master.

            commit 333d64434c5523d8885245e66974bc151d6d9f6a
            Author: Rajesh Balamohan <rbalamohan@apache.org>
            Date: Wed Jul 16 08:17:31 2014 +0530
            TEZ-1260. Allow KeyValueWriter to support writing list of values

            bikassaha Bikas Saha added a comment -

            Bulk close for jiras fixed in 0.5.0.


            People

              rajesh.balamohan Rajesh Balamohan
              rohini Rohini Palaniswamy
              Votes: 0
              Watchers: 4
