Description
Currently, the functions in the HiveMetaStore API that handle multiple partitions do so using List<Partition>. E.g.
public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
public int add_partitions(List<Partition> new_parts);
Partition objects are fairly heavyweight, since each Partition carries its own copy of a StorageDescriptor, partition-values, etc. Tables with tens of thousands of partitions take so long to have their partitions listed that the client times out with the default hive.metastore.client.socket.timeout. There is the additional expense of serializing and deserializing metadata for large sets of partitions, both in time and in heap space. Reducing the Thrift traffic should help in this regard.
In a date-partitioned table, all sub-partitions for a particular date are likely (though not guaranteed) to have:
- The same base directory (e.g. /feeds/search/20140601/)
- Similar directory structure (e.g. /feeds/search/20140601/[US,UK,IN])
- The same SerDe/StorageHandler/IOFormat classes
- The same sorting/bucketing/SkewInfo settings
In this “most likely” scenario (henceforth termed “normal”), it’s possible to represent the partition-list (for a date) in a more condensed form: a list of LighterPartition instances, all sharing a common StorageDescriptor whose location points to the root directory.
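As a rough illustration, the condensed form might look something like the sketch below. (LighterPartition and SharedSDPartitionSpec are hypothetical names used only to convey the idea; they are not part of the metastore API.)

import java.util.List;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

// Hypothetical sketch: many lightweight partitions sharing one StorageDescriptor.
class LighterPartition {
  List<String> values;       // e.g. ["20140601", "US"]
  String relativeLocation;   // e.g. "US", relative to the shared root location
}

class SharedSDPartitionSpec {
  StorageDescriptor sharedSd;        // location points to the date root, e.g. /feeds/search/20140601/
  List<LighterPartition> partitions; // one entry per sub-partition
}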
We can go one better for the add_partitions() case: When adding all partitions for a given date, the “normal” case affords us the ability to specify the top-level date-directory, where sub-partitions can be inferred from the HDFS directory-path.
These extensions are hard to introduce at the metastore-level, since partition-functions explicitly specify List<Partition> arguments. I wonder if a PartitionSpec interface might help:
public PartitionSpec listPartitions(String db_name, String tbl_name, short max_parts) throws ...;
public int add_partitions(PartitionSpec new_parts) throws ...;
where the PartitionSpec looks like:
public interface PartitionSpec {
  public List<Partition> getPartitions();
  public List<String> getPartNames();
  public Iterator<Partition> getPartitionIter();
  public Iterator<String> getPartNameIter();
}
For add_partitions(), an HDFSDirBasedPartitionSpec class could implement PartitionSpec, store a top-level directory, and return Partition instances derived from sub-directory names, while storing a single StorageDescriptor for all of them.
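A rough sketch of how such a class might look, against the PartitionSpec interface proposed above. (The class, its constructor arguments, and the assumption that each sub-directory name is itself the partition value are all simplifications for illustration.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

// Hypothetical sketch: Partition objects are materialized lazily from sub-directory
// names, all sharing (a copy of) one StorageDescriptor rooted at the top-level directory.
class HDFSDirBasedPartitionSpec implements PartitionSpec {

  private final String dbName;
  private final String tableName;
  private final StorageDescriptor sharedSd;  // location is the top-level date directory
  private final List<String> subDirNames;    // e.g. ["US", "UK", "IN"]

  HDFSDirBasedPartitionSpec(String dbName, String tableName,
                            StorageDescriptor sharedSd, List<String> subDirNames) {
    this.dbName = dbName;
    this.tableName = tableName;
    this.sharedSd = sharedSd;
    this.subDirNames = subDirNames;
  }

  // Builds one Partition from a sub-directory name. Assumes a single partition key
  // whose value is the directory name (a simplification of key=value layouts).
  private Partition toPartition(String subDir) {
    Partition p = new Partition();
    p.setDbName(dbName);
    p.setTableName(tableName);
    p.setValues(Collections.singletonList(subDir));
    StorageDescriptor sd = sharedSd.deepCopy();
    sd.setLocation(sharedSd.getLocation() + "/" + subDir);
    p.setSd(sd);
    return p;
  }

  @Override public List<Partition> getPartitions() {
    List<Partition> result = new ArrayList<Partition>();
    for (String subDir : subDirNames) {
      result.add(toPartition(subDir));
    }
    return result;
  }

  @Override public List<String> getPartNames() {
    return new ArrayList<String>(subDirNames);
  }

  // Lazy iteration: Partition objects are created one at a time, on demand.
  @Override public Iterator<Partition> getPartitionIter() {
    final Iterator<String> names = subDirNames.iterator();
    return new Iterator<Partition>() {
      @Override public boolean hasNext() { return names.hasNext(); }
      @Override public Partition next() { return toPartition(names.next()); }
      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }

  @Override public Iterator<String> getPartNameIter() {
    return subDirNames.iterator();
  }
}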
Similarly, list_partitions() could return a List<PartitionSpec>, where each PartitionSpec corresponds to a set of partitions that can share a StorageDescriptor.
By exposing iterator semantics, neither the client nor the metastore needs to instantiate all partitions at once, which should help with memory requirements.
In case no smart grouping is possible, we could just fall back on a DefaultPartitionSpec that simply wraps a List<Partition>, which is no worse than the status quo.
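For concreteness, a minimal sketch of such a fallback. (This class is hypothetical; taking the partition-key names in the constructor is just one way to support getPartNames().)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hive.metastore.api.Partition;

// Hypothetical fallback sketch: wraps a plain List<Partition>, so it is no more
// expensive than the existing API, just hidden behind the PartitionSpec interface.
class DefaultPartitionSpec implements PartitionSpec {

  private final List<String> partKeys;      // partition column names, e.g. ["dt", "country"]
  private final List<Partition> partitions;

  DefaultPartitionSpec(List<String> partKeys, List<Partition> partitions) {
    this.partKeys = partKeys;
    this.partitions = partitions;
  }

  @Override public List<Partition> getPartitions() {
    return partitions;
  }

  // Builds "key1=val1/key2=val2"-style names from each partition's values.
  @Override public List<String> getPartNames() {
    List<String> names = new ArrayList<String>();
    for (Partition p : partitions) {
      StringBuilder name = new StringBuilder();
      for (int i = 0; i < partKeys.size(); i++) {
        if (i > 0) {
          name.append('/');
        }
        name.append(partKeys.get(i)).append('=').append(p.getValues().get(i));
      }
      names.add(name.toString());
    }
    return names;
  }

  @Override public Iterator<Partition> getPartitionIter() {
    return partitions.iterator();
  }

  @Override public Iterator<String> getPartNameIter() {
    return getPartNames().iterator();
  }
}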
PartitionSpec abstracts away how a set of partitions may be represented. A tighter representation allows us to communicate metadata for a larger number of Partitions, with less Thrift traffic.
Given that Thrift doesn’t support polymorphism, we’d have to implement the PartitionSpec as a Thrift Union of supported implementations. (We could convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec sub-class.)
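Purely as a sketch of what the conversion might look like: TPartitionSpec and TSharedSDPartitionSpec below stand in for imagined Thrift-generated union/struct types (not the actual Hive IDL); Thrift unions expose isSet*() accessors to indicate which member is populated.

import java.util.List;
import org.apache.hadoop.hive.metastore.api.Partition;

// Hypothetical: dispatch on whichever member of the Thrift union is populated and
// convert it to the corresponding Java PartitionSpec implementation.
class PartitionSpecFactory {

  static PartitionSpec fromThrift(TPartitionSpec thriftSpec, List<String> partKeys) {
    if (thriftSpec.isSetSharedSDSpec()) {
      // Many partitions sharing one StorageDescriptor (e.g. the HDFS-dir-based case).
      return toSharedSDSpec(thriftSpec.getSharedSDSpec());
    } else if (thriftSpec.isSetPartitionList()) {
      // Fallback: a plain list of fully-specified Partition objects.
      return new DefaultPartitionSpec(partKeys, thriftSpec.getPartitionList());
    }
    throw new IllegalArgumentException("Unrecognized PartitionSpec representation.");
  }

  private static PartitionSpec toSharedSDSpec(TSharedSDPartitionSpec sharedSpec) {
    // Would build something like the HDFSDirBasedPartitionSpec sketched above; elided here.
    throw new UnsupportedOperationException("Elided in this sketch.");
  }
}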
Thoughts?
Issue Links
- is depended upon by: HIVE-7576 Add PartitionSpec support in HCatClient API (Closed)
- is related to: HIVE-19337 Partition whitelist regex doesn't work (and never did) (Patch Available)