Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Phoenix's MapReduce integration lives in PhoenixInputFormat. It implements getSplits() by calculating a QueryPlan for the provided SELECT query, and each split gets a mapper. As part of this QueryPlan generation, we grab all RegionLocations from .META.
In PhoenixInputFormat:getQueryPlan:
// Initialize the query plan so it sets up the parallel scans
queryPlan.iterator(MapReduceParallelScanGrouper.getInstance());
In MapReduceParallelScanGrouper.getRegionBoundaries()
return context.getConnection().getQueryServices().getAllTableRegions(tableName);
This is fine.
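The driver-side flow described above can be sketched as follows. This is a simplified illustration, not Phoenix's real API: ScanRange stands in for the parallel scans the QueryPlan produces, and the one-split-per-scan mapping stands in for PhoenixInputFormat's actual InputSplit construction.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the getSplits flow: the plan's parallel scans
// (one per region boundary fetched from .META) each become one input
// split, i.e. one mapper. ScanRange is a stand-in type, not Phoenix's.
public class GetSplitsSketch {
    static class ScanRange {
        final String startKey, stopKey;
        ScanRange(String startKey, String stopKey) {
            this.startKey = startKey;
            this.stopKey = stopKey;
        }
    }

    // Driver-side step: one split per parallel scan in the plan.
    static List<ScanRange> getSplits(List<ScanRange> parallelScans) {
        return new ArrayList<>(parallelScans);
    }

    public static void main(String[] args) {
        List<ScanRange> scans = List.of(
            new ScanRange("", "m"), new ScanRange("m", ""));
        System.out.println(getSplits(scans).size()); // one mapper per scan
    }
}
```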
Unfortunately, each mapper Task spawned by the job goes through this same exercise. It passes a MapReduceParallelScanGrouper to queryPlan.iterator(), which I believe eventually causes getRegionBoundaries to be called when the scans are initialized in the result iterator.
Since HBase 1.x and up removed .META prefetching and caching from the HBase client, not only will each Job make potentially thousands of calls to .META; potentially thousands of Tasks will each make potentially thousands of calls to .META as well.
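The amplification can be put in back-of-envelope numbers. The region count below is an assumed example figure, not a measurement from the report:

```java
// Sketch of the .META request amplification: the driver reads every
// region location once, then each of the (regions) mapper tasks reads
// all of them again. Figures are illustrative assumptions.
public class MetaCallAmplification {
    static long totalMetaCalls(long regions) {
        long driverCalls = regions;           // getSplits on the client
        long taskCalls = regions * regions;   // every task re-reads all locations
        return driverCalls + taskCalls;
    }

    public static void main(String[] args) {
        // For an assumed 1000-region table: 1000 + 1000*1000 = 1001000
        System.out.println(totalMetaCalls(1000)); // prints 1001000
    }
}
```

So a single job over a 1000-region table could drive roughly a million meta lookups, versus a thousand if only the driver did the lookup.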
We should get a QueryPlan and set up the scans without having to read all RegionLocations, either by using the mapper's internal knowledge of its split key range, or by serializing the query plan on the client and sending it to the mapper tasks for use there.
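The second option might look like the following toy sketch: the driver encodes each scan's key range into a single string it could store in the job configuration, and each mapper decodes its ranges without touching .META. The real fix would serialize Phoenix's actual Scan/QueryPlan objects; the "start,stop" Base64 encoding here is purely illustrative and assumes keys contain no commas.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Toy sketch of shipping the driver's scan boundaries to mappers via
// configuration instead of re-reading .META in every task. Each range
// is "start,stop" Base64-encoded; ranges are joined with ';'.
public class PlanSerializationSketch {
    // Driver side: encode once, store in the job Configuration.
    static String encode(List<String[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (String[] r : ranges) {
            if (sb.length() > 0) sb.append(';');
            sb.append(Base64.getEncoder().encodeToString(
                (r[0] + "," + r[1]).getBytes(StandardCharsets.UTF_8)));
        }
        return sb.toString();
    }

    // Mapper side: decode from configuration, no .META call needed.
    static List<String[]> decode(String conf) {
        List<String[]> ranges = new ArrayList<>();
        for (String token : conf.split(";")) {
            String decoded = new String(
                Base64.getDecoder().decode(token), StandardCharsets.UTF_8);
            ranges.add(decoded.split(",", 2));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String[]> ranges = List.of(
            new String[]{"a", "m"}, new String[]{"m", "z"});
        String conf = encode(ranges);
        System.out.println(decode(conf).size()); // prints 2
    }
}
```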
Note that MapReduce tasks over snapshots are not affected by this, because region locations are stored in the snapshot manifest.
Attachments
Issue Links
- relates to PHOENIX-5362 Mappers should use the queryPlan from the driver rather than regenerating the plan (Open)