Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
Impala 2.8.0
Description
The fix for IMPALA-4172/IMPALA-3653 introduced a performance regression for loading tables that have many partitions with:
1. inconsistent HDFS path qualification or
2. a custom location (not under the table root dir)
For the first issue consider a table whose root path is at 'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1' will not be recognized as being a descendant of the table root dir by FileSystemUtil.isDescendentPath() because of how Path.equals() behaves, even if 'hdfs://localhost:8020' is the default filesystem.
Such partitions are incorrectly recognized as having a custom location and are treated specially. The treatment of such partitions is very inefficient, as show in the following code snippets:
HdfsTable.loadAllPartitions():
... if (!dirsToLoad.contains(partDir) && !FileSystemUtil.isDescendantPath(partDir, tblLocation)) { <--- this condition will fail // This partition has a custom filesystem location. Load its file/block // metadata separately by adding it to the list of dirs to load. dirsToLoad.add(partDir); } ...
HdfsTable.loadMetadataAndDiskIds() calls HdfsTable.loadBlockMetadata() once for every location:
private void loadMetadataAndDiskIds(List<Path> locations, HashMap<Path, List<HdfsPartition>> partsByPath) { LOG.info(String.format("Loading file and block metadata for %s partitions: %s", partsByPath.size(), getFullName())); for (Path location: locations) { loadBlockMetadata(location, partsByPath); } LOG.info(String.format("Loaded file and block metadata for %s partitions: %s", partsByPath.size(), getFullName())); }
HdfsTable.loadBlockMetadata():
... // Clear the state of partitions under dirPath since they are now updated based // on the current snapshot of files in the directory. for (Map.Entry<Path, List<HdfsPartition>> entry: partsByPath.entrySet()) { <--- partsByPath has an entry for every partition in the table Path partDir = entry.getKey(); if (!FileSystemUtil.isDescendantPath(partDir, dirPath)) continue; for (HdfsPartition partition: entry.getValue()) { partition.setFileDescriptors(new ArrayList<FileDescriptor>()); } } ...
As a result, it means that we will call isDescendentPath() roughly #numLocations * #totalPartitions times which can add up fast for tables with many partitions.
There are two issues to fix here:
1. The bug in recognizing partitions under the root table dir (for inconsistent qualification of table/partition locations)
2. The expensive loop for partitions with custom locations (even if legitimately custom)
Attachments
Issue Links
- is broken by
-
IMPALA-4172 Switch from using getFileBlockLocations to BlockLocation methods (Potential 50% speedup in metadata loading)
- Resolved
- relates to
-
IMPALA-5042 Loading metadata for partitioned tables is slow due to usage of an ArrayList, potential 4x speedup
- Resolved