Details
Bug
Status: Closed
Major
Resolution: Fixed
0.4.1
None
Reviewed
Description
When HiveInputFormat.getPartitionDescFromPath is called from CombineHiveInputFormat, it sometimes fails to return a matching partitionDesc. This later causes an exception, because the split doesn't have an inputFormatClassName.
The issue is that the path format used as the key in pathToPartitionInfo varies between stages. In the first stage, the key is the complete path as returned from the table definitions (e.g. hdfs://server/path); in subsequent stages, it is the complete path with port (e.g. hdfs://server:8020/path) of the previous stage's result. This isn't a problem in HiveInputFormat, since the directory being looked up always uses the same format as the keys. In CombineHiveInputFormat, however, we take that path and look up its children in the file system to get all the block information, and then use one of the returned paths to get the partition info, and that returned path does not include the port. So, in any stage after the first, we are looking for a path without the port while all the keys in the map contain a port, and no match is found.
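The mismatch can be reproduced outside Hive with a plain map. The following is only a minimal sketch: the class name PathKeyMismatch, the helper laterStageMap, and the string standing in for a partitionDesc are hypothetical, and the real pathToPartitionInfo holds partition descriptors rather than strings.

```java
import java.util.HashMap;
import java.util.Map;

public class PathKeyMismatch {
    // Hypothetical stand-in for pathToPartitionInfo as keyed in a later stage:
    // the key is the full URI, including the port.
    static Map<String, String> laterStageMap() {
        Map<String, String> pathToPartitionInfo = new HashMap<>();
        pathToPartitionInfo.put("hdfs://server:8020/path", "partitionDesc-for-/path");
        return pathToPartitionInfo;
    }

    public static void main(String[] args) {
        Map<String, String> map = laterStageMap();

        // Path in the form described for the file system listing: no port.
        String listedPath = "hdfs://server/path";

        // The hash lookup compares keys exactly, so the port-less form misses.
        System.out.println("without port: " + map.get(listedPath));
        System.out.println("with port:    " + map.get("hdfs://server:8020/path"));
    }
}
```

Both strings name the same directory, but the lookup only succeeds when the key is spelled exactly as it was stored.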
The attached patch may not be ideal: it doesn't fix the underlying problem of inconsistent path formats in pathToPartitionInfo. It just works around it by walking through the map and looking for a matching path rather than doing a hash lookup.
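For illustration, a workaround of this kind might look roughly like the sketch below. This is not the attached patch: the class PartitionLookupSketch and the method findByPath are hypothetical, strings stand in for partition descriptors, and comparing only the path component of each URI (ignoring scheme, host and port) is an assumption about what counts as a match.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionLookupSketch {
    // Walks the map and returns the first value whose key has the same
    // path component as dir; returns null when no key matches.
    static String findByPath(Map<String, String> pathToPartitionInfo, String dir) {
        String wanted = URI.create(dir).getPath();
        for (Map.Entry<String, String> entry : pathToPartitionInfo.entrySet()) {
            String keyPath = URI.create(entry.getKey()).getPath();
            if (keyPath.equals(wanted)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("hdfs://server:8020/path", "partitionDesc-for-/path");

        // The port-less form now matches, because only "/path" is compared.
        System.out.println(findByPath(map, "hdfs://server/path"));
    }
}
```

The scan costs time linear in the number of map entries per lookup, where the hash lookup was constant time. Normalizing the keys to one format when the map is built would address the underlying inconsistency instead.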