Description
With Hadoop 3 it is no longer allowed to have multiple dependencies with the same file name on the mapreduce.job.cache.files list.
The issue occurs when the same file name appears in multiple sharelib folders and/or in the application's lib folder. This can be avoided, but not easily in every case.
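For illustration, a minimal sketch of how such a collision arises (the HDFS paths are hypothetical). Per the related MAPREDUCE-4503, a list like this is rejected with an InvalidJobConfException because both entries localize to the same file name:

    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical paths: the same json.jar is pulled in both from a sharelib
    // folder and from the workflow application's lib folder.
    JobConf conf = new JobConf();
    conf.set("mapreduce.job.cache.files",
            "hdfs:///user/oozie/share/lib/hive/json.jar,"
            + "hdfs:///user/me/app/lib/json.jar");
    // On Hadoop 3 both entries localize as "json.jar", so job submission fails.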
I suggest removing the duplicates from this list.
A quick workaround in the JavaActionExecutor source code could look like this:
    removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.files");
    removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.archives");
    ......

    private void removeDuplicatedDependencies(JobConf conf, String key) {
        final String value = conf.get(key);
        if (value == null || value.isEmpty()) {
            return; // nothing to de-duplicate
        }
        final Map<String, String> nameToPath = new HashMap<>();
        final StringBuilder uniqList = new StringBuilder();
        for (String dependency : value.split(",")) {
            // The file name is the last path segment; this is what collides in Hadoop 3.
            final String[] arr = dependency.split("/");
            final String dependencyName = arr[arr.length - 1];
            if (nameToPath.containsKey(dependencyName)) {
                LOG.warn(dependencyName + " [" + dependency + "] is already defined in "
                        + key + ". Skipping...");
            } else {
                nameToPath.put(dependencyName, dependency);
                uniqList.append(dependency).append(",");
            }
        }
        if (uniqList.length() > 0) {
            uniqList.setLength(uniqList.length() - 1); // drop the trailing comma
        }
        conf.set(key, uniqList.toString());
    }
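For reference, a hypothetical invocation of the helper above against the colliding list from the earlier example:

    removeDuplicatedDependencies(conf, "mapreduce.job.cache.files");
    // conf.get("mapreduce.job.cache.files") now returns only
    // "hdfs:///user/oozie/share/lib/hive/json.jar"; the second json.jar
    // entry was skipped and a warning was logged.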
Another option is to eliminate the use of the deprecated org.apache.hadoop.filecache.DistributedCache.
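As a rough sketch of that direction (an assumed migration, not what Oozie currently does), the deprecated DistributedCache calls map to equivalent methods on org.apache.hadoop.mapreduce.Job; the paths below are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Hadoop 1-era, deprecated:
    //   DistributedCache.addCacheFile(new URI("hdfs:///apps/lib/json.jar"), conf);
    // Hadoop 2+ replacement:
    Job job = Job.getInstance(new Configuration());
    job.addCacheFile(new URI("hdfs:///apps/lib/json.jar"));        // mapreduce.job.cache.files
    job.addCacheArchive(new URI("hdfs:///apps/lib/mylib.tar.gz")); // mapreduce.job.cache.archives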
I am going to dig deeper into how we should use the distributed cache; all comments are welcome.
Issue Links
- is blocked by OOZIE-3219 Cannot compile with hadoop 3.1.0 (Closed)
- is related to MAPREDUCE-4493 Distibuted Cache Compatability Issues (Closed)
- relates to MAPREDUCE-4503 Should throw InvalidJobConfException if duplicates found in cacheArchives or cacheFiles (Closed)