Description
The DeduplicationJob may fail with an IllegalArgumentException on invalid percent encodings in URLs:
2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
    at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
    at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
    at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
    ...
Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
The IllegalArgumentException should be caught and logged instead of failing the reduce task. If only one of the two URLs with duplicated content has an invalid percent encoding, that URL should be flagged as the duplicate so that the valid URL "survives".
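
A minimal sketch of the proposed behavior follows. It is not the actual DeduplicationJob code: the class DedupUrlDecodeSketch, the helpers decodeOrNull and pickDuplicate, and the example URLs are hypothetical names chosen for illustration. The "longer URL is the duplicate" tie-break and the fallback for two invalid URLs are assumptions that stand in for whatever comparison the reducer really applies. It uses URLDecoder.decode(String, Charset), which requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DedupUrlDecodeSketch {

  /**
   * Decodes a URL, returning null instead of throwing when the URL
   * contains an invalid percent encoding such as "%YR".
   */
  static String decodeOrNull(String url) {
    try {
      return URLDecoder.decode(url, StandardCharsets.UTF_8);
    } catch (IllegalArgumentException e) {
      // Log and continue; a real job would use its own logger.
      System.err.println("Cannot decode URL " + url + ": " + e.getMessage());
      return null;
    }
  }

  /** Returns the URL that should be flagged as the duplicate. */
  static String pickDuplicate(String urlA, String urlB) {
    String a = decodeOrNull(urlA);
    String b = decodeOrNull(urlB);
    if (a == null && b != null) {
      return urlA; // only A is invalid: the valid URL B survives
    }
    if (b == null && a != null) {
      return urlB; // only B is invalid: the valid URL A survives
    }
    if (a == null) {
      // Both invalid (assumption): compare the raw, undecoded strings.
      a = urlA;
      b = urlB;
    }
    // Hypothetical tie-break: flag the longer URL, keep the shorter one.
    return a.length() > b.length() ? urlA : urlB;
  }

  public static void main(String[] args) {
    System.out.println(
        pickDuplicate("http://example.com/a%YRb", "http://example.com/ab"));
  }
}
```

With this approach a single malformed URL is logged and loses the comparison, and the reduce task continues with the remaining records.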