Description
Current implementation:
"This command takes a path to a crawldb as parameter and finds duplicates based on the signature. If several entries share the same signature, the one with the highest score is kept. If the scores are the same, then the fetch time is used to determine which one to keep with the most recent one being kept. If their fetch times are the same we keep the one with the shortest URL."
The order in which these criteria are applied to select the entry to keep is currently fixed. I therefore propose a new option that makes it configurable:
-compareOrder <score>,<fetchTime>,<urlLength>
I have written a patch against trunk (rev 1730516) and look forward to any peer review.
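To illustrate the idea, here is a minimal sketch of how a configurable comparison order could be built. It is not the attached patch and does not use Nutch classes: `Entry`, `buildComparator` and `pickKept` are hypothetical names, and the sketch assumes the option value is a comma-separated list of the literal keys `score`, `fetchTime` and `urlLength`. With the order `score,fetchTime,urlLength` it reproduces the current behaviour described above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class CompareOrderSketch {

    /** Hypothetical stand-in for a crawldb entry; not Nutch's CrawlDatum. */
    static final class Entry {
        final String url;
        final float score;
        final long fetchTime;

        Entry(String url, float score, long fetchTime) {
            this.url = url;
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    /** Builds a comparator in which the entry to keep sorts first. */
    static Comparator<Entry> buildComparator(String compareOrder) {
        Comparator<Entry> combined = null;
        for (String key : compareOrder.split(",")) {
            Comparator<Entry> next;
            switch (key.trim()) {
                case "score":      // higher score wins
                    next = (a, b) -> Float.compare(b.score, a.score);
                    break;
                case "fetchTime":  // more recent fetch wins
                    next = (a, b) -> Long.compare(b.fetchTime, a.fetchTime);
                    break;
                case "urlLength":  // shorter URL wins
                    next = (a, b) -> Integer.compare(a.url.length(), b.url.length());
                    break;
                default:
                    throw new IllegalArgumentException("Unknown criterion: " + key);
            }
            // Later criteria only break ties left by earlier ones.
            combined = (combined == null) ? next : combined.thenComparing(next);
        }
        if (combined == null) {
            throw new IllegalArgumentException("Empty compare order");
        }
        return combined;
    }

    /** Returns the entry that would be kept among duplicates sharing a signature. */
    static Entry pickKept(List<Entry> duplicates, String compareOrder) {
        return Collections.min(duplicates, buildComparator(compareOrder));
    }

    public static void main(String[] args) {
        List<Entry> duplicates = Arrays.asList(
            new Entry("http://example.com/a/long/path", 1.0f, 100L),
            new Entry("http://example.com/b", 1.0f, 200L),
            new Entry("http://example.com/c", 0.5f, 300L));
        System.out.println(pickKept(duplicates, "score,fetchTime,urlLength").url);
        System.out.println(pickKept(duplicates, "fetchTime,score,urlLength").url);
    }
}
```

Chaining the comparators with `thenComparing` keeps each criterion independent, so any permutation or subset of the three keys yields a valid ordering without special cases.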