Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Fix Version: 1.17
- Component: None
Reproduced with:

commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
Merge: e61a8a3b f971ca1b
Author: Sebastian Nagel <snagel@apache.org>
Date:   Thu May 14 17:43:14 2020 +0200

    Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence

    NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
Patch Available
Description
To reproduce:
- Activate the scoring-depth plugin
- Create a new crawldb from a seed URL
- Dump the crawldb as JSON
- Look at the JSON
$ nutch inject crawl/crawldb seeds.txt
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
$ cat out/part-r-00000 | head -1 | python -m json.tool
{
    "url": "http://example.com/",
    "statusCode": 1,
    "statusName": "db_unfetched",
    "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
    "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
    "retriesSinceFetch": 0,
    "retryIntervalSeconds": 2592000,
    "retryIntervalDays": 30,
    "score": 1.0,
    "signature": "null",
    "metadata": {
        "_depth_": {},
        "_maxdepth_": {}
    }
}
KO => `_depth_` and `_maxdepth_` are not integers; they are serialized as empty objects.
The fields are correct in the crawldb, as shown by a CSV dump:
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
$ cat out/part-r-00000
Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
"http://example.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||"
Code is here:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269
I do not know Java very well, but I think the problem comes from IntWritable and friends not being POJO types (or at least not serializing the way we want).
One fix might be to:
- Map each primitive-type Writable class to a function that casts the base Writable interface to the concrete type and calls its "get" method (boxing the value as needed).
- Call that mapping in the metadata conversion loop.
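The mapping suggested above could look roughly like the sketch below. It is only an illustration of the idea, not the actual Nutch patch: the class name `WritableJsonHelper` and method `toJavaObject` are made up here, and the set of handled Writable types is just a plausible sample.

```java
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class WritableJsonHelper {

  // Unwrap common primitive Writable wrappers into plain (boxed) Java
  // values that a JSON serializer can render, instead of letting it
  // reflect over the Writable and emit an empty object.
  static Object toJavaObject(Writable value) {
    if (value instanceof IntWritable) {
      return ((IntWritable) value).get();      // boxed as Integer
    } else if (value instanceof LongWritable) {
      return ((LongWritable) value).get();     // boxed as Long
    } else if (value instanceof FloatWritable) {
      return ((FloatWritable) value).get();    // boxed as Float
    } else if (value instanceof BooleanWritable) {
      return ((BooleanWritable) value).get();  // boxed as Boolean
    } else if (value instanceof Text) {
      return value.toString();
    }
    // Fallback for unknown Writable types: at least a readable string.
    return value.toString();
  }
}
```

The metadata conversion loop would then call this helper on each metadata value before putting it into the map handed to the JSON serializer, so `_depth_` would come out as `1` and `_maxdepth_` as `5`.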