[NUTCH-2787] CrawlDb JSON dump does not export metadata primitive data types correctly - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.17
Fix Version/s: 1.17
Component/s: crawldb
Labels: None
Environment:
Reproduced with:

commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
Merge: e61a8a3b f971ca1b
Author: Sebastian Nagel <snagel@apache.org>
Date:   Thu May 14 17:43:14 2020 +0200

    Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence

    NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file

Patch Info:

Patch Available

Description

To reproduce:

Activate the scoring-depth plugin
Create a new crawldb from a seed URL
Dump the crawldb as JSON
Inspect the JSON output

$ nutch inject crawl/crawldb seeds.txt
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
$ cat out/part-r-00000 | head -1 | python -m json.tool
{
    "url": "http://example.com/",
    "statusCode": 1,
    "statusName": "db_unfetched",
    "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
    "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
    "retriesSinceFetch": 0,
    "retryIntervalSeconds": 2592000,
    "retryIntervalDays": 30,
    "score": 1.0,
    "signature": "null",
    "metadata": {
        "_depth_": {},
        "_maxdepth_": {}
    }
}

KO => `_depth_` and `_maxdepth_` are exported as empty objects instead of integers.

The fields are correct in the crawldb, as shown by a CSV dump:

$ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
$ cat out/part-r-00000 
Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
"http://example.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||"

Code is here:

https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269

I do not know Java very well, but I think the problem comes from IntWritable and the other Writable wrappers not being POJO types (or at least not being serialized the way we want).

One fix might be to:

Map each primitive-type Writable class to a function that casts from the base interface and calls "get" (possibly boxing the value as well).
Call that function in the metadata conversion loop.
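
A minimal sketch of that idea, not the actual patch. So that it compiles without Hadoop on the classpath, `Writable`, `IntWritable`, `FloatWritable` and `Text` below are small stand-ins for the `org.apache.hadoop.io` classes, and `WritableUnwrapSketch`, `UNWRAPPERS`, `unwrap` and `toPlainMap` are hypothetical names that do not come from `CrawlDbReader`. Types without a registered unwrapper fall back to `toString()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class WritableUnwrapSketch {

    // Stand-ins for Hadoop's org.apache.hadoop.io types (assumption: only the
    // parts needed here are modelled, i.e. get() and toString()).
    interface Writable {}

    static final class IntWritable implements Writable {
        private final int value;
        IntWritable(int value) { this.value = value; }
        int get() { return value; }
    }

    static final class FloatWritable implements Writable {
        private final float value;
        FloatWritable(float value) { this.value = value; }
        float get() { return value; }
    }

    static final class Text implements Writable {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    // Hypothetical lookup table: Writable class -> function that casts the
    // base interface, calls get() and returns the boxed value.
    private static final Map<Class<? extends Writable>, Function<Writable, Object>> UNWRAPPERS =
            new LinkedHashMap<>();

    static {
        UNWRAPPERS.put(IntWritable.class, w -> ((IntWritable) w).get());
        UNWRAPPERS.put(FloatWritable.class, w -> ((FloatWritable) w).get());
    }

    // Returns a plain Java value a JSON serializer can handle directly.
    static Object unwrap(Writable w) {
        Function<Writable, Object> f = UNWRAPPERS.get(w.getClass());
        return f != null ? f.apply(w) : w.toString();
    }

    // The "metadata conversion loop": keys become strings, values are unwrapped.
    static Map<String, Object> toPlainMap(Map<? extends Writable, ? extends Writable> metadata) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<? extends Writable, ? extends Writable> e : metadata.entrySet()) {
            out.put(e.getKey().toString(), unwrap(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Writable, Writable> metadata = new LinkedHashMap<>();
        metadata.put(new Text("_depth_"), new IntWritable(1));
        metadata.put(new Text("_maxdepth_"), new IntWritable(5));
        System.out.println(toPlainMap(metadata));
    }
}
```

With a map like this handed to the JSON writer, `_depth_` and `_maxdepth_` would be emitted as numbers. Whether the real fix uses a lookup table or a chain of `instanceof` checks is a design choice this sketch does not settle.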

Issue Links

links to: GitHub pull request #531

People

Assignee: Sebastian Nagel

Reporter: Patrick Mézard

Votes: 0

Watchers: 4

Dates

Created: 04/Jun/20 13:28

Updated: 28/Jan/21 13:15

Resolved: 10/Jun/20 18:35