Description
CrawlDatum already allows Jexl expressions on its metadata fields, but there is no way to select on attributes such as fetchTime and modifiedTime.
This change includes a rudimentary date parser that supports only the yyyy-MM-dd'T'HH:mm:ss'Z' format:
Dump everything with a modifiedTime later than 2016-03-20T00:00:00Z:
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
Dump everything that is an HTML file:
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
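To make the date handling in the first example more concrete, here is a minimal Java sketch of one way an unquoted date literal could be turned into something Jexl can compare numerically. It assumes the literal is replaced by its epoch-millisecond value before the expression is evaluated; that is an assumption for illustration, not necessarily how the patch does it. The class and method names (DateLiteralSketch, toEpochMillis, replaceDateLiterals) are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the Nutch code base.
final class DateLiteralSketch {

    // The only format the description says is supported.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");

    // Matches unquoted literals such as 2016-03-20T00:00:00Z.
    private static final Pattern DATE_LITERAL =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z");

    // Parses one literal as UTC and returns milliseconds since the epoch.
    static long toEpochMillis(String literal) {
        LocalDateTime utc = LocalDateTime.parse(literal, FORMAT);
        return utc.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Assumption: each date literal is rewritten to epoch milliseconds
    // so that a comparison like "modifiedTime > ..." becomes numeric.
    static String replaceDateLiterals(String expr) {
        Matcher m = DATE_LITERAL.matcher(expr);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Long.toString(toEpochMillis(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "(modifiedTime > 1458432000000)"
        System.out.println(
            replaceDateLiterals("(modifiedTime > 2016-03-20T00:00:00Z)"));
    }
}
```

Anything that does not match the pattern exactly, such as a date with a time zone offset or fractional seconds, is left untouched by this sketch, which mirrors the "rudimentary" scope described above.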
Keep in mind:
- Jexl doesn't allow a hyphen/minus in field identifiers, so hyphens are transformed to underscores (e.g. Content-Type becomes Content_Type)
- string literals must be in quotes; inside a literal, only the surrounding quote character needs to be escaped with a backslash
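The two notes above can be illustrated with a small Java sketch that builds an expression fragment from a raw metadata key and value. Both helpers (toJexlIdentifier, quoteLiteral) are hypothetical names introduced here, and the sketch assumes single quotes as the surrounding quote character; it is not the code used by readdb.

```java
// Hypothetical helpers illustrating the two notes; not part of Nutch.
final class JexlFieldSketch {

    // Hyphens are not valid in Jexl identifiers, so map them to underscores.
    static String toJexlIdentifier(String metadataKey) {
        return metadataKey.replace('-', '_');
    }

    // Wraps a value in single quotes and backslash-escapes any single
    // quote inside it. Other characters are left as they are (assumption).
    static String quoteLiteral(String value) {
        return "'" + value.replace("'", "\\'") + "'";
    }

    public static void main(String[] args) {
        // prints "Content_Type == 'text/html'"
        System.out.println(
            toJexlIdentifier("Content-Type") + " == " + quoteLiteral("text/html"));
    }
}
```

When the expression is passed on the command line, as in the examples above, the shell's own quoting rules apply on top of this.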