Description
There are many issues about missing date formats:
NUTCH-871
NUTCH-912
NUTCH-1015
The date formats can be diverse, so why not move them to a separate config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt" (placed in "conf"), which is loaded on startup.
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);
  URL res = conf.getResource("date-styles.txt");
  if (res == null) {
    LOG.error("Can't find resource: date-styles.txt");
  } else {
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        if (StringUtils.isBlank(dateStyle)) {
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle = StringUtils.trim(dateStyle);
        if (dateStyle.startsWith("#")) {
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      LOG.error("Failed to load resource: date-styles.txt");
    }
  }
}
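The filtering rules used above (skip blank lines, skip lines starting with "#", trim the rest) can also be written without removing entries from the list during iteration. The following is only a minimal sketch using the JDK alone; the class name DateStyleLoader and its load method are hypothetical and are not part of the patch. In Nutch the Reader would presumably be obtained from the Hadoop Configuration (for example via getConfResourceAsReader) rather than from a File, which should also cover resources packaged inside a jar.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from the attached patch.
class DateStyleLoader {

  /**
   * Reads one date pattern per line, trimming whitespace and skipping
   * blank lines and comment lines that start with '#'.
   */
  static String[] load(Reader in) {
    List<String> styles = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String style = line.trim();
        if (style.isEmpty() || style.startsWith("#")) {
          continue; // nothing to keep on this line
        }
        styles.add(style);
      }
    } catch (IOException e) {
      // Rethrown unchecked to keep the sketch short; real code would log it.
      throw new UncheckedIOException(e);
    }
    return styles.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Assumed sample content for date-styles.txt; the real file is in the patch.
    String sample = "# ISO-like timestamps\n\n  yyyy-MM-dd'T'HH:mm:ss  \n"
        + "EEE, dd MMM yyyy HH:mm:ss zzz\n";
    for (String style : load(new StringReader(sample))) {
      System.out.println(style);
    }
  }
}
```

Collecting the kept lines into a new list avoids the index bookkeeping (i--) that the in-place removal needs.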
Then parse "lastModified" like this (sample):
private long getTime(String date, String url) {
  ......
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ......
  return time;
}
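To illustrate what trying a list of configured patterns amounts to, here is a rough JDK-only sketch that tries each pattern in order and accepts the first one that consumes the whole input. It is not the code of DateUtils.parseDate, which the patch actually calls; the class name MultiPatternDateParser, the -1 return value for unparseable input, and the choice of UTC for dates without a zone are all assumptions made for this example.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not taken from the attached patch.
class MultiPatternDateParser {

  /**
   * Returns the epoch time in milliseconds for the first pattern that
   * matches the entire input string, or -1 if no pattern matches.
   */
  static long parse(String date, String[] patterns) {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
      format.setLenient(false);
      // Assumption: dates without an explicit zone are interpreted as UTC.
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos); // returns null on failure
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    String[] styles = {
      "EEE, dd MMM yyyy HH:mm:ss zzz",
      "yyyy-MM-dd'T'HH:mm:ss"
    };
    // The date string reported as unparseable in NUTCH-1015.
    System.out.println(parse("2006-05-24T20:03:42", styles));
  }
}
```

With a pattern such as yyyy-MM-dd'T'HH:mm:ss present in the list, the date from NUTCH-1015 parses; supporting a new format then only requires a new line in date-styles.txt rather than a code change.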
This patch also contains the patch for NUTCH-1140.
More details can be found in the patch file.
Attachments
Issue Links
- is depended upon by
  NUTCH-1015 MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42 (Closed)