Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Description
All of the following addresses are failing:
nutch-user@nutch.apache.org
nutch-user-subscribe@nutch.apache.org
nutch-user-subscribe@lucene.apache.org
For the last one, the mailer daemon said
"This mailing list has moved to user at nutch.apache.org."
Below is the message I tried to send:
Hi people,
I've been banging my head against this problem for two days now.
Simply, I want to add a field with the value of a given meta tag.
I've been trying the parse-xml plugin, but it doesn't seem to work
with version 1.0. I've tried the code at
http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
and it hasn't worked. I don't even know why. I don't even know if my
plugin is being used... or even looked for! Nutch seems to have an
infuriating "fail silently" policy for plugins. I put a
System.exit(1) in my filters just to see if my code is even being
reached. It is not, in spite of my config telling Nutch to load it.
Here's my config:
nutch-site.xml
...
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
</property>
...
parse-plugins.xml
...
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
<plugin id="metadata" />
</mimeType>
<mimeType name="text/html">
<plugin id="parse-html" />
<plugin id="metadata" />
</mimeType>
<mimeType name="text/sgml">
<plugin id="parse-html" />
<plugin id="metadata" />
</mimeType>
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="metadata" />
<plugin id="feed" />
</mimeType>
...
<alias name="metadata"
extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"
/>
...
I've also copied the plugin.xml and jar from my build/metadata to the
plugins root dir.
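For readers following along, a plugin.xml for an HTML parse filter usually has roughly the shape sketched below. This is an illustration, not the file from this report: the jar name, plugin name, provider, and implementation id are assumptions, and only the plugin id "metadata" and the filter class name are taken from the config above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: name, provider-name and the jar file name are assumed. -->
<!-- The id should match the entry used in plugin.includes ("metadata"). -->
<plugin id="metadata"
        name="Meta Tag Extractor"
        version="1.0.0"
        provider-name="example.com">

  <runtime>
    <!-- Assumed jar name; it must match the jar actually built. -->
    <library name="metadata.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- Registers the class against the HTML parse filter extension point. -->
  <extension id="com.example.website.nutch.parsing"
             name="Meta Tag Extractor Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="MetaTagExtractorParseFilter"
                    class="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"/>
  </extension>
</plugin>
```

The plugins bundled with Nutch each keep their descriptor and jar in a subdirectory of the plugins directory named after the plugin, so a layout such as plugins/metadata/plugin.xml (directory name assumed here) may be worth comparing against files placed directly in the plugins root.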
Nonetheless, Nutch runs and puts data in Solr for me. AFAIK, Nutch is
completely unaware of my plugin despite my config options. Is there
some other place I need to tell Nutch to use my plugin? Is there some
other approach to do this without having to write a plugin? This does
seem like a lot of work to simply get a meta tag into a field. Any
help would be appreciated.
Sincerely,
John Sherwood