Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 2.2
- None
Description
The microdata extractor calculates the subject of a triple as the hashCode() of the itemscope.
Java's hashCode() method returns a 32-bit integer and is not guaranteed to be collision-free. This is especially true here, since the ItemScope.hashCode() method is not written very well.
This means that two microdata items can accidentally be merged into one.
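To make the collision risk concrete, here is a minimal sketch that uses plain strings rather than real ItemScope objects. The class HashCollisionDemo and the helper hashLabel are hypothetical names made up for this illustration. They only mimic the idea of deriving a label from hashCode() and are not Any23 code.

```java
import java.util.HashMap;
import java.util.Map;

class HashCollisionDemo {
    // Hypothetical stand-in for a hash-derived blank node label.
    static String hashLabel(Object item) {
        return Integer.toString(item.hashCode());
    }

    public static void main(String[] args) {
        // Two different values that happen to share one hash code.
        String first = "Aa";
        String second = "BB";

        // Key each "item" by its hash-derived label, as the extractor does for subjects.
        Map<String, String> subjects = new HashMap<>();
        subjects.put(hashLabel(first), first);
        subjects.put(hashLabel(second), second); // overwrites the entry for "Aa"

        System.out.println(hashLabel(first).equals(hashLabel(second))); // true
        System.out.println(subjects.size()); // 1: the two items were merged
    }
}
```

Any pair of distinct itemscopes whose hash codes coincide would end up sharing a subject in the same way.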
Here's the line that needs to be changed:
I recommend changing
subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
to
subject = RDFUtils.bnode();
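The recommendation assumes that RDFUtils.bnode() hands out a new blank node on every call. The sketch below illustrates that idea with a simple counter. FreshLabelDemo and freshLabel are hypothetical names for this example and say nothing about how Any23 actually generates its labels.

```java
import java.util.concurrent.atomic.AtomicLong;

class FreshLabelDemo {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Hypothetical helper: returns a label that has not been handed out before in this JVM.
    static String freshLabel() {
        return "node" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have equal hash codes, yet each gets its own label.
        String labelForFirst = freshLabel();
        String labelForSecond = freshLabel();
        System.out.println(labelForFirst.equals(labelForSecond)); // false
    }
}
```

One consequence of this approach is that the label can no longer be recomputed from the item, so the caller would need to keep the subject around if the same item is referenced again.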
We could also use itemScope.getItemId() if it's not null, even if it's not a URL. One example of such an id is:
urn:isbn:0-330-34032-8
Edit: according to the microdata spec, urn:isbn:0-330-34032-8 is an absolute URL. Since the spec's definition of URL seems to correspond more closely to our definition of URI, we should check for absolute URLs with URI.isAbsolute() rather than with URL.getProtocol() != null.
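A rough sketch of the difference between the two checks, using only java.net classes. ItemIdCheck, isAbsoluteId and parsesAsUrl are hypothetical names for this illustration and are not part of Any23.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class ItemIdCheck {
    // True when the id parses as a URI and carries a scheme (for example "urn" or "http").
    static boolean isAbsoluteId(String itemId) {
        if (itemId == null) {
            return false;
        }
        try {
            return new URI(itemId.trim()).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // True when java.net.URL accepts the id. URL needs a protocol handler,
    // which a stock JDK does not normally provide for "urn".
    static boolean parsesAsUrl(String itemId) {
        try {
            new URL(itemId); // deprecated in recent JDKs, used here only for comparison
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String id = "urn:isbn:0-330-34032-8";
        System.out.println("URI.isAbsolute(): " + isAbsoluteId(id));
        System.out.println("parses as URL:    " + parsesAsUrl(id));
    }
}
```

With a URL-based check, an id such as the ISBN URN above would usually be rejected before getProtocol() could even be called, which is why the URI-based check is the better fit for the spec's notion of an absolute URL.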
Issue Links
- blocks ANY23-340 Any23 extraction does not pass Nutch plugin test (Resolved)