Description
Links extracted by the LinkContentHandler contain the verbatim anchor text. This is usually fine, but many websites spread the anchor text over multiple lines or indent it with tabs or spaces, so the extracted text carries that extra whitespace.
This patch adds a boolean option to LinkContentHandler that toggles whitespace collapsing on or off. The default behaviour remains as-is, and the API remains backward compatible.
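To illustrate what whitespace collapsing means for anchor text, here is a minimal sketch. It is not the patch itself: the class `AnchorTextSketch`, the method `anchorText`, and the parameter `collapseWhitespace` are hypothetical names chosen for this example, and the actual option on LinkContentHandler may be named and wired differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika or Nutch.
final class AnchorTextSketch {

    // Matches one or more whitespace characters (spaces, tabs, line breaks).
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private AnchorTextSketch() {
    }

    // Returns the anchor text verbatim when collapseWhitespace is false
    // (mirroring the unchanged default described above). Otherwise replaces
    // every whitespace run with a single space and trims both ends.
    static String anchorText(String raw, boolean collapseWhitespace) {
        if (raw == null || !collapseWhitespace) {
            return raw;
        }
        return WHITESPACE.matcher(raw).replaceAll(" ").trim();
    }
}
```

With collapsing enabled, an anchor such as "\n\t  Read   the\n\t  docs \n" would come out as "Read the docs". With it disabled, the text is returned unchanged.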
Attachments
Issue Links
- is related to NUTCH-1233 Rely on Tika for outlink extraction (Closed)