[CONNECTORS-1660] Patch for MCF HTML extractor connector - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: ManifoldCF next
Component/s: HTML extractor
Labels: None

Description

Hello,

Here is a patch for the HTML extractor connector that changes text extraction both with and without HTML stripping: patch_html_extractor_connector_02_12_2020.txt

Extraction of HTML code: I added a whitelist, applied through the Jsoup cleaner, that defines which HTML elements are allowed, in order to enforce security. In the code I set it to “relaxed”:

This whitelist allows a full range of text and structural body HTML: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

(More details here: https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed())
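
A minimal sketch of the whitelist-based cleaning described above, assuming jsoup 1.13.x on the classpath (newer jsoup releases rename `Whitelist` to `Safelist`). The class and method names are hypothetical and are not taken from the attached patch:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical illustration; this is not the code from the attached patch.
class RelaxedWhitelistSketch {

    // Keeps only the elements and attributes permitted by jsoup's "relaxed" whitelist.
    static String cleanHtml(String html) {
        return Jsoup.clean(html, Whitelist.relaxed());
    }

    public static void main(String[] args) {
        String html = "<script>alert(1)</script><p>Hello <b>world</b></p>";
        // <script> is not in the relaxed whitelist, so the element is dropped;
        // <p> and <b> are allowed and are kept.
        System.out.println(cleanHtml(html));
    }
}
```

Swapping `Whitelist.relaxed()` for another preset such as `Whitelist.basic()` would be one way to make the allowed element set configurable.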

A future improvement would be to add a parameter to the interface for selecting which whitelist to use.

Extraction of text with HTML stripping activated: we keep only the text nodes, so all HTML is stripped (same behavior as before). The change is that the Jsoup pretty print option is now set to false in order to keep line breaks.
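
A sketch of how text-only extraction with pretty print disabled might look, assuming jsoup 1.13.x. Using `Whitelist.none()` together with `Jsoup.clean` is an assumption about how the stripping is done; the patch itself may do it differently, and the class and method names are hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// Hypothetical illustration; this is not the code from the attached patch.
class TextOnlySketch {

    // Removes every HTML element and keeps only the text nodes.
    static String stripHtml(String html) {
        // With pretty print disabled, jsoup does not normalise whitespace,
        // so line breaks inside text nodes are preserved.
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // Whitelist.none() allows no elements at all.
        return Jsoup.clean(html, "", Whitelist.none(), settings);
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>line one\nline two</p>"));
    }
}
```

Note that the string returned by `Jsoup.clean` is still HTML-escaped (for example `&` becomes `&amp;`), so a caller that needs plain text may have to unescape it.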

Best regards

Attachments

patch_html_extractor_connector_11_12_2020.txt (11/Dec/20 17:33, 1 kB, Olivier Tavard)
patch_html_extractor_connector_02_12_2020.txt (02/Dec/20 16:50, 1 kB, Olivier Tavard)

People

Assignee: Karl Wright

Reporter: Olivier Tavard

Votes: 0

Watchers: 2

Dates

Created: 02/Dec/20 16:52

Updated: 04/Oct/21 10:47