ManifoldCF / CONNECTORS-1660

Patch for MCF HTML extractor connector

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: ManifoldCF next
    • Component/s: HTML extractor
    • Labels: None

    Description

      Hello,

      Here is a patch for the HTML extractor connector that changes how text is extracted, with or without HTML stripping: patch_html_extractor_connector_02_12_2020.txt

      • Extraction of HTML code: I added a whitelist, applied through the Jsoup cleaner, that defines which HTML elements are allowed, in order to enforce security. In the code it is set to “relaxed”:

      This whitelist allows a full range of text and structural body HTML: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

      (more details here: https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed())

      A future improvement would be to add a parameter to the interface to select which whitelist to use.
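
      For readers unfamiliar with the Jsoup cleaner, here is a minimal sketch of what cleaning with the “relaxed” whitelist can look like. It is not taken from the patch: the class name RelaxedCleanSketch and the method cleanHtml are hypothetical, and the connector may wire the cleaner differently. It assumes a jsoup version that still ships org.jsoup.safety.Whitelist (newer jsoup releases rename this class to Safelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

// Hypothetical illustration, not the code from the attached patch.
public class RelaxedCleanSketch {

    // Parses the raw HTML, then copies only whitelisted elements and
    // attributes into a new document; everything else is dropped.
    static String cleanHtml(String rawHtml) {
        Document dirty = Jsoup.parse(rawHtml);
        Cleaner cleaner = new Cleaner(Whitelist.relaxed());
        Document clean = cleaner.clean(dirty);
        return clean.body().html();
    }

    public static void main(String[] args) {
        String raw = "<p>Hello <b>world</b><script>alert('x')</script></p>";
        // <p> and <b> are in the relaxed whitelist; <script> is not.
        System.out.println(cleanHtml(raw));
    }
}
```

      Making the whitelist selectable, as suggested above, would mostly amount to replacing Whitelist.relaxed() with another factory such as Whitelist.basic() or Whitelist.none() based on a configuration value.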

       

      • Extraction of text with HTML stripping activated: only text nodes are kept, so all HTML is stripped (same behavior as before). The change is that the Jsoup pretty-print option is now set to false, so that line breaks are preserved.
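
      As an illustration of the pretty-print point, here is one common way to strip all HTML with Jsoup while keeping the line breaks of the source. This is a sketch under assumptions, not the patch itself: the class StripHtmlSketch and the method stripHtml are hypothetical names, and the patch may obtain the text nodes differently. It again assumes a jsoup version that provides Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// Hypothetical illustration, not the code from the attached patch.
public class StripHtmlSketch {

    // Whitelist.none() allows no elements, so only text nodes survive.
    // With prettyPrint(false), Jsoup does not normalise whitespace on
    // output, so newlines present in the source text are kept.
    // Note: the result is still HTML-escaped (e.g. "&" becomes "&amp;").
    static String stripHtml(String rawHtml) {
        Document.OutputSettings settings =
                new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(rawHtml, "", Whitelist.none(), settings);
    }

    public static void main(String[] args) {
        String raw = "<p>line one</p>\n<p>line two</p>";
        System.out.println(stripHtml(raw));
    }
}
```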

       

      Best regards

          People

            Assignee: Karl Wright (kwright@metacarta.com)
            Reporter: Olivier Tavard (olivierfl)