Solr / SOLR-7114

SimplePostTool fails crawling lucene.apache.org due to missing <html> tag

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Won't Fix
    • None
    • None
    • SimplePostTool

    Description

      A bunch of CMS pages lack the <html> and </html> tags. I don't know the history of this; was it intentional? I tried to fix it, but it's a bit confusing. (This is a spin-off from SOLR-7107.)

      Crawling lucene.apache.org with bin/post fails with 500 errors because Tika's autodetection sees <head> as the first tag and concludes that the document is XML.

      I think we're fine if all templates referred to from lib/path.pm have <html> tags added and none of them include each other. Currently, core.html is both a top-level page and, for some reason, also included from mirrors-core-latest-redir.html and mirrors-core-redir.html.
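
To find which rendered pages are affected, something like the following could help. It is only a rough sketch, not part of SimplePostTool or the CMS: it uses Python's standard html.parser to report the first start tag in a page, so a page that begins with <head> rather than <html> shows up immediately. The names FirstTagFinder, first_tag and starts_with_html_tag are made up for this illustration, and it assumes the page content has already been fetched into a string.

```python
from html.parser import HTMLParser


class FirstTagFinder(HTMLParser):
    """Records the name of the first start tag seen in the markup."""

    def __init__(self):
        super().__init__()
        self.first_tag = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        # A DOCTYPE declaration is not a start tag, so it is skipped.
        if self.first_tag is None:
            self.first_tag = tag


def first_tag(markup):
    """Return the first start tag in markup, or None if there is none."""
    finder = FirstTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.first_tag


def starts_with_html_tag(markup):
    """True if the first element in the page is <html>."""
    return first_tag(markup) == "html"


if __name__ == "__main__":
    # Hypothetical page content mimicking the problem described above.
    broken = "<head><title>News</title></head><body>...</body>"
    fixed = "<!DOCTYPE html>\n<html>" + broken + "</html>"
    print(first_tag(broken), starts_with_html_tag(broken))
    print(first_tag(fixed), starts_with_html_tag(fixed))
```

This only checks the opening tag; it says nothing about whether a matching </html> is present or how Tika itself decides on a type.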

      To reproduce the crawl errors:

      bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
      

      We could, in addition, improve SimplePostTool to send a content-type hint to Tika. Update: the tool already does this.

          People

            janhoy Jan Høydahl