Details
-
Bug
-
Status: Resolved
-
Minor
-
Resolution: Won't Fix
-
None
-
None
Description
A bunch of CMS pages lack the <html> and </html> tags. I don't know the history of this, was it intentional? I tried to fix it, but it's a bit confusing. (This is a spinoff from SOLR-7107).
Crawling lucene.apache.org with bin/post fails with 500 errors since Tika autodetect sees <head> as the first tag and believes it is XML
I think we're fine if all templates referred to from lib/path.pm have <html> tags added, and that none of them include eachother. Currently, core.html is both a top-page and also included from mirrors-core-latest-redir.html and mirrors-core-redir.html for some reason.
To reproduce the crawl errors:
bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
We could in addition improve SimplePostTool to send a content-type hint to Tika. Update: The tool already does this