[SOLR-2347] Use InputStream and not Reader for XML parsing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: 4.9, 6.0
Component/s: contrib - DataImportHandler
Labels: None

Description

Solr mostly uses java.io.Reader and passes this Reader to the XML parser. According to the XML spec, an XML file should initially be treated as a binary stream with a default charset of UTF-8 or another charset given by the network protocol (like the Content-Type header in HTTP). Importantly, this default charset is only a "hint" to the parser; the charset given in the XML header processing instruction is the mandatory one. Because of this, the parser must be able to change the charset while reading the XML header (and possibly also when it sees a BOM marker). This is not possible if the XML parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 already fixed this for the XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This issue should fix the remaining places so that they conform to the XML spec (for example, opening schema.xml and config.xml as an InputStream instead of a Reader).

This change would not break anything in Solr (perhaps only backwards compatibility in the API), as the default used by XML parsers is UTF-8.
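
The difference described above between handing the parser a Reader and handing it an InputStream can be sketched with the JDK's built-in DOM parser. This is only an illustration under stated assumptions, not code from Solr or from the attached patch: the class name XmlCharsetDemo, its methods, and the sample document are all hypothetical. The sample is encoded as ISO-8859-1 and declares that encoding in its XML header. The Reader variant decodes the bytes as UTF-8 before the parser ever sees them.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Solr.
class XmlCharsetDemo {

    // Sample document: ISO-8859-1 bytes with a matching encoding declaration.
    // \u00FC is the letter u with diaeresis, a single byte (0xFC) in ISO-8859-1.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>M\u00FCller</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The parser receives raw bytes, so it can read the XML header
    // and pick the declared charset itself.
    static String parseFromStream(byte[] bytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        }
    }

    // The caller fixes the charset (here: UTF-8) before parsing. The parser
    // only sees already-decoded characters and cannot switch charsets.
    static String parseFromReader(byte[] bytes) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleBytes();
        System.out.println("InputStream: " + parseFromStream(bytes));
        System.out.println("Reader:      " + parseFromReader(bytes));
    }
}
```

With the InputStream variant the element text should come back intact, while the Reader variant is expected to return a damaged string, because the byte 0xFC is not valid UTF-8 and is replaced during decoding.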

Attachments

SOLR-2347.patch
03/Jan/13 22:08
31 kB
James Dyer

Issue Links

is related to

SOLR-4096 DIH - FileDataSource & FieldReaderDataSource should default to UTF-8 charset

Closed

SOLR-96 Solr should support alternate charsets for XML update messages

Closed

is superseded by

SOLR-14783 Remove DIH from 9.0

Closed

People

Assignee: Uwe Schindler

Reporter: Uwe Schindler

Votes: 0

Watchers: 2

Dates

Created: 03/Feb/11 15:31

Updated: 29/Aug/20 20:21

Resolved: 29/Aug/20 20:21