[DAFFODIL-2128] XML preamble encoding ignored when CLI unparsing with "xml" infoset type - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.4.0
Component/s: CLI
Labels: None

Description

When using the CLI to unparse XML using the "xml" infoset type, we have the following code:

case "xml" => {
  val rdr = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(anyRef.asInstanceOf[Array[Byte]])))
  new XMLTextInfosetInputter(rdr)
}

To create the XMLTextInfosetInputter, we create an InputStreamReader without specifying an encoding. This means the Java "file.encoding" system property is used to decode the XML, regardless of the encoding declared in the XML preamble. So on machines where that property isn't UTF-8 (e.g. Windows), UTF-8 data in the XML can be decoded incorrectly, which leads to incorrect unparsed data.
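
The effect can be reproduced outside Daffodil with a minimal sketch. It is not the CLI code: the object name DefaultCharsetDemo, the helper readAll, and the sample document are made up for illustration, and ISO-8859-1 stands in for a non-UTF-8 platform default.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical demo object, not part of Daffodil.
object DefaultCharsetDemo {
  // UTF-8 encoded XML containing one non-ASCII character (U+00E9).
  val xmlBytes: Array[Byte] =
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>"
      .getBytes(StandardCharsets.UTF_8)

  // Read all characters through an InputStreamReader using the given charset,
  // mirroring what the CLI does implicitly with the platform default.
  def readAll(bytes: Array[Byte], cs: Charset): String = {
    val rdr = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), cs))
    try Iterator.continually(rdr.read()).takeWhile(_ != -1).map(_.toChar).mkString
    finally rdr.close()
  }

  def main(args: Array[String]): Unit = {
    // Decoding with the charset the bytes were written in preserves the text.
    val good = readAll(xmlBytes, StandardCharsets.UTF_8)
    // Decoding with a different charset turns the two-byte sequence into two wrong characters.
    val bad = readAll(xmlBytes, StandardCharsets.ISO_8859_1)
    println(good.contains("caf\u00e9"))
    println(bad.contains("caf\u00e9"))
  }
}
```

The reader never looks at the preamble, so the declared encoding has no influence on how the bytes are decoded.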

I believe Woodstox can inspect the XML and determine the encoding from the preamble, so we should take advantage of that: change the XMLTextInfosetInputter to accept an InputStream in the constructor instead of a Reader, and deprecate the Reader constructor.
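
As a rough illustration of why an InputStream helps, the sketch below hands raw bytes to the standard StAX API (javax.xml.stream), which Woodstox implements, and lets the parser pick the encoding from the XML declaration. It does not show the actual XMLTextInfosetInputter change; the object name PreambleEncodingDemo and the helper declaredEncodingAndText are hypothetical, and the code runs against whichever StAX implementation is on the classpath.

```scala
import java.io.ByteArrayInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Hypothetical demo object, not part of Daffodil.
object PreambleEncodingDemo {
  // XML that is both declared and encoded as ISO-8859-1 (i.e. not UTF-8).
  val xmlBytes: Array[Byte] =
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>"
      .getBytes("ISO-8859-1")

  // Parse from a byte stream so the parser, not a Reader, chooses the charset.
  // Returns the encoding named in the XML declaration and the concatenated text content.
  def declaredEncodingAndText(bytes: Array[Byte]): (String, String) = {
    val factory = XMLInputFactory.newInstance()
    val xsr = factory.createXMLStreamReader(new ByteArrayInputStream(bytes))
    try {
      // Encoding from the XML declaration (null if none was declared).
      val declared = xsr.getCharacterEncodingScheme
      val sb = new StringBuilder
      while (xsr.hasNext) {
        if (xsr.next() == XMLStreamConstants.CHARACTERS) sb.append(xsr.getText)
      }
      (declared, sb.toString)
    } finally xsr.close()
  }

  def main(args: Array[String]): Unit = {
    val (declared, text) = declaredEncodingAndText(xmlBytes)
    println(declared)
    println(text)
  }
}
```

Because no Reader sits between the bytes and the parser, the result does not depend on "file.encoding".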

People

Assignee: Steve Lawrence

Reporter: Steve Lawrence

Votes: 0

Watchers: 2

Dates

Created: 15/May/19 11:57

Updated: 20/May/21 12:34

Resolved: 16/May/19 17:32