Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.25.0, 2.0.0-M2
- Environment: any
Description
Summary
The ExecuteStreamCommand processor stores everything the invoked command writes to the error stream (stderr) into the FlowFile attribute execution.error.
When converting the bytes from the stream to a String, it interprets each individual byte as a Unicode codepoint. Since the bytes are read one at a time, this effectively decodes the stream as ISO-8859-1 (Latin-1).
Instead, it should use the system default encoding (like it already does for writing stdout if Output Destination Attribute is set) or use a configurable encoding (for both stdout and stderr).
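As a rough sketch of that fix (not the actual patch; the helper name is made up, and the charset would come from either the JVM default or a new processor property):
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

class ErrorStreamSketch {
    // Decode stderr with one explicit charset instead of casting bytes to chars.
    // Charset.defaultCharset() would match the current stdout behavior; a
    // configurable property could be passed in instead.
    static String readErrorStream(InputStream err, Charset charset) throws IOException {
        return new String(err.readAllBytes(), charset); // readAllBytes requires Java 9+
    }
}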
Details
When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding issues are the responsibility of the flow designer, and NiFi has the ConvertCharacterSet processor to deal with those issues.
When writing to attributes, the API uses Java String objects, which are encoding agnostic (they represent Unicode codepoints, not bytes). Therefore, processors receiving bytes have to interpret them using an encoding.
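For example (jshell; both expressions are standard Java, nothing NiFi-specific):
"Ä".length()                                                  // 1 codepoint, no encoding involved
"Ä".getBytes(java.nio.charset.StandardCharsets.UTF_8).length  // 2 bytes once an encoding is chosen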
The ExecuteStreamCommand processor writes the output of the command (stdout) to the Output Destination Attribute (if set). To do that, it converts the bytes into a String using the system default encoding* by calling new String without an encoding argument:
https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499
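new String(byte[]) is documented to use Charset.defaultCharset(), so the call above is equivalent to passing the default charset explicitly (jshell):
byte[] bytes = "ÄÖÜäöüß".getBytes();  // also encodes with the default charset
new String(bytes).equals(new String(bytes, java.nio.charset.Charset.defaultCharset()))  // true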
When converting stderr to a String to write into the execution.error attribute, it uses this weird algorithm:
https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517
It reads individual bytes from the error stream (as ints) and casts them to chars. What Java does in this case is interpret the integer as a Unicode code point. For single bytes, this matches the ISO-8859-1 encoding. Instead, it should use the same decoding method as for stdout.
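The equivalence is easy to verify with a self-contained example (this is not the processor's code, just a demonstration of the same byte-to-char cast):
import java.nio.charset.StandardCharsets;

public class StderrDecodeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "ÄÖÜäöüß".getBytes(StandardCharsets.UTF_8);

        // Mimic the processor: treat each byte value (0..255) as a codepoint.
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append((char) (b & 0xFF));
        }

        // Casting byte values to chars is exactly an ISO-8859-1 decode.
        System.out.println(sb.toString().equals(new String(utf8, StandardCharsets.ISO_8859_1))); // true
    }
}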
Reproduction steps
These steps are for a Linux environment, but can be adapted with a different executable for Windows.
- Create the file /opt/nifi/data/encodingTest.sh (attached) with the following contents and make it executable:
The script writes identical data to both stdout and stderr. It contains non-ASCII characters to make the encoding issues visible.
#!/bin/bash
echo "|out static: ÄÖÜäöüß"
echo "|error static: ÄÖÜäöüß" >&2
echo "|out arg: $1"
echo "|error arg: $1" >&2
echo "|out arg hexdump:"
printf '%s' "$1" | od -A x -t x1z -v
echo "|error arg hexdump:" >&2
printf '%s' "$1" | od -A x -t x1z -v >&2
- Import the attached flow or create it manually:
- Run the GenerateFlowFile processor once and observe the attributes of the FlowFile in the final queue:
The output attribute (stdout) is correctly decoded. The execution.error attribute (stderr) contains garbled text (UTF-8 bytes interpreted as ISO-8859-1 and re-encoded in UTF-8).
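The same garbling can be reproduced outside NiFi (jshell; how the result renders depends on the terminal, since some of the resulting characters are C1 controls):
import java.nio.charset.StandardCharsets;
String garbled = new String("ÄÖÜäöüß".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
"ÄÖÜäöüß".getBytes(StandardCharsets.UTF_8).length  // 14 bytes written by the script
garbled.getBytes(StandardCharsets.UTF_8).length    // 28 bytes after the bogus decode/re-encode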
*On the system default encoding
The system default encoding is a property of the JVM. It is UTF-8 on Linux, but Windows-1252 (or a different codepage depending on locale) on Windows. It can be overridden using the file.encoding JVM argument on startup.
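Both values can be inspected on a running JVM, e.g. in jshell:
System.out.println(java.nio.charset.Charset.defaultCharset());  // e.g. UTF-8 on Linux
System.out.println(System.getProperty("file.encoding"));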
Relying on the system default encoding is dangerous and can lead to subtle bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).
In this case, it might make sense to use the system default encoding, as it concerns data passed between NiFi and another process running on the host system. Also, the ProcessBuilder class used to create the process always passes arguments in the system default encoding, and there doesn't seem to be a way to change that. This behavior should probably be documented.
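For illustration, a minimal, self-contained sketch of that consistent behavior (the script path is taken from the reproduction steps above; readAllBytes requires Java 9+; reading the streams sequentially is fine here only because the script's output is small):
import java.io.IOException;
import java.nio.charset.Charset;

public class ProcessEncodingDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // ProcessBuilder encodes the argument with the system default charset,
        // so decoding both streams with that same charset round-trips correctly.
        Process p = new ProcessBuilder("/opt/nifi/data/encodingTest.sh", "ÄÖÜäöüß").start();
        String out = new String(p.getInputStream().readAllBytes(), Charset.defaultCharset());
        String err = new String(p.getErrorStream().readAllBytes(), Charset.defaultCharset());
        p.waitFor();
        System.out.print(out);  // identical to err when both sides use the same charset
        System.out.print(err);
    }
}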