Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.25.0, 2.0.0-M2
- Environment: any
Description
Summary
The ExecuteStreamCommand processor stores everything the invoked command writes to the error stream (stderr) into the FlowFile attribute execution.error.
When converting the bytes from the stream to a String, it interprets each individual byte as a Unicode codepoint. Since the bytes are read one at a time, this effectively decodes the stream as ISO-8859-1 (Latin-1).
Instead, it should use the system default encoding (like it already does for writing stdout if Output Destination Attribute is set) or use a configurable encoding (for both stdout and stderr).
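As a rough sketch of that fix (not the actual patch; the helper name is made up, and the charset would come from either the JVM default or a new processor property):
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

class ErrorStreamSketch {
    // Decode stderr with one explicit charset instead of casting bytes to chars.
    // Charset.defaultCharset() would match the current stdout behavior; a
    // configurable property could be passed in instead.
    static String readErrorStream(InputStream err, Charset charset) throws IOException {
        return new String(err.readAllBytes(), charset); // readAllBytes requires Java 9+
    }
}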
Details
When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding issues are the responsibility of the flow designer, and NiFi has the ConvertCharacterSet processor to deal with those issues.
When writing to attributes, the API uses Java String objects, which are encoding agnostic (they represent Unicode codepoints, not bytes). Therefore, processors receiving bytes have to interpret them using an encoding.
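For example (jshell; both expressions are standard Java, nothing NiFi-specific):
"Ä".length()                                                  // 1 codepoint, no encoding involved
"Ä".getBytes(java.nio.charset.StandardCharsets.UTF_8).length  // 2 bytes once an encoding is chosen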
The ExecuteStreamCommand processor writes the output of the command (stdout) to the Output Destination Attribute (if set). To do that, it converts the bytes into a String using the system default encoding* by calling new String without an encoding argument:
https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499
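new String(byte[]) is documented to use Charset.defaultCharset(), so the call above is equivalent to passing the default charset explicitly (jshell):
byte[] bytes = "ÄÖÜäöüß".getBytes();  // also encodes with the default charset
new String(bytes).equals(new String(bytes, java.nio.charset.Charset.defaultCharset()))  // true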
When converting stderr to a String to write into the execution.error attribute, it uses this weird algorithm:
https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517
It reads individual bytes from the error stream (as ints) and casts them to chars. What Java does in this case is interpret the integer as a Unicode code point. For single bytes, this matches the ISO-8859-1 encoding. Instead, it should use the same decoding method as for stdout.
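The equivalence is easy to verify with a self-contained example (this is not the processor's code, just a demonstration of the same byte-to-char cast):
import java.nio.charset.StandardCharsets;

public class StderrDecodeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "ÄÖÜäöüß".getBytes(StandardCharsets.UTF_8);

        // Mimic the processor: treat each byte value (0..255) as a codepoint.
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append((char) (b & 0xFF));
        }

        // Casting byte values to chars is exactly an ISO-8859-1 decode.
        System.out.println(sb.toString().equals(new String(utf8, StandardCharsets.ISO_8859_1))); // true
    }
}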
Reproduction steps
These steps are for a Linux environment, but can be adapted with a different executable for Windows.
- Create the file /opt/nifi/data/encodingTest.sh (attached) with the following contents and make it executable:
The script writes identical data to both stdout and stderr. It contains non-ASCII characters to make the encoding issues visible.
#!/bin/bash
echo "|out static: ÄÖÜäöüß"
echo "|error static: ÄÖÜäöüß" >&2
echo "|out arg: $1"
echo "|error arg: $1" >&2
echo "|out arg hexdump:"
printf '%s' "$1" | od -A x -t x1z -v
echo "|error arg hexdump:" >&2
printf '%s' "$1" | od -A x -t x1z -v >&2
- Import the attached flow or create it manually:
- Run the GenerateFlowFile processor once and observe the attributes of the FlowFile in the final queue:
The output attribute (stdout) is correctly decoded. The execution.error attribute (stderr) contains garbled text (UTF-8 bytes interpreted as ISO-8859-1 and re-encoded in UTF-8).
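The same garbling can be reproduced outside NiFi (jshell; how the result renders depends on the terminal, since some of the resulting characters are C1 controls):
import java.nio.charset.StandardCharsets;
String garbled = new String("ÄÖÜäöüß".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
"ÄÖÜäöüß".getBytes(StandardCharsets.UTF_8).length  // 14 bytes written by the script
garbled.getBytes(StandardCharsets.UTF_8).length    // 28 bytes after the bogus decode/re-encode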
*On the system default encoding
The system default encoding is a property of the JVM. It is UTF-8 on Linux, but Windows-1252 (or a different codepage depending on locale) on Windows. It can be overridden using the file.encoding JVM argument on startup.
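Both values can be inspected on a running JVM, e.g. in jshell:
System.out.println(java.nio.charset.Charset.defaultCharset());  // e.g. UTF-8 on Linux
System.out.println(System.getProperty("file.encoding"));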
Relying on the system default encoding is dangerous and can lead to subtle bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).
In this case, it might make sense to use the system default encoding, as it concerns data passed between NiFi and another process running on the host system. Also, the ProcessBuilder class used to create the process always passes arguments in the system default encoding, and there doesn't seem to be a way to change that. This behavior should probably be documented.
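For illustration, a minimal, self-contained sketch of that consistent behavior (the script path is taken from the reproduction steps above; readAllBytes requires Java 9+; reading the streams sequentially is fine here only because the script's output is small):
import java.io.IOException;
import java.nio.charset.Charset;

public class ProcessEncodingDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // ProcessBuilder encodes the argument with the system default charset,
        // so decoding both streams with that same charset round-trips correctly.
        Process p = new ProcessBuilder("/opt/nifi/data/encodingTest.sh", "ÄÖÜäöüß").start();
        String out = new String(p.getInputStream().readAllBytes(), Charset.defaultCharset());
        String err = new String(p.getErrorStream().readAllBytes(), Charset.defaultCharset());
        p.waitFor();
        System.out.print(out);  // identical to err when both sides use the same charset
        System.out.print(err);
    }
}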