[PDFBOX-4539] Cache CharsetDecoder - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.14
Fix Version/s: 2.0.16, 3.0.0 PDFBox
Component/s: Parsing
Labels:
- Optimization
- performance

Description

We were using PDFBox to parse and process a large number of PDFs, which could contain thousands of pages in total, so performance mattered to us.

We'd therefore like to suggest caching the CharsetDecoder, which is currently instantiated on every call to `isValidUTF8(byte[])`.

Our suggestion in BaseParser.java:

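// CharsetDecoder is stateful and not safe for concurrent use, so one decoder
// is cached per parser instance rather than shared through a static field.
private final CharsetDecoder csUTF_8 = Charsets.UTF_8.newDecoder();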

/**
 * Returns true if a byte sequence is valid UTF-8.
 */
private boolean isValidUTF8(byte[] input)
{
    try
    {
        csUTF_8.decode(ByteBuffer.wrap(input));
        return true;
    }
    catch (CharacterCodingException e)
    {
        return false;
    }
}
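
To try the idea outside PDFBox, here is a minimal stand-alone sketch. It assumes a hypothetical class name (`Utf8Validator`) and uses the JDK's `StandardCharsets` in place of PDFBox's `Charsets` helper; it is an illustration of reusing one decoder, not the code that was committed for this issue.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone helper for illustration; not part of PDFBox. */
public class Utf8Validator
{
    // One decoder per validator instance. A decoder from newDecoder() reports
    // malformed input by throwing, instead of silently replacing it.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

    /** Returns true if the byte sequence is valid UTF-8. */
    public boolean isValidUTF8(byte[] input)
    {
        try
        {
            // decode(ByteBuffer) resets the decoder before decoding, so the
            // same instance can be reused across calls, even after a failure.
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        }
        catch (CharacterCodingException e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        Utf8Validator validator = new Utf8Validator();
        byte[] valid = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = { (byte) 0xE9 }; // ISO-8859-1 'é', a truncated sequence in UTF-8

        System.out.println(validator.isValidUTF8(valid));   // prints "true"
        System.out.println(validator.isValidUTF8(invalid)); // prints "false"
        System.out.println(validator.isValidUTF8(valid));   // prints "true"
    }
}
```

Because a `CharsetDecoder` keeps internal state, a validator like this should not be shared between threads; giving each parser (or each thread) its own instance keeps the saving without introducing a race.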

Issue Links

relates to: PDFBOX-3347 COSName parsing doesn't handle ISO-8859-1 encoded bytes (Closed)

People

Assignee: Tilman Hausherr

Reporter: Jonathan

Votes: 0

Watchers: 5

Dates

Created: 09/May/19 11:31

Updated: 28/Jun/19 04:39

Resolved: 09/May/19 13:27