  1. PDFBox
  2. PDFBOX-1256

[PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base class


Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • 1.7.0, 2.0.0
    • None
    • PDModel
    • N/A

    Description

      The attached patch restructures PDFStreamEngine to move the basic functionality of invoking callbacks for each operator in a stream into a parent class. The parent class knows nothing about the meaning of operators; it just invokes handlers with the accumulated arguments whenever it encounters an operator. PDFStreamEngine retains all the "knowledge" of what those operators mean, the state of the graphics state stack, etc.

      The purpose of the change is to make it simpler and easier to use PDFBox's PDF stream processor/parser code without dealing with the full features of PDFStreamEngine (its built-in operator handlers, awareness of the graphics stack, etc.) when that functionality isn't required. Specifically, I needed to write a tool that copies a PDF stream, renaming resource references as it goes but otherwise leaving it unchanged. I wanted to handle all operators, including future or unknown ones, and only needed to special-case a couple of them. PDFStreamEngine was poorly suited to that because it doesn't support a default handler fallback, tries to "understand" the stream, etc. Rather than write a new class that duplicated much of PDFStreamEngine, I thought I'd try to factor the required functionality out so others could use it too.
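
      To make the use case concrete, here is a minimal, self-contained sketch of an operator dispatcher with a default (fallback) handler, used to copy a stream while renaming resource references. It does not use PDFBox and is not the code from the patch: SketchHandler, SketchProcessor, RenameSketch and the pre-split string tokens are all hypothetical names and simplifications chosen for illustration. In particular, the operand test only recognizes names and plain numbers; a real content stream needs a proper parser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical one-method callback; NOT the interface from the patch. */
interface SketchHandler {
    void handle(SketchProcessor processor, String operator, List<String> operands);
}

/** Hypothetical processor: accumulates operands and dispatches on each operator. */
class SketchProcessor {
    private final Map<String, SketchHandler> handlers = new HashMap<>();
    private SketchHandler defaultHandler = (p, op, args) -> { };
    private final StringBuilder out = new StringBuilder();

    void register(String operator, SketchHandler handler) { handlers.put(operator, handler); }

    void setDefaultHandler(SketchHandler handler) { defaultHandler = handler; }

    /** Writes "operand operand ... operator" followed by a newline. */
    void emit(String operator, List<String> operands) {
        for (String operand : operands) {
            out.append(operand).append(' ');
        }
        out.append(operator).append('\n');
    }

    String output() { return out.toString(); }

    /** Simplification: only names ("/F1") and plain numbers count as operands. */
    static boolean isOperand(String token) {
        return token.startsWith("/") || token.matches("-?\\d*\\.?\\d+");
    }

    void process(List<String> tokens) {
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            // Unknown operators fall through to the default handler.
            handlers.getOrDefault(token, defaultHandler).handle(this, token, operands);
            operands = new ArrayList<>(); // fresh list; the handler may keep the old one
        }
    }
}

/** Copies a token stream, renaming the resource named by Tf and Do operators. */
class RenameSketch {
    static String rename(List<String> tokens, Map<String, String> renames) {
        SketchProcessor processor = new SketchProcessor();
        // Everything that is not special-cased is copied through unchanged.
        processor.setDefaultHandler((p, op, args) -> p.emit(op, args));
        SketchHandler renameFirst = (p, op, args) -> {
            List<String> renamed = new ArrayList<>(args);
            if (!renamed.isEmpty()) {
                renamed.set(0, renames.getOrDefault(renamed.get(0), renamed.get(0)));
            }
            p.emit(op, renamed);
        };
        processor.register("Tf", renameFirst);
        processor.register("Do", renameFirst);
        processor.process(tokens);
        return processor.output();
    }
}
```

      With the tokens /F1 12 Tf q /Im1 Do Q and a rename map of /F1 to /F7, only the Tf line changes; q, Do and Q pass through the default handler untouched, which is the behaviour a fallback handler is meant to provide.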

      The changes should be backward compatible with existing code that uses PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were required for the test suite to pass, text extraction tool to work, etc. Nonetheless, it's possible you'll only consider these changes for inclusion in PDFBox 2.0, in which case they can be cleaned up to remove some of the backward compatibility crap that's currently in them. Let me know.

      In terms of open issues or TODOs, the class naming could probably use work. I can't rename PDFStreamEngine or OperatorProcessor for backward compatibility reasons, so I've had to come up with more contrived names than I'd like.

      The logic of the changes is:

      • Move content stream argument accumulation and operator callback functionality into new PDFStreamProcessor class
      • Add support for a default (fallback) handler to PDFStreamProcessor so operators not explicitly matched may be handled
      • Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its existing methods though some are now inherited.
      • Deprecate the properties-map based configuration of PDFStreamEngine because it'll be fragile whenever more than one classloader is in use. Add PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent functionality using a type-safe, multi-classloader-safe HashMap of operator names to handler instances. This isn't added as a constructor overload because the operator handler registration/unregistration methods are not final (to preserve compatibility with PDFStreamEngine), and if they are overridden, the override might use data from a not-yet-initialized derived class. If a constructor overload is required then registerOperatorProcessor must be made final, breaking BC with PDFStreamEngine.
      • Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). Instances of this are bound to a particular PDFStreamEngine via the `context' property and they carry state when they don't have to. OperatorProcessor is also an abstract class, so handlers can't extend any other class. OperatorProcessor based handlers continue to be supported just fine via a simple wrapper that's used automatically where required.
      • Introduce new PDFStreamProcessor.OperatorHandler interface to replace OperatorProcessor. It's a simple one-method interface that passes the PDFStreamProcessor as an argument, so application designers are free to choose whether to tie their OperationProcessorHandler implementations to PDFStreamProcessor instances or to re-use the same handler on many different processors. This change is useful for my app and removes unnecessary stateful API, but it isn't strictly necessary and can be dropped while retaining the PDFStreamEngine / PDFStreamProcessor split. As part of the API change, new-interface handlers are passed the original arguments array rather than a copy; if they want a copy of the arguments array they have to take it themselves, so that resources aren't wasted copying the array when handlers don't actually need it copied.
      • Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor implementations are wrapped in a helper that translates OperatorProcessorHandler interface usage to the usage required by OperatorProcessor. All the wrapper does is set the context (which PDFStreamEngine seems to do before every handler call) then pass a copy of the arguments array.
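
      The wrapper described in the last bullet can be sketched roughly as follows. This is an illustration only, not the code in the patch and not PDFBox's API: EngineSketch, NewStyleHandler, LegacyHandler and LegacyAdapter are hypothetical stand-ins, and operands are plain strings. The point is the shape: the new-style interface receives the processor on every call and gets the original operand list, while the adapter sets the legacy handler's context and hands it a copy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the processor/engine; holds only what the sketch needs. */
class EngineSketch {
    final List<String> log = new ArrayList<>();
}

/** Hypothetical new-style handler: one method, processor passed as an argument. */
interface NewStyleHandler {
    void handle(EngineSketch engine, String operator, List<String> operands);
}

/** Hypothetical old-style handler: an abstract class bound to an engine via a context field. */
abstract class LegacyHandler {
    protected EngineSketch context;

    void setContext(EngineSketch context) { this.context = context; }

    abstract void process(String operator, List<String> operands);
}

/** Adapter that lets an old-style handler be registered where a new-style one is expected. */
class LegacyAdapter implements NewStyleHandler {
    private final LegacyHandler legacy;

    LegacyAdapter(LegacyHandler legacy) { this.legacy = legacy; }

    @Override
    public void handle(EngineSketch engine, String operator, List<String> operands) {
        // Old-style handlers expect their context to be set before each call ...
        legacy.setContext(engine);
        // ... and expect a private copy of the operands they are free to modify.
        legacy.process(operator, new ArrayList<>(operands));
    }
}
```

      Because the adapter resets the context on each call, one wrapped legacy handler can serve several engines in turn, although it still carries state and so is not safe to share across threads. New-style handlers avoid the copy entirely unless they choose to make one.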

      I'm aware that this is a non-trivial change I'm proposing, but I think it significantly improves the API (especially once the BC stuff can be removed for PDFBox 2.0) and makes it easier to use this functionality.

      Prior patch in series (should be independent of this one): https://issues.apache.org/jira/browse/PDFBOX-1255
      Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263

          People

            Unassigned Unassigned
            ringerc Craig Ringer