Apache NiFi / NIFI-12619

Unable to instantiate ParseDocument processor

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-M1
    • Fix Version/s: 2.0.0-M2
    • Component/s: Extensions
    • Labels: None
    • Environment: Python 3.11.6

    Description

      Trying to instantiate the Python processor ParseDocument, I get this error:

      2024-01-16 18:00:00,573 INFO org.apache.nifi.py4j.ExtensionManager Importing dependencies ['langchain', 'unstructured', 'unstructured-inference', 'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'] for ParseDocument to /Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT using command ['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3', '-m', 'pip', 'install', '--no-cache-dir', '--target', '/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT', 'langchain', 'unstructured', 'unstructured-inference', 'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx']
      2024-01-16 18:00:15,752 ERROR py4j.java_gateway There was an exception while executing the Python Proxy on the Python Side.
      Traceback (most recent call last):
        File "/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
          return_value = getattr(self.pool[obj_id], method)(*params)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./python/framework/Controller.py", line 72, in downloadDependencies
          self.extensionManager.import_external_dependencies(processor_details, work_dir)
        File "/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/ExtensionManager.py", line 511, in import_external_dependencies
          raise RuntimeError(f"Failed to import requirements for {class_name}: process exited with status code {result}")
      RuntimeError: Failed to import requirements for ParseDocument: process exited with status code CompletedProcess(args=['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3', '-m', 'pip', 'install', '--no-cache-dir', '--target', '/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT', 'langchain', 'unstructured', 'unstructured-inference', 'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'], returncode=1) 
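
      The error above embeds the whole CompletedProcess object and reports only returncode=1, so pip's own message does not appear in the log. As a debugging aid, a command can be re-run with its stderr captured. The following is a minimal sketch; run_and_report is a hypothetical helper and not part of NiFi's ExtensionManager, and the child process is only a stand-in for the failing pip call.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command (list of arguments, no shell) and return (returncode, stderr)."""
    # capture_output=True keeps stderr so the real failure reason is not lost.
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


if __name__ == "__main__":
    # Stand-in for the failing pip call: a child process that exits with an error.
    code, err = run_and_report(
        [sys.executable, "-c", "import sys; sys.exit('simulated pip failure')"]
    )
    print(f"exit status {code}: {err}")
```

      Passing the arguments as a list, as the logged command does, means no shell is involved and each argument reaches pip unchanged.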

      When I run the pip command manually, I get:

      no matches found: pdfminer.six[image] 

      Changing the required dependency to just pdfminer.six fixes the issue and I can instantiate the processor.
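
      A note on the manual run: "no matches found" is the message zsh prints when an unquoted pattern such as [image] matches no file, so that particular output may come from the shell rather than from pip. This assumes the command was typed into zsh, which the report does not state. Under that assumption, a sketch of building a copy-pasteable command with every argument quoted could look like this; build_pip_command is a hypothetical helper and the target directory is a placeholder.

```python
import shlex


def build_pip_command(requirements, target_dir):
    """Return a shell-safe command string for installing requirements into target_dir."""
    args = [
        "python3", "-m", "pip", "install",
        "--no-cache-dir", "--target", target_dir,
        *requirements,
    ]
    # shlex.join quotes any argument containing shell metacharacters such as [ and ].
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder target directory, not the path from the log.
    print(build_pip_command(["pdfminer.six[image]", "numpy"], "./work/example"))
```

      With quoting in place, the extras marker reaches pip intact when the command is pasted into a shell.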

      However, when I then try to use it against a PDF file, I get:

      ModuleNotFoundError: No module named 'pikepdf' 
      ModuleNotFoundError: No module named 'pypdf'
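
      Missing modules like these can be detected before a flow runs. The following is a minimal sketch using only the standard library; find_missing_modules is a hypothetical helper, and it would need to run with the same interpreter and search path that the processor uses.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    # find_spec returns None when no finder can locate a top-level module.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The two modules named in the errors above.
    print(find_missing_modules(["pikepdf", "pypdf"]))
```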

      After adding the above dependencies, I get:

      pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH? 

      Based on https://pdf2image.readthedocs.io/en/latest/installation.html, poppler needs to be installed separately. I installed it with brew on my local instance. It is probably worth adding this to the docs if doable. This is specified in the description of the processor.

      At this point I was able to use the processor to parse a PDF file.
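
      Since poppler is a system package rather than a Python dependency, pip cannot install it. pdf2image relies on poppler's command-line tools, for example pdfinfo to read the page count, so a quick check is whether those executables are on PATH. The following is a minimal sketch; find_missing_poppler_tools is a hypothetical helper, and the default tool list is an assumption about which executables matter for this processor.

```python
import shutil


def find_missing_poppler_tools(tools=("pdfinfo", "pdftoppm"), search_path=None):
    """Return the poppler executables that cannot be found on the search path."""
    # search_path=None makes shutil.which fall back to the PATH environment variable.
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]


if __name__ == "__main__":
    missing = find_missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("poppler tools found")
```

      The check has to see the same PATH as the NiFi process, which can differ from the PATH of an interactive shell.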

       

            People

              Assignee: Pierre Villard (pvillard)
              Reporter: Pierre Villard (pvillard)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m