  1. PDFBox
  2. PDFBOX-1874

PDFTextStripper.isParagraphSeparation(...)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.8, 2.0.0
    • Fix Version/s: 1.8.9, 2.0.0
    • Component/s: Text extraction
    • Eclipse

    Description

      PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it evaluates the vertical (Y) gap between lines of text.

      PROBLEM:
      I believe the issue is due to floating-point precision in the following logic:

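                  // Vertical distance between the current and the previous text position.
                  float yGap = Math.abs(position.getTextPosition().getYDirAdj()
                          - lastPosition.getTextPosition().getYDirAdj());
                  // Horizontal offset relative to the start position of the last line.
                  float xGap = (position.getTextPosition().getXDirAdj()
                          - lastLineStartPosition.getTextPosition().getXDirAdj());

                  if (yGap > (getDropThreshold() * maxHeightForLine))
                  {
                      result = true;
                  }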

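      yGap is precise to roughly the third decimal place, while (getDropThreshold()*maxHeightForLine) is precise to roughly the fifth. This results in comparisons between values such as (example):
      16.018 > 16.018005
      The result of such a comparison hinges on digits beyond the third decimal place rather than on the actual line spacing. Two gaps that are nominally equal (16.018 > 16.018 would evaluate to "false") can therefore compare as "true".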
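The sensitivity described above can be reproduced in isolation. The following is only a minimal sketch, not the PDFBox code and not necessarily how the issue was fixed: the class name, the method names, and the tolerance of 0.01 are assumptions chosen for illustration. It shows a strict float comparison flipping on a one-ulp difference, and a tolerance-based comparison that ignores it.

```java
final class GapComparison {
    private GapComparison() {}

    /** Strict comparison, the same shape as the test in the excerpt above. */
    static boolean exceedsStrict(float yGap, float threshold) {
        return yGap > threshold;
    }

    /**
     * Hypothetical variant: yGap must exceed the threshold by more than
     * the given tolerance before it counts as a drop.
     */
    static boolean exceedsWithTolerance(float yGap, float threshold, float tolerance) {
        return yGap - threshold > tolerance;
    }

    public static void main(String[] args) {
        float threshold = 16.018f;
        // Smallest float strictly greater than the threshold (one ulp above it).
        float yGap = Math.nextUp(threshold);

        System.out.println(exceedsStrict(yGap, threshold));               // true
        System.out.println(exceedsWithTolerance(yGap, threshold, 0.01f)); // false
        // A genuinely larger gap is still detected.
        System.out.println(exceedsWithTolerance(32.0f, threshold, 0.01f)); // true
    }
}
```

A suitable tolerance would depend on the font sizes and coordinate scale of the documents being processed, so 0.01 should be read as a placeholder.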

      EFFECT OF THE PROBLEM:
      Every line in the output is marked as "isParagraphStart = true" and "writeParagraphEnd() ... = true".
      For example:

      NEW_LINE
      PARAGRAPH_START PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data NEW_LINE

      contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,|||NEW_LINE|||

      PARAGRAPH_END NEW_LINE
      PARAGRAPH_START strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the NEW_LINE

      COS Model). While it's possible to create any desired interactions with a PDF document using only these|||NEW_LINE|||

      PARAGRAPH_END NEW_LINE

      In the source PDF these lines appear as follows:
      "PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data
      contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,
      strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the
      COS Model). While it's possible to create any desired interactions with a PDF document using only these"

      MY WORKAROUND:
      NOTE: This workaround carries a small performance cost.

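                  // Requires: import java.text.DecimalFormat;
                  float yGap = Math.abs(position.getTextPosition().getYDirAdj()
                          - lastPosition.getTextPosition().getYDirAdj());

                  // Bring both operands to two decimal places before comparing them.
                  // Note: DecimalFormat rounds (HALF_EVEN by default); it does not truncate.
                  DecimalFormat df = new DecimalFormat("#.00");
                  float yGapTruncated = Float.valueOf(df.format(yGap));

                  float newYVal = Float.valueOf(df.format(getDropThreshold()
                          * maxHeightForLine));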
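
One caveat with the workaround: DecimalFormat formats with the default locale's decimal separator, so on a system whose locale uses a comma, Float.valueOf(...) would throw a NumberFormatException. A possible alternative that avoids the string round trip is sketched below. It is an illustration only: the class and method names are hypothetical, the drop threshold of 1.0 is a made-up value for the demo, and this is not presented as the change that was committed to PDFBox.

```java
final class RoundedGapCheck {
    private RoundedGapCheck() {}

    /** Rounds to two decimal places arithmetically, independent of locale. */
    static float roundToHundredths(float value) {
        return Math.round(value * 100f) / 100f;
    }

    /**
     * Hypothetical stand-in for the Y-gap test: both sides are rounded
     * to the same precision before the strict comparison.
     */
    static boolean isDrop(float yGap, float dropThreshold, float maxHeightForLine) {
        return roundToHundredths(yGap) > roundToHundredths(dropThreshold * maxHeightForLine);
    }

    public static void main(String[] args) {
        // Values that differ only past the third decimal place compare as equal.
        System.out.println(isDrop(16.018005f, 1.0f, 16.018f)); // false
        // A clearly larger gap still counts as a drop.
        System.out.println(isDrop(32.0f, 1.0f, 16.018f));      // true
    }
}
```

Rounding both operands keeps the comparison strict while removing the dependence on trailing digits, and it avoids allocating a formatter and strings for every text position.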
      

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Yuri Burrows (YuriBurrows@gmail.com)