Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-4875

Lazy load standard 14 fonts, only if needed

Details

    Description

      I am testing text extraction from PDF and profiling the execution.

      I found that the second biggest time consumer is the static code in Standard14Fonts that loads fonts from the pdf box jar.

      Looking at the code I realized we don't have to load all fonts statically, when the class loads.

      Not all PDFs need all fonts, so, if we lazy loaded them, only when needed, it will save some time and some memory.

      The memory part in particular would be important when running on a tablet or a phone, where the entire memory space of the app is 80M - 160M.

      Attachments

        1. PDFBOX-4875.patch
          11 kB
          Alfred

        Activity

          Faltiska Alfred added a comment -

          thank you!

          Faltiska Alfred added a comment - thank you!

          Sonar is satisfied so we're done here, thanks!

          tilman Tilman Hausherr added a comment - Sonar is satisfied so we're done here, thanks!

          Commit 1878805 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
          [ https://svn.apache.org/r1878805 ]

          PDFBOX-4875: synchronize on FONTS instead of on parameter

          jira-bot ASF subversion and git services added a comment - Commit 1878805 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1878805 ] PDFBOX-4875 : synchronize on FONTS instead of on parameter

          Commit 1878804 from Tilman Hausherr in branch 'pdfbox/branches/issue45'
          [ https://svn.apache.org/r1878804 ]

          PDFBOX-4875: synchronize on FONTS instead of on parameter

          jira-bot ASF subversion and git services added a comment - Commit 1878804 from Tilman Hausherr in branch 'pdfbox/branches/issue45' [ https://svn.apache.org/r1878804 ] PDFBOX-4875 : synchronize on FONTS instead of on parameter

          No problem. I'm running Sonar again to see if it is satisfied. My IDE (Netbeans, older version) still complains about double-checked locking but I thought this is exactly what we need.

          tilman Tilman Hausherr added a comment - No problem. I'm running Sonar again to see if it is satisfied. My IDE (Netbeans, older version) still complains about double-checked locking but I thought this is exactly what we need.

          Commit 1878803 from Tilman Hausherr in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1878803 ]

          PDFBOX-4875: synchronize on FONTS instead of on parameter

          jira-bot ASF subversion and git services added a comment - Commit 1878803 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1878803 ] PDFBOX-4875 : synchronize on FONTS instead of on parameter
          Faltiska Alfred added a comment -

          Ok, let's sync on FONTS.

          Sorry about the extra work...

          Faltiska Alfred added a comment - Ok, let's sync on FONTS. Sorry about the extra work...
          tilman Tilman Hausherr added a comment - - edited

          SonarCloud complains about the synchronization, doesn't like strings because they're pooled. However I think there is another problem: the code synchronizes on different objects. But we do a concurrent access on FONTS, which is not protected against concurrent access by different font names.

          I'd propose to synchronize on FONTS instead.

          tilman Tilman Hausherr added a comment - - edited SonarCloud complains about the synchronization, doesn't like strings because they're pooled. However I think there is another problem: the code synchronizes on different objects. But we do a concurrent access on FONTS, which is not protected against concurrent access by different font names. I'd propose to synchronize on FONTS instead.

          Commit 1878798 from Tilman Hausherr in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1878798 ]

          PDFBOX-4875: modify exception throwing sequence so that it is similar to 2.0

          jira-bot ASF subversion and git services added a comment - Commit 1878798 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1878798 ] PDFBOX-4875 : modify exception throwing sequence so that it is similar to 2.0

          Commit 1878797 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
          [ https://svn.apache.org/r1878797 ]

          PDFBOX-4875: lazy loading of standard 14 fonts, by Alfred Faltiska

          jira-bot ASF subversion and git services added a comment - Commit 1878797 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1878797 ] PDFBOX-4875 : lazy loading of standard 14 fonts, by Alfred Faltiska

          Commit 1878796 from Tilman Hausherr in branch 'pdfbox/branches/issue45'
          [ https://svn.apache.org/r1878796 ]

          PDFBOX-4875: lazy loading of standard 14 fonts, by Alfred Faltiska

          jira-bot ASF subversion and git services added a comment - Commit 1878796 from Tilman Hausherr in branch 'pdfbox/branches/issue45' [ https://svn.apache.org/r1878796 ] PDFBOX-4875 : lazy loading of standard 14 fonts, by Alfred Faltiska

          Commit 1878795 from Tilman Hausherr in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1878795 ]

          PDFBOX-4875: lazy loading of standard 14 fonts, by Alfred Faltiska

          jira-bot ASF subversion and git services added a comment - Commit 1878795 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1878795 ] PDFBOX-4875 : lazy loading of standard 14 fonts, by Alfred Faltiska
          Faltiska Alfred added a comment -

          I have updated the patch: 
          PDFBOX-4875.patch
          and uploaded it for review: 
          https://diffy.org/diff/uld8r5vjeoipbnih5x38a1yvi

          Faltiska Alfred added a comment - I have updated the patch:  PDFBOX-4875.patch and uploaded it for review:  https://diffy.org/diff/uld8r5vjeoipbnih5x38a1yvi

          I absolutely agree re "A bit here, a bit there". We have done this a lot for both space and speed. I'll commit your change after backporting your other change to the 2.* branch. In the meantime, please resubmit it, because the trunk has changed.

          tilman Tilman Hausherr added a comment - I absolutely agree re "A bit here, a bit there". We have done this a lot for both space and speed. I'll commit your change after backporting your other change to the 2.* branch. In the meantime, please resubmit it, because the trunk has changed.
          Faltiska Alfred added a comment -

          I don't want to annoy you guys insisting.
          I understand pdfbox for android is a different project.

          The memory savings are probably still important even we we're not targetting devices..
          I saw 3 memory related bugs only in the past 2 days: PDFBOX-4690PDFBOX-4718PDFBOX-4862
          If we can save some memory by only loading the fonts that are actually required, it will help a bit.

          A bit here, a bit there, it adds up ...

          Faltiska Alfred added a comment - I don't want to annoy you guys insisting. I understand pdfbox for android is a different project. The memory savings are probably still important even we we're not targetting devices.. I saw 3 memory related bugs only in the past 2 days:  PDFBOX-4690 ,  PDFBOX-4718 ,  PDFBOX-4862 If we can save some memory by only loading the fonts that are actually required, it will help a bit. A bit here, a bit there, it adds up ...
          Faltiska Alfred added a comment - - edited

          https://diffy.org/diff/uld8r5vjeoipbnih5x38a1yvi

           

          All existing tests pass, including the extra TextStripper input files.

           

          Faltiska Alfred added a comment - - edited https://diffy.org/diff/uld8r5vjeoipbnih5x38a1yvi   All existing tests pass, including the extra TextStripper input files.  
          Faltiska Alfred added a comment -

          Ok, good point.

          I will make the changes to avoid loading fonts multiple times because of multi threading. 

          I will submit for review, if you think it's not safe, we don't have to do this.

          Faltiska Alfred added a comment - Ok, good point. I will make the changes to avoid loading fonts multiple times because of multi threading.  I will submit for review, if you think it's not safe, we don't have to do this.

          PDFBOX can run multithreaded but not on the same document. But the standard 14 fonts are used by all so you're getting into dangerous waters.

          tilman Tilman Hausherr added a comment - PDFBOX can run multithreaded but not on the same document. But the standard 14 fonts are used by all so you're getting into dangerous waters.
          Faltiska Alfred added a comment -

          I thought PDFBox only works single threaded.

          I tried several times to run at least some parts in parallel and always ran into exceptions.

          Should I treat it as multi threaded?  

          Faltiska Alfred added a comment - I thought PDFBox only works single threaded. I tried several times to run at least some parts in parallel and always ran into exceptions. Should I treat it as multi threaded?  
          tilman Tilman Hausherr added a comment - - edited

          Re "tablet or phone", PDFBox for Android is a different project, they may or may not use the changes. So this can help only if such mobile devices are running windows.

          tilman Tilman Hausherr added a comment - - edited Re "tablet or phone", PDFBox for Android is a different project, they may or may not use the changes. So this can help only if such mobile devices are running windows.

          Be aware that this might be risky re: concurrency.

          tilman Tilman Hausherr added a comment - Be aware that this might be risky re: concurrency.
          Faltiska Alfred added a comment -

          I already have the code written.

          I will test it again and submit a patch soon.

          Faltiska Alfred added a comment - I already have the code written. I will test it again and submit a patch soon.

          People

            tilman Tilman Hausherr
            Faltiska Alfred
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 1m
                1m
                Remaining:
                Remaining Estimate - 1m
                1m
                Logged:
                Time Spent - Not Specified
                Not Specified