  1. Commons Compress
  2. COMPRESS-623

make ZipFile's getRawInputStream usable when local headers are not read


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.23.0

    Description

      I have a somewhat odd use case: gigabytes of ZIP files, each containing thousands of documents, stored on comparatively slow network drives. We need to restructure these ZIPs without recompressing the files.

      This works almost perfectly with the raw-data copying that ZipFile offers, but empirical tests showed a major slowdown when initially opening ZIP files, caused by the multiple reads and seeks needed for local file headers. If the option to ignore those headers is passed, the raw streams become inaccessible.

      I've looked at the code, and getRawInputStream could do essentially what getInputStream already does: lazily load the missing offset via getDataOffset(ZipEntry). In fact, getInputStream could simply call getRawInputStream, which would remove some code duplication.
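
      To illustrate the lazy-offset idea, here is a minimal, self-contained sketch. It is not the Commons Compress code: the class LazyOffsetSketch, its Entry type, and the helpers getRawBytes and storedZip are hypothetical names made up for this example, and it works on an in-memory byte array built with java.util.zip rather than on a real ZipFile. It relies only on the ZIP local file header layout (30 fixed bytes, with the name length at byte 26 and the extra-field length at byte 28, both little-endian).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; names and structure are assumptions,
// not the actual ZipFile implementation.
public class LazyOffsetSketch {
    public static final int UNKNOWN = -1;
    static final int LFH_FIXED_SIZE = 30;    // fixed part of a local file header
    static final int LFH_NAME_LEN_POS = 26;  // 2-byte little-endian file name length
    static final int LFH_EXTRA_LEN_POS = 28; // 2-byte little-endian extra field length

    // Counts how often a local file header had to be parsed.
    public static int headerReads = 0;

    // What is known from the central directory alone: where the local
    // header starts and how many compressed bytes follow it.
    public static final class Entry {
        final int localHeaderOffset;
        final int compressedSize;
        int dataOffset = UNKNOWN; // resolved on first access

        public Entry(int localHeaderOffset, int compressedSize) {
            this.localHeaderOffset = localHeaderOffset;
            this.compressedSize = compressedSize;
        }
    }

    static int readUnsignedShortLE(byte[] b, int pos) {
        return (b[pos] & 0xFF) | ((b[pos + 1] & 0xFF) << 8);
    }

    // Parses the local file header only if the offset is still unknown.
    public static int getDataOffset(byte[] archive, Entry e) {
        if (e.dataOffset == UNKNOWN) {
            headerReads++;
            int nameLen = readUnsignedShortLE(archive, e.localHeaderOffset + LFH_NAME_LEN_POS);
            int extraLen = readUnsignedShortLE(archive, e.localHeaderOffset + LFH_EXTRA_LEN_POS);
            e.dataOffset = e.localHeaderOffset + LFH_FIXED_SIZE + nameLen + extraLen;
        }
        return e.dataOffset;
    }

    // Raw access goes through getDataOffset, so it works even when no
    // local headers were read up front.
    public static byte[] getRawBytes(byte[] archive, Entry e) {
        int start = getDataOffset(archive, e);
        return Arrays.copyOfRange(archive, start, start + e.compressedSize);
    }

    // Builds a one-entry archive using the STORED method, so the raw bytes
    // equal the original content and the first local header is at offset 0.
    public static byte[] storedZip(String name, byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content);
        ZipEntry ze = new ZipEntry(name);
        ze.setMethod(ZipEntry.STORED);
        ze.setSize(content.length);
        ze.setCompressedSize(content.length);
        ze.setCrc(crc.getValue());
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(ze);
            zos.write(content);
            zos.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "hello raw stream".getBytes(StandardCharsets.UTF_8);
        byte[] archive = storedZip("doc.txt", content);
        Entry entry = new Entry(0, content.length);
        System.out.println(new String(getRawBytes(archive, entry), StandardCharsets.UTF_8));
        getRawBytes(archive, entry); // second access reuses the cached offset
        System.out.println("header reads: " + headerReads);
    }
}
```

      The point of the sketch is that the header is parsed only for entries that are actually accessed, and only once per entry, which is where I would expect the savings on slow drives to come from.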

      I see speedups on the order of 3-4x for opening and copying random raw streams, and all current tests pass. I filed a PR on GitHub and am happy to discuss it there.

      https://github.com/apache/commons-compress/pull/306

          People

            Assignee: Unassigned
            Reporter: Dawid Weiss (dweiss)
            Votes: 0
            Watchers: 2

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 2h 20m