Listing the contents of a remote ZIP archive, without downloading the entire file

Why I needed to do this is a longer story, but this was a question I was looking for an answer to.

Initially it led me to the following SO question:

Is it possible to download just part of a ZIP archive (e.g. one file)?

Not exactly the same problem but close enough. The suggestion here is to mount the archive using HTTPFS and use normal zip tools. The part of the answer that caught my eye was this:

This way the unzip utility’s I/O calls are translated to HTTP range gets
https://stackoverflow.com/a/15321699

HTTP range requests are a clever way to get a web server to only send you parts of a file. It requires that the web server supports it though. You can check if this is the case with a simple curl command. Look for accept-ranges: bytes.

I’ve added a simple test archive, with some garbage content files, as a test subject here:

$ curl --head https://rhardih.io/wp-content/uploads/2021/04/test.zip
HTTP/2 200
date: Sun, 18 Apr 2021 14:01:29 GMT
content-type: application/zip
content-length: 51987
set-cookie: __cfduid=d959acad2190d0ddf56823b10d6793c371618754489; expires=Tue, 18-May-21 14:01:29 GMT; path=/; domain=.rhardih.io; HttpOnly; SameSite=Lax
last-modified: Sun, 18 Apr 2021 13:12:45 GMT
etag: "cb13-5c03ef80ea76d"
accept-ranges: bytes
strict-transport-security: max-age=31536000
cf-cache-status: DYNAMIC
cf-request-id: 0986e266210000d881823ae000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=mQ4KV6cFG5W5iRV%2FSdu5CQXBdMryWNtlCn8jA29dJC44M8Hl5ARNdhBrIKYrhLCdsT%2FbD8QN07HEYgtWDXnGyV%2BC%2BA2Vj6UTFTC6"}],"max_age":604800}
nel: {"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 641e6ce9cf77d881-CPH
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400

This got me thinking if it might be possible, to construct some minimal set of requests, that only gets the part of the ZIP file containing information about its content.

I didn’t really know anything about the ZIP file format beforehand, so this might be trivial if you are already familiar, but as it turns out, ZIP files contain information about their contents in a data block at the end of the file called the Central Directory.

This means that it’s only this part of the archive that’s required in order to list out the content.

On-disk layout of a binary file format that is both the ZIP container and its relevance in the 64-bit format. — https://commons.wikimedia.org/wiki/File:ZIP-64_Internal_Layout.svg

HTTP range requests are specified by setting a header that has the form: Range: bytes=<from>-<to>, so that means if we can somehow get a hold of the byte offset of the Central Directory and how many bytes it takes up in size, we can issue a range request that should only carry the Central Directory in the response.

The offsets we need are both part of the End of central directory record (EOCD), another data block, which appears after the Central Directory, as the very last part of the ZIP archive. It has variable length, due to the option of including a comment as the last field of the record. If there’s no comment it should only be 22 bytes.

Back to square one. We have to solve the same problem to get just the EOCD, as we have for the Central Directory. Since the EOCD is at the very end of the archive, to corresponds to the Content-Length of the file. We can get that simply by issuing a HEAD request:

$ curl --head https://rhardih.io/wp-content/uploads/2021/04/test.zip
HTTP/2 200
date: Sun, 18 Apr 2021 14:45:22 GMT
content-type: application/zip
content-length: 51987
set-cookie: __cfduid=dd56ae29f49cf9931ac1d5977926f61c01618757122; expires=Tue, 18-May-21 14:45:22 GMT; path=/; domain=.rhardih.io; HttpOnly; SameSite=Lax
last-modified: Sun, 18 Apr 2021 13:12:45 GMT
etag: "cb13-5c03ef80ea76d"
accept-ranges: bytes
strict-transport-security: max-age=31536000
cf-cache-status: DYNAMIC
cf-request-id: 09870a92ce000010c1d6269000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=Ko46rGYqFfKG0A2iY93XNqjK7PSIca9m9AK5iX9bfUUYr0%2BzdzjMN1IJXQ%2Fn5zjj%2B96d2%2Bnaommr%2FOUaGrzKpqyUjaeme0HGvA1z"}],"max_age":604800}
nel: {"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 641ead314d8710c1-CPH
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400

In the case of the test file; 51987 bytes. So far so good, but here’s where we have to cut some corners. Due to the comment part of the EOCD being variable length, we cannot know the proper from offset, so we’ll have to make a guess here, e.g. 100 bytes:

$ curl -s -O -H "Range: bytes=51887-51987" https://rhardih.io/wp-content/uploads/2021/04/test.zip
[~/Code/stand/by] (master)
$ hexdump test.zip
0000000 7c 60 dd 2f 7c 60 50 4b 01 02 15 03 14 00 08 00
0000010 08 00 5b 79 92 52 58 64 08 f4 05 28 00 00 00 28
0000020 00 00 0e 00 0c 00 00 00 00 00 00 00 00 40 a4 81
0000030 44 a1 00 00 72 61 6e 64 6f 6d 2d 62 79 74 65 73
0000040 2e 31 55 58 08 00 ba 2f 7c 60 dd 2f 7c 60 50 4b
0000050 05 06 00 00 00 00 05 00 05 00 68 01 00 00 95 c9
0000060 00 00 00 00
0000064

Since we’ve most likely have preceding bytes that we don’t care about, we need to scan the response until we find the EOCD signature, 0x06054b50, (in network byte order). From there extracting the offset and size for the Central Directory is straightforward. In the case above we find it at 0x0000c995, with a size of 0x00000168 (or 51605 and 360 base 10 respectively).

One more curl command to get the Central Directory:

$ curl -s -O -H "Range: bytes=51605-51987" https://rhardih.io/wp-content/uploads/2021/04/test.zip

Notice I’m including the EOCD here, but that’s just so we can use zipinfo on the file. Really to would be 51965.

Here’s a zipinfo of the original file:

$ zipinfo test.zip
Archive:  test.zip
Zip file size: 51987 bytes, number of entries: 5
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.3
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.4
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.5
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.2
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.1
5 files, 51200 bytes uncompressed, 51225 bytes compressed:  0.0%

And here it is of the stripped one:

$ zipinfo test.zip
Archive:  test.zip
Zip file size: 382 bytes, number of entries: 5
error [test.zip]:  missing 51605 bytes in zipfile
  (attempting to process anyway)
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.3
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.4
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.5
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.2
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.1
5 files, 51200 bytes uncompressed, 51225 bytes compressed:  0.0%

Ruby implementation

A bunch of curl commands is all well and good, but in my case I actually needed it as part of another script, which was written in Ruby.

Here’s a utility function, which basically does the same thing as above, and returns a list of filenames:

EOF

Obviously this whole dance might be a bit of an over-complication for smaller zip files, where you might as well just download the whole thing, and use normal tools to list out the content, but for very large archives maybe there’s something to this trick after all.

If you know of a better or easier way to accomplish this task, feel free to leave a comment or ping me on Twitter.

Over and out.

Addendum

After posting this, it’s been pointed out to me, that the initial HEAD request is redundant, since the Range header actually supports indexing relative to EOF.

I had a hunch this should be supported, but as it wasn’t part of one of the examples on the MDN page, I overlooked it.

In section 2.1 Byte Ranges of the RFC, the format is clearly specified:

   A client can request the last N bytes of the selected representation
   using a suffix-byte-range-spec.

     suffix-byte-range-spec = "-" suffix-length
     suffix-length = 1*DIGIT

This means we can start right from the initial GET request and just specify a range for the last 100 bytes:

$ curl -s -O -H "Range: bytes=-100" https://rhardih.io/wp-content/uploads/2021/04/test.zip

Here’s the updated Ruby script to match:

5 thoughts on “Listing the contents of a remote ZIP archive, without downloading the entire file”

Julik

April 18, 2021 at 8:22 pm

This is indeed a very viable approach and we’ve used it with great success. A while ago we needed to do this on a very massive scale (millions of ZIPs) and we have developed our findings into the reader in zip_tricks (a Ruby zip library). It will indeed read a ZIP for you using Range: requests. It will also deal with ZIP64 for you (so larger files are possible) and it also fixes a little problem you have in your code where multiple EOCD signatures are placed close to each other in the byte stream, check it out 😉
- René Hansen
  
  April 19, 2021 at 4:24 am
  
  It looks like it boils down to:
  
  io_obj = ZipTricks::RemoteIO.new(some_url) entries = ZipTricks::FileReader.read_zip_structure(io: io_object) entries.each { |e| puts e.filename }
  
  Very nice!
  
  Link for the lazy: https://github.com/WeTransfer/zip_tricks.
Niklas

April 19, 2021 at 3:38 pm

Thats very interesting, I am curious if this could be somewhat used to get a „docker like“ layering system with plain old zip files. 🧐

Thanks for the nice article!
Sam

April 21, 2021 at 10:37 am

Range: bytes=-100

I’d be really interested to survey random ZIP files found on the Internet to see the distribution of comment lengths, too. Does the last 100B of the file include the EOCD on 95% of ZIP files? 99%? 90%?

My gut would prefer an initial value closer to the network MTU/MSS. As long as the response will still fit in one packet, making your range bigger is basically “free”. In your HEAD example, the headers take 957B (gosh, CloudFlare is noisy), so you can ask for 500B and still fit (barely) in the usual MSS of 1460B. But maybe that’s totally unnecessary.

Cool stuff!
Justin Winokur

April 22, 2021 at 3:51 pm

I did something similar to this using Python’s zipfile and a non-cached rclone mount.

The write up is designed around B2 but any rclone remote would work and it removes needing to think about ranges.

https://nbviewer.jupyter.org/gist/Jwink3101/c531b0e1f47504ea528dc4da9716b8de

Ruby implementation

EOF

Addendum

5 thoughts on “Listing the contents of a remote ZIP archive, without downloading the entire file”

Leave a Comment Cancel reply