Why I needed to do this is a longer story, but it was a question I was looking for an answer to.
Initially it led me to the following SO question:
Is it possible to download just part of a ZIP archive (e.g. one file)?
Not exactly the same problem but close enough. The suggestion here is to mount the archive using HTTPFS and use normal zip tools. The part of the answer that caught my eye was this:
This way the unzip utility’s I/O calls are translated to HTTP range gets
https://stackoverflow.com/a/15321699
HTTP range requests are a clever way of getting a web server to send you only parts of a file. The server has to support them, though. You can check whether that’s the case with a simple curl command; look for accept-ranges: bytes in the response headers.
I’ve added a simple test archive, with some garbage content files, as a test subject here:
$ curl --head https://rhardih.io/wp-content/uploads/2021/04/test.zip
HTTP/2 200
date: Sun, 18 Apr 2021 14:01:29 GMT
content-type: application/zip
content-length: 51987
set-cookie: __cfduid=d959acad2190d0ddf56823b10d6793c371618754489; expires=Tue, 18-May-21 14:01:29 GMT; path=/; domain=.rhardih.io; HttpOnly; SameSite=Lax
last-modified: Sun, 18 Apr 2021 13:12:45 GMT
etag: "cb13-5c03ef80ea76d"
accept-ranges: bytes
strict-transport-security: max-age=31536000
cf-cache-status: DYNAMIC
cf-request-id: 0986e266210000d881823ae000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=mQ4KV6cFG5W5iRV%2FSdu5CQXBdMryWNtlCn8jA29dJC44M8Hl5ARNdhBrIKYrhLCdsT%2FbD8QN07HEYgtWDXnGyV%2BC%2BA2Vj6UTFTC6"}],"max_age":604800}
nel: {"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 641e6ce9cf77d881-CPH
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400
This got me thinking: might it be possible to construct some minimal set of requests that only fetches the part of the ZIP file containing information about its contents?
I didn’t really know anything about the ZIP file format beforehand, so this might be trivial if you’re already familiar with it, but as it turns out, ZIP files store the information about their contents in a data block at the end of the file, called the Central Directory.
This means only this part of the archive is needed in order to list out the contents.
HTTP range requests are made by setting a header of the form Range: bytes=<from>-<to>. That means if we can somehow get hold of the byte offset of the Central Directory, and how many bytes it takes up, we can issue a range request whose response should carry only the Central Directory.
The offsets we need are both part of the End of Central Directory record (EOCD), another data block, which appears after the Central Directory as the very last part of the ZIP archive. It has variable length, because the record may end with a comment of arbitrary length. If there’s no comment, it’s only 22 bytes.
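Parsing that fixed 22-byte portion is a one-liner with String#unpack. Here’s a sketch in Ruby; the field names are mine, and the layout follows the ZIP APPNOTE ("V" and "v" are little-endian 32- and 16-bit unsigned integers):

```ruby
# Fixed portion of the End of Central Directory record: 22 bytes when the
# archive has no comment. Layout per the ZIP APPNOTE; field names are mine.
EOCD_LAYOUT = "VvvvvVVv" # all fields little-endian

def parse_eocd(record)
  sig, _disk, _cd_start_disk, _entries_this_disk, total_entries,
    cd_size, cd_offset, comment_length = record.unpack(EOCD_LAYOUT)
  raise "bad EOCD signature" unless sig == 0x06054b50
  { entries: total_entries, cd_size: cd_size,
    cd_offset: cd_offset, comment_length: comment_length }
end

# A hand-packed sample record: 5 entries, a 360-byte Central Directory
# starting at offset 51605, no comment.
record = [0x06054b50, 0, 0, 5, 5, 360, 51605, 0].pack(EOCD_LAYOUT)
parse_eocd(record)
```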
Back to square one: we have to solve the same problem for the EOCD as for the Central Directory. But since the EOCD sits at the very end of the archive, its to offset corresponds to the Content-Length of the file, which we can get simply by issuing a HEAD request:
$ curl --head https://rhardih.io/wp-content/uploads/2021/04/test.zip
HTTP/2 200
date: Sun, 18 Apr 2021 14:45:22 GMT
content-type: application/zip
content-length: 51987
set-cookie: __cfduid=dd56ae29f49cf9931ac1d5977926f61c01618757122; expires=Tue, 18-May-21 14:45:22 GMT; path=/; domain=.rhardih.io; HttpOnly; SameSite=Lax
last-modified: Sun, 18 Apr 2021 13:12:45 GMT
etag: "cb13-5c03ef80ea76d"
accept-ranges: bytes
strict-transport-security: max-age=31536000
cf-cache-status: DYNAMIC
cf-request-id: 09870a92ce000010c1d6269000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=Ko46rGYqFfKG0A2iY93XNqjK7PSIca9m9AK5iX9bfUUYr0%2BzdzjMN1IJXQ%2Fn5zjj%2B96d2%2Bnaommr%2FOUaGrzKpqyUjaeme0HGvA1z"}],"max_age":604800}
nel: {"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 641ead314d8710c1-CPH
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400
In the case of the test file, that’s 51987 bytes. So far so good, but here’s where we have to cut a corner: because the comment field of the EOCD is variable length, we cannot know the proper from offset, so we’ll have to make a guess, e.g. the last 100 bytes:
$ curl -s -O -H "Range: bytes=51887-51987" https://rhardih.io/wp-content/uploads/2021/04/test.zip
$ hexdump test.zip
0000000 7c 60 dd 2f 7c 60 50 4b 01 02 15 03 14 00 08 00
0000010 08 00 5b 79 92 52 58 64 08 f4 05 28 00 00 00 28
0000020 00 00 0e 00 0c 00 00 00 00 00 00 00 00 40 a4 81
0000030 44 a1 00 00 72 61 6e 64 6f 6d 2d 62 79 74 65 73
0000040 2e 31 55 58 08 00 ba 2f 7c 60 dd 2f 7c 60 50 4b
0000050 05 06 00 00 00 00 05 00 05 00 68 01 00 00 95 c9
0000060 00 00 00 00
0000064
Since we most likely have preceding bytes that we don’t care about, we need to scan the response until we find the EOCD signature, 0x06054b50, which appears in the byte stream in little-endian order as 50 4b 05 06. From there, extracting the offset and size of the Central Directory is straightforward. In the dump above we find the offset at 0x0000c995 and the size at 0x00000168 (51605 and 360 in base 10, respectively).
One more curl command to get the Central Directory:
$ curl -s -O -H "Range: bytes=51605-51987" https://rhardih.io/wp-content/uploads/2021/04/test.zip
Notice I’m including the EOCD here, but that’s just so we can use zipinfo on the file. Strictly, to would be 51964 (51605 + 360 - 1).
Here’s a zipinfo of the original file:
$ zipinfo test.zip
Archive: test.zip
Zip file size: 51987 bytes, number of entries: 5
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.3
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.4
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.5
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.2
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.1
5 files, 51200 bytes uncompressed, 51225 bytes compressed: 0.0%
And here it is for the stripped one:
$ zipinfo test.zip
Archive: test.zip
Zip file size: 382 bytes, number of entries: 5
error [test.zip]: missing 51605 bytes in zipfile
(attempting to process anyway)
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.3
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.4
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.5
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.2
-rw-r--r-- 2.1 unx 10240 bX defN 21-Apr-18 15:10 random-bytes.1
5 files, 51200 bytes uncompressed, 51225 bytes compressed: 0.0%
Ruby implementation
A bunch of curl commands is all well and good, but in my case I actually needed this as part of another script, written in Ruby.
Here’s a utility function that does essentially the same as the above and returns a list of filenames:
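A minimal sketch of what such a function might look like, using Ruby’s net/http. The function and constant names are my own, it assumes the EOCD sits within the last 100 bytes (i.e. the comment is short), and it doesn’t handle ZIP64:

```ruby
require "net/http"
require "uri"

EOCD_SIG = "PK\x05\x06".b # End of central directory record, 0x06054b50
CDFH_SIG = "PK\x01\x02".b # Central directory file header, 0x02014b50

# Scan a tail-of-file buffer backwards for the EOCD signature and return
# the Central Directory size and offset (bytes 12 and 16 of the record).
def locate_central_directory(tail)
  eocd = tail.rindex(EOCD_SIG) or raise "EOCD signature not found"
  tail[eocd + 12, 8].unpack("VV") # => [cd_size, cd_offset]
end

# Walk the Central Directory headers, collecting each entry's filename.
def parse_central_directory(cd)
  names, pos = [], 0
  while cd[pos, 4] == CDFH_SIG
    # Filename, extra field and comment lengths sit at offsets 28, 30, 32.
    name_len, extra_len, comment_len = cd[pos + 28, 6].unpack("vvv")
    names << cd[pos + 46, name_len] # filename starts at offset 46
    pos += 46 + name_len + extra_len + comment_len
  end
  names
end

# List the filenames of a remote ZIP without downloading the whole archive.
def remote_zip_filenames(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    length = http.head(uri.path)["Content-Length"].to_i
    # Guess: the EOCD is within the last 100 bytes of the file.
    tail = http.get(uri.path, "Range" => "bytes=#{length - 100}-#{length - 1}").body
    cd_size, cd_offset = locate_central_directory(tail)
    cd = http.get(uri.path,
                  "Range" => "bytes=#{cd_offset}-#{cd_offset + cd_size - 1}").body
    parse_central_directory(cd)
  end
end

# Usage (requires network access):
# puts remote_zip_filenames("https://rhardih.io/wp-content/uploads/2021/04/test.zip")
```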
Obviously this whole dance might be a bit of an over-complication for smaller ZIP files, where you might as well just download the whole thing and use normal tools to list the contents, but for very large archives there may be something to this trick after all.
If you know of a better or easier way to accomplish this task, feel free to leave a comment or ping me on Twitter.
Over and out.
Addendum
After posting this, it’s been pointed out to me that the initial HEAD request is redundant, since the Range header actually supports indexing relative to the end of the file.
I had a hunch this should be supported, but as it wasn’t among the examples on the MDN page, I overlooked it.
In section 2.1, Byte Ranges, of RFC 7233, the format is clearly specified:
A client can request the last N bytes of the selected representation
using a suffix-byte-range-spec.
suffix-byte-range-spec = "-" suffix-length
suffix-length = 1*DIGIT
This means we can start right from the initial GET request and just specify a range for the last 100 bytes:
$ curl -s -O -H "Range: bytes=-100" https://rhardih.io/wp-content/uploads/2021/04/test.zip
Here’s the updated Ruby script to match:
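A sketch of the change, again using net/http with names of my own: the HEAD request and the offset arithmetic collapse into a single suffix-range GET.

```ruby
require "net/http"
require "uri"

# Build the suffix byte-range header for the last N bytes of a resource.
def tail_range_header(bytes)
  { "Range" => "bytes=-#{bytes}" }
end

# Fetch the tail of a remote file in a single GET; no HEAD request needed.
def fetch_tail(url, bytes: 100)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.get(uri.path, tail_range_header(bytes)).body
  end
end
```

From there, scanning for the EOCD signature proceeds exactly as before.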
Comments
This is indeed a very viable approach, and we’ve used it with great success. A while ago we needed to do this on a very massive scale (millions of ZIPs), and we developed our findings into the reader in zip_tricks (a Ruby zip library). It will indeed read a ZIP for you using Range requests. It will also deal with ZIP64 for you (so larger files are possible), and it fixes a little problem you have in your code where multiple EOCD signatures are placed close to each other in the byte stream. Check it out 😉
It looks like it boils down to:
io_obj = ZipTricks::RemoteIO.new(some_url)
entries = ZipTricks::FileReader.read_zip_structure(io: io_object)
entries.each { |e| puts e.filename }
Very nice!
Link for the lazy: https://github.com/WeTransfer/zip_tricks.
That’s very interesting. I’m curious whether this could somehow be used to get a “docker-like” layering system with plain old zip files. 🧐
Thanks for the nice article!
Range: bytes=-100
I’d be really interested to survey random ZIP files found on the Internet to see the distribution of comment lengths, too. Does the last 100B of the file include the EOCD on 95% of ZIP files? 99%? 90%?
My gut would prefer an initial value closer to the network MTU/MSS. As long as the response still fits in one packet, making your range bigger is basically “free”. In your HEAD example, the headers take 957B (gosh, Cloudflare is noisy), so you can ask for 500B and still fit (barely) in the usual MSS of 1460B. But maybe that’s totally unnecessary. Cool stuff!
I did something similar to this using Python’s zipfile and a non-cached rclone mount.
The write-up is designed around B2, but any rclone remote would work, and it removes the need to think about ranges.
https://nbviewer.jupyter.org/gist/Jwink3101/c531b0e1f47504ea528dc4da9716b8de