Listing the contents of a remote ZIP archive, without downloading the entire file

Why I needed to do this is a longer story, but this was a question I was looking for an answer to.

Initially it led me to the following SO question:

Is it possible to download just part of a ZIP archive (e.g. one file)?

Not exactly the same problem but close enough. The suggestion here is to mount the archive using HTTPFS and use normal zip tools. The part of the answer that caught my eye was this:

This way the unzip utility’s I/O calls are translated to HTTP range gets

https://stackoverflow.com/a/15321699

HTTP range requests are a clever way to get a web server to send you only parts of a file. They require that the web server supports them though. You can check whether that’s the case with a simple curl command; look for accept-ranges: bytes.

I’ve added a simple test archive, with some garbage content files, as a test subject here:

$ curl --head https://rhardih.io/wp-content/uploads/2021/04/test.zip
HTTP/2 200
date: Sun, 18 Apr 2021 14:01:29 GMT
content-type: application/zip
content-length: 51987
set-cookie: __cfduid=d959acad2190d0ddf56823b10d6793c371618754489; expires=Tue, 18-May-21 14:01:29 GMT; path=/; domain=.rhardih.io; HttpOnly; SameSite=Lax
last-modified: Sun, 18 Apr 2021 13:12:45 GMT
etag: "cb13-5c03ef80ea76d"
accept-ranges: bytes
strict-transport-security: max-age=31536000
cf-cache-status: DYNAMIC
cf-request-id: 0986e266210000d881823ae000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=mQ4KV6cFG5W5iRV%2FSdu5CQXBdMryWNtlCn8jA29dJC44M8Hl5ARNdhBrIKYrhLCdsT%2FbD8QN07HEYgtWDXnGyV%2BC%2BA2Vj6UTFTC6"}],"max_age":604800}
nel: {"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 641e6ce9cf77d881-CPH
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400

This got me thinking: might it be possible to construct some minimal set of requests that only fetches the part of the ZIP file containing information about its contents?

I didn’t really know anything about the ZIP file format beforehand, so this might be trivial if you’re already familiar with it, but as it turns out, ZIP files store information about their contents in a data block at the end of the file, called the Central Directory.

This means that only this part of the archive is required in order to list out the contents.

(Image: on-disk layout of the ZIP file format, including the 64-bit ZIP64 variant)
https://commons.wikimedia.org/wiki/File:ZIP-64_Internal_Layout.svg

HTTP range requests are specified by setting a header that has the form: Range: bytes=<from>-<to>, so that means if we can somehow get a hold of the byte offset of the Central Directory and how many bytes it takes up in size, we can issue a range request that should only carry the Central Directory in the response.

The offsets we need are both part of the End of central directory record (EOCD), another data block, which appears after the Central Directory, as the very last part of the ZIP archive. It has variable length, due to the option of including a comment as the last field of the record. If there’s no comment it should only be 22 bytes.
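
For reference, the fixed part of the EOCD record is laid out like this:

offset  size  field
0       4     signature (0x06054b50)
4       2     number of this disk
6       2     disk where the Central Directory starts
8       2     Central Directory records on this disk
10      2     total number of Central Directory records
12      4     size of the Central Directory in bytes
16      4     offset of the Central Directory from the start of the file
20      2     comment length (n)
22      n     comment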

Back to square one. We have to solve the same problem for the EOCD as we have for the Central Directory. Since the EOCD is at the very end of the archive though, its to offset simply corresponds to the Content-Length of the file. We can get that by issuing a HEAD request:

$ curl --head https://rhardih.io/wp-content/uploads/2021/04/test.zip
HTTP/2 200
date: Sun, 18 Apr 2021 14:45:22 GMT
content-type: application/zip
content-length: 51987
set-cookie: __cfduid=dd56ae29f49cf9931ac1d5977926f61c01618757122; expires=Tue, 18-May-21 14:45:22 GMT; path=/; domain=.rhardih.io; HttpOnly; SameSite=Lax
last-modified: Sun, 18 Apr 2021 13:12:45 GMT
etag: "cb13-5c03ef80ea76d"
accept-ranges: bytes
strict-transport-security: max-age=31536000
cf-cache-status: DYNAMIC
cf-request-id: 09870a92ce000010c1d6269000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=Ko46rGYqFfKG0A2iY93XNqjK7PSIca9m9AK5iX9bfUUYr0%2BzdzjMN1IJXQ%2Fn5zjj%2B96d2%2Bnaommr%2FOUaGrzKpqyUjaeme0HGvA1z"}],"max_age":604800}
nel: {"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 641ead314d8710c1-CPH
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400

In the case of the test file: 51987 bytes. So far so good, but here’s where we have to cut some corners. Because the comment part of the EOCD is variable length, we cannot know the proper from offset, so we’ll have to make a guess here, e.g. the last 100 bytes:

$ curl -s -O -H "Range: bytes=51887-51987" https://rhardih.io/wp-content/uploads/2021/04/test.zip
$ hexdump test.zip
0000000 7c 60 dd 2f 7c 60 50 4b 01 02 15 03 14 00 08 00
0000010 08 00 5b 79 92 52 58 64 08 f4 05 28 00 00 00 28
0000020 00 00 0e 00 0c 00 00 00 00 00 00 00 00 40 a4 81
0000030 44 a1 00 00 72 61 6e 64 6f 6d 2d 62 79 74 65 73
0000040 2e 31 55 58 08 00 ba 2f 7c 60 dd 2f 7c 60 50 4b
0000050 05 06 00 00 00 00 05 00 05 00 68 01 00 00 95 c9
0000060 00 00 00 00
0000064

Since we most likely have preceding bytes that we don’t care about, we need to scan the response until we find the EOCD signature, 0x06054b50, which is stored little-endian and therefore appears in the dump as 50 4b 05 06. From there, extracting the offset and size of the Central Directory is straightforward. In the case above we find the offset to be 0x0000c995 and the size 0x00000168 (51605 and 360 in base 10, respectively).
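
In Ruby, for example, the scan and extraction could look roughly like this (a sketch; unpack with "V" reads an unsigned 32-bit little-endian value):

# Find the EOCD signature in a tail chunk of the archive and unpack
# the Central Directory size and offset from the record.
EOCD_SIG = "PK\x05\x06".b # 0x06054b50 stored little-endian

def central_directory_location(tail)
  index = tail.rindex(EOCD_SIG) or raise "EOCD signature not found"

  # Skip signature(4), disk numbers(2+2) and entry counts(2+2), then
  # read CD size(4) and CD offset(4).
  size, offset = tail[index + 12, 8].unpack("VV")

  [offset, size]
end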

One more curl command to get the Central Directory:

$ curl -s -O -H "Range: bytes=51605-51987" https://rhardih.io/wp-content/uploads/2021/04/test.zip

Notice I’m including the EOCD here, but that’s just so we can use zipinfo on the file. Really the to offset would be 51964, the last byte of the Central Directory (range offsets are inclusive).

Here’s a zipinfo of the original file:

$ zipinfo test.zip
Archive:  test.zip
Zip file size: 51987 bytes, number of entries: 5
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.3
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.4
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.5
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.2
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.1
5 files, 51200 bytes uncompressed, 51225 bytes compressed:  0.0%

And here it is of the stripped one:

$ zipinfo test.zip
Archive:  test.zip
Zip file size: 382 bytes, number of entries: 5
error [test.zip]:  missing 51605 bytes in zipfile
  (attempting to process anyway)
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.3
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.4
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.5
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.2
-rw-r--r--  2.1 unx    10240 bX defN 21-Apr-18 15:10 random-bytes.1
5 files, 51200 bytes uncompressed, 51225 bytes compressed:  0.0%

Ruby implementation

A bunch of curl commands is all well and good, but in my case I actually needed it as part of another script, which was written in Ruby.

Here’s a utility function that does essentially the same thing as the curl commands above and returns a list of filenames.

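What follows is a minimal sketch of such a function, using net/http from the standard library, and assuming a plain (non-ZIP64) archive whose EOCD fits within the last 100 bytes:

require "net/http"
require "uri"

# List the filenames of a remote ZIP archive by fetching only the
# End of Central Directory record and the Central Directory.
def remote_zip_filenames(url)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    length = http.head(uri.path)["Content-Length"].to_i

    # Grab the tail of the file, which should contain the EOCD.
    tail = http.get(uri.path, "Range" => "bytes=#{length - 100}-#{length - 1}").body

    # Locate the EOCD signature, then unpack the Central Directory
    # size and offset, which sit 12 bytes into the record.
    eocd_index = tail.rindex("PK\x05\x06".b)
    cd_size, cd_offset = tail[eocd_index + 12, 8].unpack("VV")

    # Fetch the Central Directory, and nothing else.
    cd = http.get(uri.path, "Range" => "bytes=#{cd_offset}-#{cd_offset + cd_size - 1}").body

    # Each Central Directory file header starts with the signature
    # 0x02014b50 and holds the filename at offset 46.
    filenames = []
    offset = 0

    while cd[offset, 4] == "PK\x01\x02".b
      name_len, extra_len, comment_len = cd[offset + 28, 6].unpack("vvv")
      filenames << cd[offset + 46, name_len]
      offset += 46 + name_len + extra_len + comment_len
    end

    filenames
  end
end

puts remote_zip_filenames("https://rhardih.io/wp-content/uploads/2021/04/test.zip")

Running it against the test archive should print the five random-bytes filenames.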

Obviously this whole dance might be a bit of an over-complication for smaller ZIP files, where you might as well just download the whole thing and use normal tools to list out the contents, but for very large archives maybe there’s something to this trick after all.

If you know of a better or easier way to accomplish this task, feel free to leave a comment or ping me on Twitter.

Over and out.

Addendum

After posting this, it’s been pointed out to me that the initial HEAD request is redundant, since the Range header actually supports indexing relative to EOF.

I had a hunch this should be supported, but as it wasn’t part of one of the examples on the MDN page, I overlooked it.

In section 2.1 Byte Ranges of RFC 7233, the format is clearly specified:

   A client can request the last N bytes of the selected representation
   using a suffix-byte-range-spec.

     suffix-byte-range-spec = "-" suffix-length
     suffix-length = 1*DIGIT

This means we can start right from the initial GET request and just specify a range for the last 100 bytes:

$ curl -s -O -H "Range: bytes=-100" https://rhardih.io/wp-content/uploads/2021/04/test.zip

Here’s the updated Ruby script to match.
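
Again just a sketch; the HEAD request disappears and the first range request becomes a suffix range:

require "net/http"
require "uri"

# Same as before, except the tail is fetched with a suffix range,
# making the initial HEAD request unnecessary.
def remote_zip_filenames(url)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    # The last 100 bytes of the file, no Content-Length required.
    tail = http.get(uri.path, "Range" => "bytes=-100").body

    eocd_index = tail.rindex("PK\x05\x06".b)
    cd_size, cd_offset = tail[eocd_index + 12, 8].unpack("VV")

    cd = http.get(uri.path, "Range" => "bytes=#{cd_offset}-#{cd_offset + cd_size - 1}").body

    filenames = []
    offset = 0

    while cd[offset, 4] == "PK\x01\x02".b
      name_len, extra_len, comment_len = cd[offset + 28, 6].unpack("vvv")
      filenames << cd[offset + 46, name_len]
      offset += 46 + name_len + extra_len + comment_len
    end

    filenames
  end
end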

Interactive visualisation of Pi and friends with D3.js

Inspiration

I recently stumbled onto the magnificent posters created by Martin Krzywinski over at http://mkweb.bcgsc.ca/pi/. He’s come up with some truly original ways to illustrate the appearance and complexity of various irrational numbers.

Specifically, I really liked the minimalism and simplicity of the 2013 edition, shown here:

(Image: pi-dots-01, the 2013 edition poster)

The rules used for generating the coloured dots are fascinatingly simple. Each digit, 0-9, is assigned a unique colour. The i’th circle is then coloured according to the value of the i’th digit of Pi. The smaller circle inside is coloured based on the value of the following digit.

Simple rules, complex outcome.

Martin created other versions as well, e.g. one where adjacent equal digits are connected by same-coloured lines. They’re all fascinating, and you can even buy them as posters!

Interactive

Looking at these posters I couldn’t help wondering what lay beyond the chosen boundaries of each poster. What if some great pattern or sequence was lurking just out of view? If only there were a way to explore further digits and other constellations using the same visualisation format.

Always the tinkerer, this got me thinking about how to create something like that. One thing led to another and before long I had a rough prototype cobbled together with D3.js.

I spent a bit more time adding a few input controls and polishing it into a neat little demo. I’ve put it up at:

winski.rhardih.io

Go try it out! Instructions are at the bottom.

As an example, this is how the Feynman Point looks at a column width of 31:

(Image: screenshot of the Feynman Point, column width 31)

For the more curious out there, I’ve put the code on GitHub at github.com/rhardih/winski.

Happy π hunting!

Dead simple concurrency limitation in Go

Backgrounding long running tasks is a classic and ubiquitous problem for web applications.

E.g. Something triggered the need to download and manipulate a file, but we don’t want to hold up the main thread responsible for bringing a response back to the client.

Most likely you’d want to offload this task to a background worker or another service, but sometimes it’s nice to be able to just handle the processing right then and there.

Go makes concurrency incredibly simple with the go keyword. So simple in fact, that you might quickly run into problems if you are handling files in this manner and just spin up new goroutines for everything.

File descriptors

Herein lies the problem. Most systems don’t have an unlimited number of file descriptors for a process to use. Probing the two machines within my current reach yields the following:

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.11.4
BuildVersion:   15E65
$ ulimit -n
256
~$ cat /etc/issue
Ubuntu 14.04.4 LTS \n \l
 
~$ ulimit -n
1024

Evidently there aren’t all that many available on either system. Opening a ton of sockets and files at once will inevitably lead to errors: “Too many open files” or similar.
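
For what it’s worth, a process can also check its own limit programmatically. A small Go sketch using the standard syscall package (Unix-like systems only):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var rl syscall.Rlimit

    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        panic(err)
    }

    // Cur is the soft limit reported by ulimit -n; Max is the hard
    // ceiling the soft limit can be raised to.
    fmt.Printf("file descriptors: soft=%d hard=%d\n", rl.Cur, rl.Max)
}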

Limitation

Go’s concurrency primitives lend themselves to the definitions laid out by Tony Hoare in Communicating Sequential Processes, and as such Go has the concept of channels.

Channels provide the necessary “blocking” mechanism that allows a collection of goroutines to handle a common workload. Instead of creating a lot of goroutines all at once, a handful can be created, each waiting to handle incoming items on a channel.

Below is a very simple example, where a channel (buffered with five slots*) is created as a work queue. Subsequently, five goroutines are created, each set to wait on incoming work from the queue. In the main thread, the queue is then filled up all at once with work for the goroutines.

Once all the work has been loaded onto the queue, the channel is closed, which is Go’s way of telling consuming goroutines that nothing more will appear on the channel. This works in tandem with the range keyword to keep receiving on the channel until it is closed.

Note that the WaitGroup is effectively a monitor, another concurrency construct, which allows the main thread to wait for all the goroutines to finish.
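
A minimal version matching that description could look something like this, with a two-second sleep standing in for real work:

package main

import (
    "log"
    "sync"
    "time"
)

func main() {
    log.SetFlags(log.Ltime)

    queue := make(chan string, 5) // the work queue (buffering optional, see the edit below)
    var wg sync.WaitGroup

    // Create five workers, each receiving items off the shared queue.
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // range keeps receiving until the channel is closed.
            for work := range queue {
                time.Sleep(2 * time.Second) // simulate slow work
                log.Printf("Worker %d working on %s", id, work)
            }
        }(i)
    }

    // Fill up the queue; sends block whenever the buffer is full and
    // all workers are busy.
    for _, work := range []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o"} {
        queue <- work
        log.Printf("Work %s enqueued", work)
    }

    close(queue) // nothing more will appear on the channel
    wg.Wait()    // wait for all workers to finish
}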

Running the program produces the following output:

$ go run conc.go
10:25:37 Work a enqueued
10:25:37 Work b enqueued
10:25:37 Work c enqueued
10:25:37 Work d enqueued
10:25:37 Work e enqueued
10:25:37 Work f enqueued
10:25:39 Worker 1 working on a
10:25:39 Worker 3 working on d
10:25:39 Work g enqueued
10:25:39 Work h enqueued
10:25:39 Worker 2 working on b
10:25:39 Work i enqueued
10:25:39 Worker 4 working on e
10:25:39 Work j enqueued
10:25:39 Worker 0 working on c
10:25:39 Work k enqueued
10:25:41 Worker 0 working on j
10:25:41 Worker 3 working on g
10:25:41 Worker 2 working on h
10:25:41 Worker 1 working on f
10:25:41 Worker 4 working on i
10:25:41 Work l enqueued
10:25:41 Work m enqueued
10:25:41 Work n enqueued
10:25:41 Work o enqueued
10:25:43 Worker 0 working on k
10:25:43 Worker 1 working on n
10:25:43 Worker 3 working on l
10:25:43 Worker 2 working on m
10:25:43 Worker 4 working on o

Go’s concurrency primitives are that rare combination of easy and powerful, making it effortless to write threaded code.

It doesn’t save you from the inherent limitations of the host system however, which is a good thing. Awareness of what the code actually does on the machine is a virtue to strive for.

Edit: Thanks to Jemma for pointing out that buffering the channel isn’t needed after all.

This post was included in the Go Newsletter issue 110.

Page specific Javascript in Rails 3

Premise

One of the neat features from Rails 3.1 and up is the asset pipeline:

The asset pipeline provides a framework to concatenate and minify or compress JavaScript and CSS assets. It also adds the ability to write these assets in other languages such as CoffeeScript, Sass and ERB.

This means that in production you will have one big JavaScript file and also one big CSS file. This reduces the number of requests the browser has to make and generally makes pages load faster.

In the case of JavaScript concatenation however, it does bring about a problem. Executing code when the DOM has loaded is commonplace in most web applications today, but if everything is included in one big file, and more importantly the same file, for all actions on all controllers, how do you run code that is specific to just a single view?

Solution(s)

Obviously there is more than one way of solving this problem, and rather unlike Rails, there doesn’t seem to be any “best practice” dictated. The closest I found is this excerpt from section 2 of the Rails Guide about the Asset Pipeline:

You should put any JavaScript or CSS unique to a controller inside their respective asset files, as these files can then be loaded just for these controllers with lines such as <%= javascript_include_tag params[:controller] %> or <%= stylesheet_link_tag params[:controller] %>.

And it isn’t even followed by an example, which seems more like an indication that this isn’t something you should do at all.

Let’s start with this example nonetheless.

Per controller inclusion

By default, Rails has only one top-level JavaScript manifest file, namely app/assets/javascripts/application.js, which has the following content:

// This is a manifest file that'll be compiled into including all the files listed below.
// Add new JavaScript/Coffee code in separate files in this directory and they'll automatically
// be included in the compiled file accessible from http://example.com/assets/application.js
// It's not advisable to add code directly here, but if you do, it'll appear at the bottom of the
// the compiled file.
//
//= require jquery
//= require jquery_ujs
//= require_tree .

And this is included in the default layout with:

<%= javascript_include_tag "application" %>

N.B. When testing production on localhost, e.g. with rails s -e production, Rails by default won’t serve static assets, which is what application.js becomes after pre-compilation. To avoid any problems when testing production locally, the following setting needs to be changed from false to true in config/environments/production.rb:

# Disable Rails's static asset server (Apache or nginx will already do this)
config.serve_static_assets = true

Now let’s say we have a controller, let’s call it ApplesController, and its corresponding CoffeeScript file, apples.js.coffee. We might try to include it as per the Rails Guide suggestion like so:

<%= javascript_include_tag params[:controller] %>

And this will work just fine in development mode, but in production it produces the following error:

ActionView::Template::Error (apples.js isn't precompiled):

To remedy this, we need to do a couple of things. First off, we should remove the require_tree . directive from application.js, so we don’t wind up including the same script twice. Just removing the equals sign is enough:

//  require_tree .

To avoid a name clash, rename apples.js.coffee to something else, e.g. apples.controller.js.coffee. Then create a new manifest file named apples.js, which requires the CoffeeScript file:

//= require apples.controller

Lastly, the default configuration of Rails only includes and pre-compiles application.js, so we need to tell the pre-compiler to now also include apples.js. This also happens in config/environments/production.rb. Uncomment the following setting and change search.js to apples.js:

# Precompile additional assets (application.js, application.css, and all non-JS/CSS are already added)
config.assets.precompile += %w( apples.js )

Note that this is a pattern match, so it could also be something like '*.js', in case you have more manifests, which would be the case for per-controller inclusion.

Views

The same concept as above could be extended to target individual actions/views of each controller, by having the action be part of the manifest name. Individual JavaScript files could then be included like so:

<%= javascript_include_tag "#{params[:controller]}.#{params[:action}" %>

This makes the assumption that all actions on all controllers have a dedicated JavaScript file, an assumption which most likely won’t hold in most cases. Another option could be a conditional include like so:

<%= yield :action_specific_js if content_for?(:action_specific_js) %>

And then move the include tag to the specific views that need it.

Testing for existence of a page element or class

The DOM loaded event handler could look something like this:

jQuery ->
  if $('#some_element').length > 0
    # Do some stuff here

This could also be a class on body, e.g.:

jQuery ->
  if $('body.controller_name_action_name').length > 0
    # Do some stuff here

And then the ERB would be like this:

<!DOCTYPE html>
<html>
<head>
  <title>AppName</title>
  <%= stylesheet_link_tag    "application" %>
  <%= javascript_include_tag "application" %>
  <%= csrf_meta_tags %>
</head>
<body class="<%= "#{params[:controller]}_#{params[:action]}" %>">
 
<%= yield %>
 
</body>
</html>

Function encapsulation and on-page triggering

Instead of registering the handlers for DOM loaded, wrap the necessary code in a function that can be called later and then trigger that function directly in the respective view.

There is one thing we need to consider though. The CoffeeScript source for each controller gets wrapped in its own closed scope, i.e. this CoffeeScript in apples.js.coffee:

apples_index = ->
  console.log("Hello! Yes, this is Apples.")

Becomes:

(function() {
  var apples_index;

  apples_index = function() {
    return console.log("Hello! Yes, this is Apples.");
  };

}).call(this);

So in order for us to have a globally callable function, we must first expose it somehow. We can do this by attaching the function to the window object, changing the above code like so:

window.exports ||= {}
window.exports.apples_index = ->
  console.log("Hello! Yes, this is Apples.")

If we insert the yield :action_specific_js line in the application.html.erb layout, just before the closing body tag:

<!DOCTYPE html>                                                                           
<html>                                                                                    
<head>                                                                                    
  <title>AppName</title>                                                                   
  <%= stylesheet_link_tag    "application" %>                                             
  <%= javascript_include_tag "application" %>                                             
  <%= csrf_meta_tags %>                                                                   
</head>                                                                                   
<body>                                                                                    
 
<%= yield %>                                                                              
 
<%= yield :action_specific_js if content_for?(:action_specific_js) %>                     
</body>                                                                                   
</html>

We can now call the exposed function directly from our view like so:

<% content_for :action_specific_js do %>
<script type="text/javascript" language="javascript">
  $(function() { window.exports.apples_index(); });
</script>
<% end %>

Wrap up

None of these three examples is a one-size-fits-all solution, I would say. Dividing up the JavaScript source will start to make sense as soon as the JavaScript codebase grows past a certain size. It might be interesting to test out just how big that size is at a given bandwidth, but I think that’s out of the scope of this post.

Given the fact that there isn’t really a defined best practice yet, perhaps the Ruby community will come up with something better than the examples I’ve presented here. In my opinion, this is definitely something that could be better thought out.