Speeding up with parallel compression – pbzip2

Premise

Today I found myself in need of archiving some virtual machines, which are often rather large. The machine I was working on was a Xeon powerhouse with 4 cores (8 threads with Hyper-Threading), and I was curious to see if there was any way to speed up compression for this particular task.

Looking into things

I usually reach for my trusty old friend tar when creating archives, and it does get the job done well. The thing about tar, though, is that the gzip compression behind its z option is single-threaded, so it doesn’t really matter how many CPU cores you throw at it.

After digging around a bit I found pbzip2. Description:

pbzip2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines.

Sounds good, right? I decided to try it out and measure the results. The size of the virtual machine was about 11G:

~$ du -hs *
11G     WinXP_32Bit

With plain old tar and gzip (the z flag) it took a little over eight minutes:

~$ time tar zcvf winxp.tar.gz WinXP_32Bit
 
real    8m18.583s
user    6m47.089s
sys     0m15.129s

Not bad, but piping the archive through pbzip2 cut that to just under five minutes:

~$ time tar -c WinXP_32Bit | pbzip2 -c > winxp.tar.bz2
 
real    4m54.942s
user    38m22.452s
sys     0m25.022s
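
The same archive can also be produced without an explicit pipe, by letting tar launch the compressor. The following is only a sketch: archive_dir is a made-up helper name, it assumes a tar that supports --use-compress-program (GNU tar does), and the fallback to plain bzip2 is my own addition for machines where pbzip2 is not installed.

```shell
# Sketch: let tar start the compressor itself.
# archive_dir is a made-up helper name. Assumes a tar that supports
# --use-compress-program (GNU tar does).
archive_dir() {
    src_dir="$1"      # directory to archive
    out_file="$2"     # e.g. winxp.tar.bz2

    # Prefer pbzip2; fall back to plain bzip2 so the command still works
    # (single-threaded) on machines without it.
    if command -v pbzip2 >/dev/null 2>&1; then
        compressor=pbzip2
    else
        compressor=bzip2
    fi

    tar --use-compress-program="$compressor" -cf "$out_file" "$src_dir"
}

# Usage:
# archive_dir WinXP_32Bit winxp.tar.bz2
```

Either way the result is an ordinary .tar.bz2 file, so it can be read back with regular bzip2 tools as well.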

[Screenshots: htop showing the difference in CPU core utilisation between the two runs]

Both resulting archives were of equal size, so the immediate benefit is purely speed:

~$ du -hs *
11G     WinXP_32Bit
6.2G    winxp.tar.bz2
6.2G    winxp.tar.gz

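For good measure, I also timed decompression. There was still a gain in speed, though not as significant as with compression. Note that the pbzip2 command below only decompresses to winxp.tar without unpacking it, whereas the tar command also extracts the files, so the comparison is not quite like for like: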

~$ time tar zxvf winxp.tar.gz
 
real    5m8.636s
user    1m20.061s
sys     0m20.413s
~$ time pbzip2 -d winxp.tar.bz2
real    4m32.329s
user    13m15.814s
sys     0m19.057s
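
Since pbzip2 -d on its own stops at the .tar file, a closer match to tar zxvf would be to decompress and unpack in one pass. A minimal sketch, where extract_archive is a hypothetical helper name and the timings above do not cover this variant:

```shell
# Sketch: decompress with pbzip2 and unpack with tar in a single pipeline.
# extract_archive is a made-up helper name.
extract_archive() {
    archive="$1"        # e.g. winxp.tar.bz2
    dest="${2:-.}"      # directory to unpack into, defaults to the current one

    # -d decompresses, -c writes to stdout so the original archive is kept.
    pbzip2 -dc "$archive" | tar -xf - -C "$dest"
}

# Usage:
# extract_archive winxp.tar.bz2 /some/target/dir
```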

Other limiting factors, such as disk read/write speed, no doubt play a part here. Playing around with different pbzip2 settings might also reveal greater performance boosts than this simple example, but even with the defaults it is now a welcome addition to my *nix toolkit.
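
One setting worth experimenting with is the thread count, which pbzip2 takes as -p followed by a number. The sketch below is an assumption about how such an experiment could look, not something I measured: time_threads is a made-up helper, and it only reports whole seconds, so it is meant for large inputs.

```shell
# Sketch: compress the same file with different thread counts and report
# the wall-clock time of each run. time_threads is a made-up helper name.
time_threads() {
    input="$1"
    shift
    for n in "$@"; do
        start=$(date +%s)
        # -p sets the number of processors to use; -c writes to stdout,
        # which is discarded here so nothing is left on disk.
        pbzip2 -p"$n" -c "$input" > /dev/null
        end=$(date +%s)
        echo "threads=$n seconds=$((end - start))"
    done
}

# Usage (winxp.tar is a placeholder for any large file):
# time_threads winxp.tar 1 2 4 8
```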

One-liner for downloading music files from a Podcast feed, with grep, sed and wget

Premise

I like to listen to music while I code, and usually it’s in the form of online radio or a podcast. I believe most portable music players have the ability to sync with podcast feeds nowadays, but I’m an old-fashioned guy and sometimes I like to have the music files readily available on my own hard drive, so I can move them about as I please.

Command

As an example, here is the feed from Tiësto’s Club Life, a podcast I listen to quite a lot:

http://www.radio538.nl/clublife/podcast.xml

Download it and then run this command:

  grep -E "http.*\.m4a" podcast.xml | sed "s/.*\(http.*\.m4a\).*/\1/" | xargs wget

This is a three-step command that does the following:

  • Extracts all lines that contain .m4a file URLs, using grep.
  • Strips all characters that are not part of the URLs, using sed.
  • Passes the resulting URLs to wget via xargs, which downloads them one after another.
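
To see what the first two steps do without downloading anything, they can be tried on a small sample. The feed fragment below is made up for illustration, and the example.com URLs are placeholders rather than real episodes.

```shell
# A made-up feed fragment; the example.com URLs are placeholders.
feed=$(mktemp)
cat > "$feed" <<'EOF'
<item><title>Episode 1</title><enclosure url="http://example.com/ep1.m4a" type="audio/x-m4a"/></item>
<item><title>No audio in this one</title></item>
<item><title>Episode 2</title><enclosure url="http://example.com/ep2.m4a" type="audio/x-m4a"/></item>
EOF

# Step 1: keep only the lines that mention an .m4a URL.
matching_lines=$(grep -E "http.*\.m4a" "$feed")

# Step 2: strip everything around the URL on each line.
urls=$(printf '%s\n' "$matching_lines" | sed "s/.*\(http.*\.m4a\).*/\1/")

# Step 3 would hand these to wget; here they are only printed.
printf '%s\n' "$urls"
```

This should print the two .m4a URLs, one per line, and skip the item without audio. Note that the sed expression keeps a single URL per line, so it relies on each enclosure sitting on its own line in the feed.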

Of course this is very rudimentary and could easily be turned into a more general-purpose script, for instance by setting the file type via a parameter, or even by avoiding having to download the feed file first. But it did the job at hand for me.
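
A rough sketch of what such a script could look like follows. The function names and the m4a default are made up for illustration, and it assumes wget plus a grep that supports -o (GNU and BSD grep do). It swaps the sed step for grep -o, which prints only the matching part of each line, and removes duplicate URLs with sort -u.

```shell
# Sketch of a more general version. feed_urls and download_feed are
# made-up names; assumes wget and a grep that supports -o.

# Read feed XML on stdin and print every URL ending in the given extension.
feed_urls() {
    ext="$1"
    # -o prints only the matched URL; sort -u drops repeated URLs.
    grep -oE "https?://[^\"'<> ]+\.${ext}" | sort -u
}

# Fetch the feed straight from its URL and download every matching file.
download_feed() {
    feed_url="$1"
    ext="${2:-m4a}"
    # wget -q -O - writes the feed to stdout instead of saving it to disk.
    wget -q -O - "$feed_url" | feed_urls "$ext" | xargs wget
}

# Usage:
# download_feed http://www.radio538.nl/clublife/podcast.xml m4a
```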

Feel free to adapt in any way you please.