TAKKARIA.org

dmoz.org Page Fetcher

When working on the C HTML5 parser Hubbub, I wanted a corpus of test data to check that it didn't crash. Inspired by Philip Taylor's efforts, I cooked up a simple system for fetching dmoz.org's database of URLs, choosing a random 1% of them, and then fetching that 1%.
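
The "random 1%" step can be sketched in a few lines of C. This is only an illustration, not the code from the tarball: the function name `sample_lines`, the one-URL-per-line input format and the use of `rand()` are all assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: copy roughly `percent`% of the lines of `in` to `out`.
 * Each line is kept or dropped independently, so the result is only
 * approximately `percent`% of the input. Returns the number of lines kept. */
long sample_lines(FILE *in, FILE *out, unsigned percent, unsigned seed)
{
    long kept = 0;
    int keep = 0;
    int at_line_start = 1;
    int c;

    assert(percent <= 100);
    srand(seed);

    while ((c = fgetc(in)) != EOF) {
        if (at_line_start) {
            /* Decide once per line; rand() % 100 has a slight modulo
             * bias, which does not matter for a rough sample. */
            keep = (unsigned)(rand() % 100) < percent;
            if (keep)
                kept++;
            at_line_start = 0;
        }
        if (keep)
            fputc(c, out);
        if (c == '\n')
            at_line_start = 1;
    }
    return kept;
}
```

Calling `sample_lines(stdin, stdout, 1, (unsigned)time(NULL))` (with `<time.h>` included) would turn a full URL list on standard input into a roughly 1% sample on standard output.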

The system is a tad unreliable and relies on you having gcc and make, but it works OK. Instructions are included in the innocuously-named "README" file; the system is fairly simple and should be understandable to anyone who knows a smattering of C and make. It relies upon "make -j" to have any decent speed: the list of URLs to download is split up into a few hundred different files, and each file can be treated as a separate make job, so "make -j 2" will set off two fetches in parallel of a hundred files each.
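
To make the "one file per make job" idea concrete, here is a rough sketch of splitting a URL list into numbered chunk files. It is not the splitter from the tarball: the function name `split_into_chunks`, the `<prefix>NNNN.txt` naming scheme and the chunk size are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: split the lines of `in` into files named
 * "<prefix>0000.txt", "<prefix>0001.txt", ... with at most `per_chunk`
 * lines in each. Returns the number of chunk files written, or -1 on error. */
int split_into_chunks(FILE *in, const char *prefix, long per_chunk)
{
    char name[512];
    FILE *out = NULL;
    long lines_in_chunk = 0;
    int chunks = 0;
    int c;

    assert(prefix != NULL);
    if (per_chunk <= 0)
        return -1;

    while ((c = fgetc(in)) != EOF) {
        if (out == NULL) {
            /* Open the next chunk lazily, so no empty file is created. */
            int n = snprintf(name, sizeof name, "%s%04d.txt", prefix, chunks);
            if (n < 0 || (size_t)n >= sizeof name)
                return -1;
            out = fopen(name, "w");
            if (out == NULL)
                return -1;
            chunks++;
            lines_in_chunk = 0;
        }
        fputc(c, out);
        if (c == '\n' && ++lines_in_chunk == per_chunk) {
            fclose(out);
            out = NULL;
        }
    }
    if (out != NULL)
        fclose(out);
    return chunks;
}
```

With one chunk file per target, a pattern rule in the Makefile can fetch each chunk independently, which is what lets "make -j" run several fetches at once.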

Sometimes it seems to hang whilst fetching; in that case, press Ctrl-C and restart it. This might mean a few more downloads than are strictly necessary, and if you work out why it happens, please tell me. :) Pages are fetched into a "pages" directory, which could end up with a large number of directory entries.

It's available as a 4.5kB tarball.