Most social scientists shy away from command-line tools. They live in a world of drag & drop, just hoping for the right tool to get their dirty work done. Aside from the unfortunate fact that data is hard to come by in the first place, it is also far too messy (and far too interesting, I must add) to be boiled and cooked with your standard all-in-one tool. Just consider the unlimited variety of weblogs and wikis. Yes, they all share a basic structure: reverse-chronological postings in the case of weblogs, for example. But all in all, a quantitative analysis of these websites takes a considerable amount of coding, that is, dirty work. The tools that claim to do the analysis all by themselves are a little bit like fast food: there is only a limited variety and everything tastes the same.
Now, if you live on the lighter side of life, you might enjoy a good home-cooked meal once in a while, don’t you? The same goes for work: why settle for the simple statistics you get with standard software when there is so much more to science right under the hood of all those cluttered windows? All you need to do is put a little faith in the command line and you’re ready to go. Just follow the three-step instructions below; it’s as easy as making a ham sandwich.
First, there are the ingredients. The shopping list below is just a suggestion, of course. If you look hard enough, you may well find a good or even better substitute for any of the following tools. A note to the cook: the list assumes that your stove is a Mac or a Linux machine, although most of the tools work reasonably well on Windows.
With all the ingredients on the table or desktop, respectively, take a look at wget. “GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.” The basic usage is
> wget [option] [URL]
The real beauty of wget is its ability to crawl an entire site and thus mirror it locally. Just turn on the -r option for recursive download and you’ll get all pages off a site without even knowing their URLs. But be careful to restrict the host to get data from, otherwise wget follows all links and you get a lot (and I mean, a LOT) more traffic than you want. So in order to get, say, the top rated Bamblog, you need to call
> wget -r -H -D bamblog.de http://www.bamblog.de
It may take several minutes to get the entire site, depending on the site’s size and the speed of the server it is hosted on. If you’re only interested in text analysis, there is an option to exclude pictures and other media from the download. See the wget manual for details.
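As a rough sketch of the text-only variant: wget’s -R (reject) option takes a comma-separated list of file name suffixes to skip. The suffix list below is my own guess at what counts as media on a typical weblog, so adjust it to taste. The snippet only prints the command; run it once the printed line looks right.

```shell
# Assumption: these suffixes cover the media you want to leave out.
host="bamblog.de"
reject="jpg,jpeg,png,gif,mp3"

# Same call as before, plus -R to reject files with the listed suffixes.
cmd="wget -r -H -D $host -R $reject http://www.$host"

# Print the command for inspection instead of starting the download.
echo "$cmd"

# To actually start the download, uncomment the next line:
# $cmd
```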
With wget’s job done, you get a local copy of the entire site in a single folder. Inside that folder you’ll likely find several hundred files, most of which you are not even interested in. With an eye on the content of the site, you want to look for those HTML files that contain the actual postings of a weblog, the pages of a wiki, or the like. For example, WordPress by default numbers all postings and puts them in an archive folder. Most likely, those files are the ones you want to analyze.
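One way to find the folder that holds the postings is to count the HTML files per folder and look at the top of the list. The helper below is a small sketch, not part of any of the tools mentioned here; the name html_per_folder is made up, and the folder name in the example call assumes wget saved the mirror under the host name.

```shell
# html_per_folder DIR
# Print every folder below DIR that contains HTML files, preceded by the
# number of HTML files in it, largest count first.
html_per_folder() {
    find "$1" -type f -name '*.html' |
        sed 's|/[^/]*$||' |   # strip the file name, keep the folder
        sort |
        uniq -c |             # count identical folder names
        sort -rn              # biggest folders first
}

# Example call (folder name is an assumption):
# html_per_folder www.bamblog.de | head
```

The folder at the top of the list is usually the archive you are after.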
Before the actual analysis, though, you need to boil your data for some time, that is, turn the raw data into something your CAQDAS can manage. This is where textutil or DocFrac comes in. “textutil can be used to manipulate text files of various formats, using the mechanisms provided by the Cocoa text system.” In effect, it converts all HTML files into RTFs, which most CAQDAS can process. On the command line, you pass all files with a .html extension to textutil to convert to .rtf with
> find . -name \*.html -print0 | xargs -0 textutil -convert rtf
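After the conversion it is worth checking that nothing was left behind. The sketch below assumes that each RTF was written next to its HTML source under the same base name, which is what I would expect from the call above; the function name missing_rtf is made up for this example.

```shell
# missing_rtf DIR
# Print every HTML file below DIR that has no .rtf file of the same
# base name sitting next to it.
missing_rtf() {
    find "$1" -type f -name '*.html' | sort | while IFS= read -r html; do
        rtf="${html%.html}.rtf"            # page.html -> page.rtf
        [ -f "$rtf" ] || printf '%s\n' "$html"
    done
}

# Example call: list unconverted pages in the current folder
# missing_rtf .
```

An empty result means every page has its RTF counterpart.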
Now all that’s left to do is to load the RTFs into TAMS, Weft, or any other tool and analyze away, that is, code the content and see what the data holds. In the case of Jan Schmidt’s Bamblog, you currently get a little more than 600 postings including all comments. That’s a lot of coding to do and you may wish to get some of it done in routine fashion. Well, there is help, at least for the site’s basic structure. Unfortunately, this is also where it gets a little messier. A simple command-line call of any of the above tools is not enough; you need to familiarize yourself with regular expressions. They let you automatically code the structure of the RTF file any which way you need it. For example, postings always start with the title and the date they are posted on. If you search for this particular pattern, say, August 13, 2007 with
> (Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember) ([0-9]+, 200[0-9])
and replace it with
> \'7bdate\'7d$1 $2\'7b\'2fdate\'7d
then you’ll get your RTF to look something like
> {date}August 13, 2007{/date}.
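The same search and replace can be run from the command line with sed, which saves you from opening 600 files in an editor. This is only a sketch: it assumes the dates sit on one line in the RTF source and that your sed understands -E (both GNU and BSD sed do). The function name tag_dates is made up. The substitution lives in a small script file because the RTF escapes contain single quotes, which are awkward to type inside a shell string. In that script, \\ stands for one literal backslash and \1 and \2 play the role of $1 and $2.

```shell
# tag_dates FILE
# Print FILE with every date wrapped in RTF-escaped {date}...{/date} codes.
tag_dates() {
    script=$(mktemp)
    # Quoted here-document: the sed script is written out exactly as typed.
    cat > "$script" <<'EOF'
s#(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember) ([0-9]+, 200[0-9])#\\'7bdate\\'7d\1 \2\\'7b/date\\'7d#g
EOF
    sed -E -f "$script" "$1"
    rm -f "$script"
}

# Example call: write a coded copy of one posting
# tag_dates posting.rtf > posting-coded.rtf
```

Writing to a new file rather than overwriting the original keeps the raw data intact in case the pattern needs another round of tuning.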
Do this for any recurring pattern with the code you need and you’ll be able to analyze your data in no time. For this last example, I simply counted all comments (i.e., all occurrences of my code) on each one of the 600+ pages.
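Counting a code per page can also be done on the command line. The sketch below assumes the codes are stored in the RTF source in the escaped form used above (\'7b for the opening brace, \'7d for the closing one). The code name comment in the example call is hypothetical, so put in whatever name you used; count_code is likewise a made-up helper name.

```shell
# count_code CODE DIR
# For every RTF file below DIR, print the file name and the number of
# opening tags of CODE it contains, separated by a tab.
count_code() {
    find "$2" -type f -name '*.rtf' | sort | while IFS= read -r f; do
        # -F: fixed string, -o: one output line per match
        n=$(grep -o -F "\\'7b$1\\'7d" "$f" | wc -l)
        printf '%s\t%s\n' "$f" "$((n))"
    done
}

# Example call (code name is hypothetical):
# count_code comment . > comments-per-page.tsv
```

Only the opening tag is matched, so each coded passage counts once. The resulting tab-separated file can go straight into the statistics or plotting tool of your choice.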

All in all, the cooking time was a little more than two hours from the initial download of the data to the plot above. The largest part of the preparation is figuring out the structure of the data and thus what to search for and what to replace it with. However, once you’ve done that, the recipe may fit other data sources as well, since weblogs frequently use WordPress and wikis based on MediaWiki number in the thousands. The sky is the limit.
Bon appétit.