20 October 2010
In an effort to collect some real-world data with which to optimize my text differencing algorithms, I wrote a scraper to download data from Wikipedia. It picks a random article, then downloads the current article and its immediately preceding version. Running this scraper over a weekend resulted in 60,000 articles. Over the coming year I'll be crunching this data carefully to produce more accurate, more efficient diff algorithms.
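The fetch step described above (pick a random article, then grab its current and immediately preceding revisions) can be sketched against the public MediaWiki API. This is not the scraper used for this data set; it is an illustrative sketch that assumes the English Wikipedia endpoint and the `formatversion=2` JSON layout. The function names and the User-Agent string are made up for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public English Wikipedia API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def random_title_params():
    """Query parameters asking for one random article in the main namespace."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "list": "random",
        "rnnamespace": "0",
        "rnlimit": "1",
    }


def revision_params(title):
    """Query parameters asking for the two newest revisions of a page."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|content",
        "rvslots": "main",
        "rvlimit": "2",
    }


def extract_versions(response):
    """Return (current, previous) wiki source from a revisions response.

    Revisions arrive newest first; previous is None for a page
    that has only ever had one revision.
    """
    page = response["query"]["pages"][0]
    texts = [rev["slots"]["main"]["content"] for rev in page.get("revisions", [])]
    current = texts[0] if texts else None
    previous = texts[1] if len(texts) > 1 else None
    return current, previous


def api_get(params):
    """Perform one GET request against the API and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace with one that identifies your own tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "diff-corpus-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def fetch_random_pair():
    """Fetch (title, current_text, previous_text) for one random article."""
    title = api_get(random_title_params())["query"]["random"][0]["title"]
    current, previous = extract_versions(api_get(revision_params(title)))
    return title, current, previous
```

Calling `fetch_random_pair()` in a loop, with a polite delay between requests, would build up a corpus of version pairs; pages with a single revision come back with `previous` set to `None` and can be skipped.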
In the meantime, some fun high-level statistics jumped out of the file sizes. Obviously Wikipedia articles come in different sizes. The largest article my scraper found was List of townlands of County Galway at a whopping 395,244 bytes (as measured by the wiki source, not the HTML source). The smallest article my scraper found was Defamer at a measly 103 bytes. The median article in terms of size is Wombling Free at 2,536 bytes. Somehow it seems entirely appropriate that the most representative article in Wikipedia is devoted to the Wombles (if you aren't British and have no idea what a Womble is, take a look).
The histogram for article sizes is very smooth:
Another interesting set of statistics relates to the size of the edits people make to articles. The largest insertion of text my scraper found was to Swimming at the Pan American Games, where the article grew by 44,301 bytes as a result of an editor undoing the (unfortunately necessary) damage caused by a robot. The largest deletion of text my scraper found was to Orangeburg Preparatory Schools, Inc., where the article shrank by 672,734 bytes as a result of an editor promptly removing vandalism. The median edit in terms of net size is typified by Maher, which grew by 14 bytes with the addition of "United States|" by a robot.
The histogram for edit sizes is best viewed on a logarithmic scale:
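For anyone repeating the measurement, here is a minimal sketch of how net edit sizes could be computed and binned on a logarithmic scale. It assumes sizes are counted as UTF-8 bytes of the wiki source and that buckets are signed powers of ten; both choices, and all the function names, are illustrative assumptions rather than a description of how the histogram here was produced.

```python
from collections import Counter
from statistics import median


def byte_size(text):
    """Size of the wiki source in bytes (assumes UTF-8 encoding)."""
    return len(text.encode("utf-8"))


def net_edit_size(previous, current):
    """Net growth in bytes; negative when the article shrank."""
    return byte_size(current) - byte_size(previous)


def log_bucket(delta):
    """Signed power-of-ten bucket for a net edit size.

    0 maps to 0, 1..9 to 1, 10..99 to 2, and so on; shrinking edits
    get the same magnitude with a negative sign. Counting decimal
    digits avoids floating-point trouble at exact powers of ten.
    """
    if delta == 0:
        return 0
    magnitude = len(str(abs(delta)))
    return magnitude if delta > 0 else -magnitude


def edit_histogram(pairs):
    """Count edits per log bucket for (previous, current) text pairs."""
    return Counter(log_bucket(net_edit_size(prev, cur)) for prev, cur in pairs)


def median_edit_size(pairs):
    """Median net edit size in bytes across all pairs."""
    return median(net_edit_size(prev, cur) for prev, cur in pairs)
```

Keeping the sign on the bucket puts insertions and deletions on opposite sides of zero, so a single chart can show both tails.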
If you want to collect your own data from Wikipedia, try their API first. If that fails (as it did for me), take a look at their download site, and if those archives are too huge to process, my scraper is a good starting point. My collection holds two versions of each of 60,000 articles; each set of versions weighs in at 105 MB and is available here:
Our office building is undergoing renovations. Every evening the construction workers arrive and work through the night, thus leaving us in peace during the day. One morning they left after half installing an electrical outlet. As a warning, they wrote "Hot" on the wall. When they came back the next evening they found a much more graphic warning that a coworker and I created on the floor in front of the outlet.