wiiki - WikiMines

NickKnowles and I spend a few days looking at how wiki pages are linked. We wrote a small object model for pages and the links between them. This was coded in SqueakSmalltalk and populated from a file we first extracted from the wiki database. Once in memory we could write ad hoc queries against this model and expect them to be answered in seconds. Here are a few of our initial observations ...

RonJeffries has signed the most posts by a factor of three.
ExtremeProgramming is the most talked about subject, but still a small proportion of wiki. (I get "patterns" as having the most pages, see below. -- AlistairCockburn)
It may be possible to identify topics by studying link structure alone.
It's a good idea to ignore signatures unless they are what you are studying (a corollary - it is helpful if wiki authors adhere to the -- convention so that signature links can be identified).
For any given page, the most frequently cited page is usually itself.

I know none of this comes as a surprise. For that reason I invite you to try your hand at mining wiki to see what you can discover. Here is a zip of the text file we used to load our application ...

http://c2.com/wiki/links.zip (215k)

When unzipped this produces a single file, links.txt, that contains one line for each wiki page. The first word of the line is the page name. Subsequent words are links leading from the page. Additional punctuation indicate the presence of horizontal rules and signature indications. Here is a sample line for a particularly short page ...

ColorOutsideTheLines --RaySchneider | KnowTheRule --RonJeffries

Please have a look at this data and tell us what you find. -- WardCunningham

The above zip file has not been updated since June 1999. For current links info, use http://c2.com/wiki/links.txt, which is updated nightly.

Jeff Grigg's quick analysis of the links text from above:

559 of 4970 pages (11%) have "Xp" or "Extreme" in title or content references.

(This includes a few "false positives" like ActiveXpert and XpFreeZone, but looks pretty close to being "right.") On second thought, I'm pretty sure it underestimates XP influence, as there are a number of pages, like "DoWhatYouSaySayWhatYouDo" which contain references to XP concepts (CodeUnitTestFirst in this case), but do not contain "Extreme" or "Xp". Tracking these down would be tedious.

498 pages with "Extreme," 158 with "Xp" (= 17% overlap).

Analysis done with Vim, a more featureful "vi" clone with a Windows version.

I found 760 lines containing the word "patterns" out of the 4970, so I nominate that patterns still has more pages than XP. (Analysis done with MS Word search and replace facility and line count tool.) -- AlistairCockburn

I'll have to look for the cite, but there was an article in CACM a few months ago about using link structure to mine the web. -- MichaelFeathers

See also WikiReadingHabits for more insight on Wiki usage.

The technology behind Netscape's "what's related" button ...

Alexa maintains a multi-terabyte central database that stores both the behavioral patterns of Web travelers and categorical information about the actual content on the Web, the latter of which surfaces Site Statistics about any page on the Web. Related Links are generated using sophisticated data mining techniques along with intelligent technologies, both of which identify usage patterns and the relationships between the pages, based on common hypertext links and similarities in textual content.

The technology in alexa & the bots is ingenious and powerful, but mining the Web in the large is a different (and harder) problem than mining a wiki.

Wiki ought to have certain advantages when it comes to analysing the links and the link traffic for semantic insights. For example

Each page typically corresponds to one main concept, which has been carefully named
A wiki covers a restricted subject domain and is relatively small
There are very few links that are not meaningful (in contrast to a typical web site that has a lot of glue pages that don't say anything much)
Certain web pages (eg users) and links (eg signatures/attributions) can be typed and the type taken into consideration when analysing.

My impression is that it would be quite possible to have indexing bots operating over a wiki that would be able to cluster the the pages into useful RoadMap like indexes pretty well automatically. -- NickKnowles

True, but some of the ideas might still be useful in this domain, or at least suggestive of other strategies. Here's a nice article from last month's Scientific American that contains some very good ideas: http://www.sciam.com/1999/0699issue/0699raghavan.html. -- GlennVanderburg

Further reading:

CategoryCategory (and topics) - for manual efforts to organize Wiki pages.
EverythingTwo -- adds links between pages based on people's navigation. (The site is kinda' like Usenet or Wiki, only not as good.)

A wiki seems like a very appropriate thing to run through a WebSom (see http://websom.hut.fi/websom/) because all the pages are in a fairly narrow universe of discourse, mostly in the same language, and consist almost entirely of relevant text. The websom doesn't take into consideration the link structure, though, and something tells me that's crucial to capturing information about a wiki. Also, some of the WikiClones that store (or can recover) meta-information, like who edited what and when, could look for "dialog" patterns, ie. Tom always responds to Mary's comments. -- SeanScoggins

Wiki stats for May 27, 2000

 Num pages: 9622
 Size sum: 19,904,634 bytes
 Mean size: 2069 bytes

 3986 different page sizes.
 Min size: 1 byte (6 pages this size)
 Max size: 150827 bytes (1 page)
 Median size: 767

 Size distribution
  %   size
  1     16
  2     21
  3     25
  4     29
  5     34
 10     67
 15    111
 20    166
 25    231
 30    310
 35    392
 40    502
 45    618
 50    767
 55    965
 60   1190
 65   1494
 70   1854
 75   2324
 80   2989
 85   3917
 90   5295
 95   7878
 96   8846
 97  10720
 98  13385
 99  17842

 16 largest pages:
  #          size
  9607    30000
  9608    30038
  9609    30316
  9610    30350
  9611    31059
  9612    32288
  9613    33565
  9614    35564
  9615    36237
  9616    36353
  9617    36780
  9618    46070
  9619    88372
  9620   106063
  9621   122727
  9622   150827

Ahh, yes! Now I get it. It's all about digging Wiki <ahem>, as opposed to planting explosives here. Okay.

See: WikiWordStatistics

CategoryWiki, CategoryWikiStructure