Just an interesting tidbit I discovered while preparing my class on Retrieving and Evaluating Electronic Information (here’s my previous post on planning the class). Covering the topic of bias in search engines, and Google in particular, we talked about how PageRank introduces various biases into the type of information it makes available. As reading, I assigned the excellent honors thesis (pdf, via the Internet Archive) from 2005 by Stanford undergrad Alejandro M. Diaz. Alejandro’s (where are you now? leave a comment if you read this!) thesis is a straightforward, accessible (if not always “scientific”) account of the different biases that are reflected in Google and PageRank. A sample quote:
Our description of PageRank, like that put forth by its inventors, makes heavy but unqualified use of the term “important.” This is somewhat disconcerting since importance, like relevancy, is a highly subtle, ambiguous, and subjective thing… To the algorithm, being “important” simply means being “popular.”
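For readers who have not seen the mechanics behind that quote, here is a minimal sketch of the link-voting idea in Python. It is only an illustration under my own assumptions, not Google's actual implementation: the toy four-page web, the function name `pagerank`, the fixed 50 iterations, and the damping factor of 0.85 (the value usually cited from the original PageRank paper) are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by repeated redistribution of scores.

    links: dict mapping each page to the list of pages it links to.
    damping and iterations are illustrative assumptions, not values
    taken from Google's own description.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: B is linked to by A, C, and D; nobody links to D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy web the most linked-to page (B) comes out on top and the page nobody links to (D) comes out last. Nothing in the computation looks at what the pages actually say, which is the sense in which “important” reduces to “popular.”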
It is therefore interesting to see how Google itself has changed the way it talks about PageRank. Thanks to the Internet Archive, I give you a direct comparison of the text on the official Google “corporate tech” page, highlighted for your reading pleasure and emphasis:
PageRank performs an objective measurement of the importance of web pages by solving an equation of more than 500 million variables and 2 billion terms. Instead of counting direct links, PageRank interprets a link from Page A to Page B as a vote for Page B by Page A. PageRank then assesses a page’s importance by the number of votes it receives.
– Google, 2002 (via the Internet Archive)
PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.
In fact, as you can see in the Internet Archive history for the Google Corporate Technology page, the change in language was made as late as 2007; to be precise, sometime between April 6th and May 6th, 2007, the same period in which Google announced it was buying DoubleClick (I don’t know what this says, but conspiracy theorists are welcome to suggest ideas).
[update Dec 2nd 2010: see Matt Cutts’s comments about the content of this post here]