Author Archives: naaman

Google’s Technology Statement: Objective “Importance”?

Just an interesting tidbit of information I discovered when preparing my class on Retrieving and Evaluating Electronic Information (here’s my previous post on planning the class). Covering the topic of bias in search engines, and in particular Google, we talked about how PageRank introduces various biases in the type of information it makes available. I assigned as reading the excellent honors thesis (pdf, via the Internet Archive) from 2005 by Stanford undergrad Alejandro M. Diaz. Alejandro’s (where are you now? leave a comment if you read this!) thesis is a straightforward, accessible (if not always “scientific”) account of the different biases that are reflected in Google and PageRank. A sample quote:

Our description of PageRank, like that put forth by its inventors, makes heavy but unqualified use of the term “important.” This is somewhat disconcerting since importance, like relevancy, is a highly subtle, ambiguous, and subjective thing… To the algorithm, being “important” simply means being “popular.”

It is therefore interesting to see how Google itself changed the way they talk about PageRank. Thanks to the Internet Archive, I give you a direct comparison of the text on the official Google “corporate tech” page, highlighted for your reading pleasure and emphasis:

PageRank performs an objective measurement of the importance of web pages by solving an equation of more than 500 million variables and 2 billion terms. Instead of counting direct links, PageRank interprets a link from Page A to Page B as a vote for Page B by Page A. PageRank then assesses a page’s importance by the number of votes it receives.

Google, 2002 (via the Internet Archive)

PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.

Google, 2009

In fact, the change in language, as you can see in the Internet Archive history for the Google Corporate Technology page, was made as late as 2007; to be precise, sometime between April 6th and May 6th, 2007 – the same month Google announced it was buying DoubleClick (don’t know what this says, but conspiracy theorists are welcome to suggest ideas).
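For readers who want to see what "a link is a vote" looks like in practice, here is a minimal sketch of the textbook power-iteration formulation of PageRank. It is only an illustration, not Google's production system: the toy link graph, the damping factor of 0.85, and the iteration count are all assumptions. Note that in this formulation a vote counts for more when the voting page is itself highly ranked, which goes a bit beyond simply counting votes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration. `links` maps each page to the pages it links to.

    The damping factor and iteration count are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link passes on an equal share of this page's rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C collects the most "votes", D receives none.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page nobody links to ends up at the bottom, which is exactly the "important means popular" behavior the thesis quote describes.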

[update Dec 2nd 2010: see Matt Cutts’s comments about the content from this post here]

A Permanent Divide: Articulation of Social Networks?

Continuing my thoughts about the digital divide, following Paul DiMaggio’s visit. Mobile information access may be one technological element that will mark the divide going forward. But another element might create a social divide that will be very hard to bridge.

I’m talking about social networks, and in particular, the potential articulation of social networks online.

It’s not only the case that you can see who someone’s friends are, you can figure out how that person is connected to you in 1, 2, 3 or 10 links. Such information could be widely available on the web (already available in LinkedIn, and may become available on Facebook any time). This articulation of social ties may lead to even more favoritism and rich-get-richer in making friends and, say, hiring decisions.

We already know that homophily plays a big role in both offline (great overview from Miller McPherson et al.) and online social networks. With the disappearance of web anonymity (on Facebook, almost everyone is the real him or her) it is likely that these networks’ homophily trends will continue to dominate and even intensify.

What does this mean? Employers will have the choice not only to select employees who are friends or from the same alma mater; they will be able to see how distant an applicant is from them in the social network (and perhaps examine the applicant’s direct social network ties). This is not a digital divide: maybe the technically-weak are not part of online social networks at all yet (a fact that already hurts them), but even when they join the online world and join a social network, their potential immediate ties in the virtual world may actually hurt their social advancement because of that new social network selection bias.

And that’s before Facebook even starts “PageRanking” people.

Or, you can take a positive view and claim that if the span of a social network is not that wide (say, a low mean shortest path length between any two nodes in a connected social graph), such articulation of ties might actually be beneficial to “weaker” nodes in that graph, as it will help them connect to more “important” people. Which will it be?
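To make "how many links away is this person" and "mean shortest path" concrete, here is a small sketch that computes both with a breadth-first search. The friendship graph and the names in it are made up for illustration, and the code assumes an undirected, connected graph stored as an adjacency list.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: number of links from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mean_shortest_path(graph):
    """Average hop distance over all ordered pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for source in graph:
        for target, d in hop_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical friendship graph (each tie is listed in both directions).
friends = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "cho"],
    "cho": ["ana", "ben", "dee"],
    "dee": ["cho", "eli"],
    "eli": ["dee"],
}

print(hop_distances(friends, "ana")["eli"])  # how far is eli from ana?
print(mean_shortest_path(friends))
```

A peripheral node like "eli" sits several hops from most of the graph, which is the kind of distance an employer could read off an articulated network.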

The Next Digital Divide: Mobile?

Nice thing about being at a school that has a significant portion of social science and humanities academics: I now frequently hear speakers who go beyond technology, or at the very least look at it from a different angle.

Princeton sociology professor Paul DiMaggio’s presentation at Rutgers was exactly that. Titled Digital Inequality, his talk described research (done with his students) that attempted to answer important questions about the digital divide:

  • Does the digital divide exist? What differences exist between different groups, and did the magnitude of the difference change over time?
  • Does it matter? In other words, what is the impact of technology access on people’s economic and social status?
  • Will it continue, or is the gap likely to be closed at some point?

Paul described a series of interesting (and often innovative) quantitative analysis studies that answered these questions (yes, and slightly mitigated; yes, potentially significant; and no, it is likely to persist). Squeezing all these studies into one hour left many details out (I guess I could read the papers…) but made for a fascinating and informative talk.

Of course, as the digital divide refers to differences in access to “information technology”, the studies so far referred to PC/computer and (later) Internet access. That raised the question: what is the next barrier that will create a digital divide? In other words, after every child has a laptop and free Internet access (or something), where will the new divide be?

One possible answer is mobile devices and services – the mobile web. Cara Wallis told us last week that even low-income (local) immigrants in China invest more than they can afford in their mobile device, but nevertheless, low-income populations worldwide are still likely to be locked out of getting advanced mobile devices and access to expensive ($20-$30 a month in the US) mobile access plans.

So, I am pretty sure the digital divide exists and will deepen in the domain of mobile computing. However, what about Paul’s second question: does it matter? Does the fact that I can look up the nearest and most recommended Chinese restaurant, wherever I am, or listen to NPR stations from California on the NJ Transit, impact my economic or social status? My colleague (and chair of our Communication department) Jim Katz hints that at least social status can be gained.

Of course, an alternative viewpoint could say that mobile devices can be cheap and available enough to actually reduce the barriers and mitigate the digital divide. It’s true, a good portion of the population in developing countries holds a cell phone, but those phones have yet to do anything beyond text messaging and voice.

Next time on A& the socio-digital divide (Naaman’s additional factor in the future of the digital divide).

What is Social Media

Ian lured me out with this one, claiming that all media is now “social media”:

We’ve reached a tipping point. In my mind the lines between social media and other types of media are so blurred that it’s not even useful to distinguish the two, just drop the “social” because all media is now social.

As someone who shares a blog (with Ayman, no less) that has the term “social media” in the (sub)title – I thought I needed to provide my view on the matter. Well, here is my definition:

Social media: Online media published or shared by individuals and organizations, in an environment that encourages significant individual participation and that promotes curation, discussion and re-use.*

So, is everything “Social Media”? Not yet, I don’t think so. Let’s look at my definition above, which is closer to Stowe Boyd’s definition from Ian’s earlier post. There are several key words in the definition that explain my claim. The main one is “significant individual participation”. The NY Times article comments, for example, do not allow such participation. Yes, one can comment and discuss, but way below the fold and on a different page altogether. The contribution is not significant, even for the small crowd that makes the jump.

On the other hand, Twitter, for example, is an environment where individual participation is the main feature, and it falls comfortably into the heart of my definition above.

What about blogs? Well, it depends. Blogs of personal authors are by definition “social media”. But the more mainstream “blogs” (or, say, alternative news outlets) are not social media unless they give the viewers/readers/visitors a significant voice and participation in the conversation. Yes, it might not be easy to make the call for any specific blog. What do you think this is, mathematics?

Of course, as Ian notes, the CNN/Facebook inaugural “experience” is certainly “social media” even according to my definition. In fact, as I commented on Ayman’s previous post, the CNN-Facebook inaugural address was a game changer that will be marked as the moment that TV watching changed forever.

Just a few last notes about the definition above. Three key concepts there are “curation, discussion and re-use”, which describe the type of additional participation allowed. All three assume that those uses (e.g., tagging for a type of curation, commenting or trackbacking as discussion and referenced remix/embed/quote as re-use) are significant factors in social media, but the base criterion is always the significant individual participation.

Yes, Social Media is still Made of People.**

* Ayman helped define in a rare showing of, at once, comradery and patience!
** Damn it, we’re not the only ones that thought of this. Here’s the reference for those of you not raised on this particular culture’s trash.

NIN, Look here!

Ayman has disappeared so it seems like there’s nothing to stop Naaman from blabbering some more. And this time: yet-yet-another-another things-that-happened-to-be-on-my-browser-at-the-same-time. The difference is that they are even more related this time around…

First of all, congratulations to Mr. Kennedy! and me, Naaman! Yay for our recently-accepted WWW’09 paper, awesomely named “Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos” (Lyndon is a Rock Star). Just to make clear, that latter part about Lyndon is not part of the title, but it’s a fact nonetheless.

I will write more about the paper soon (and upload the paper as well), but here is part of the abstract:

We describe a system for synchronization and organization of user-contributed content from live music events. We start with a set of short video clips taken at a single event by multiple contributors, who were using a varied set of capture devices. Using audio fingerprints, we synchronize these clips such that overlapping clips can be displayed simultaneously. Furthermore, we use the timing and link structure generated by the synchronization algorithm to improve the findability and representation of the event content…

In other words, this work builds on social multimedia, those videos and photos that everybody now takes, and some share online, when they go to music shows, concerts, and any other public or private event. Like at an inauguration:

Capturing the moment
AP Photo/J. Scott Applewhite*

Yes, that’s the perfect photo to demonstrate the “Everybody capturing content” idea. And I just discovered it today.

In our paper, as the abstract hints, we show how we can take those captures (in this case, videos) and use their audio track to synchronize them, creating a much-improved presentation and also improving on the metadata and organization of the content. In short, a perfect technology for Nine Inch Nails for use in their newly-launched (or re-launched?) website devoted to content from fans, from their years of concerts, which I re-discovered today. 10 videos from each concert? In 5 years, you will have 100s. We’ll take care of them. How? I promised Sagee I will give him the details… coming soon.
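Until the details arrive, here is a toy sketch of the general idea of lining up two clips by their audio. This is not the method from the paper: it assumes each clip has already been reduced to one integer "fingerprint" code per audio frame (a hypothetical representation), and it simply tries every time offset and keeps the one where the most codes agree.

```python
def best_offset(reference, clip, min_overlap=3):
    """Find the shift that best aligns `clip` with `reference`.

    Both arguments are lists of per-frame fingerprint codes (an assumed,
    simplified representation). Returns (offset, matches), where clip[i]
    is aligned with reference[i + offset].
    """
    best = (0, -1)
    lowest = -(len(clip) - min_overlap)
    highest = len(reference) - min_overlap
    for offset in range(lowest, highest + 1):
        matches = 0
        for i, code in enumerate(clip):
            j = i + offset
            if 0 <= j < len(reference) and reference[j] == code:
                matches += 1
        if matches > best[1]:
            best = (offset, matches)
    return best


# Made-up fingerprints: the second clip starts 3 frames into the first,
# and one of its frames (99) is corrupted by crowd noise.
reference = [11, 42, 7, 93, 58, 16, 77, 30]
clip = [93, 58, 99, 77]

offset, matches = best_offset(reference, clip)
print(offset, matches)
```

Once every pair of overlapping clips has an offset like this, the clips can be placed on a shared timeline and played side by side.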

* Photo reproduced in thumbnail size to maintain fair use and will be removed on demand; via the awesome Big Picture.

Lessig, Times, Colbert

Here’s another post in the series “things that are related mostly because they were on Naaman’s browser tabs at the same time”.

First tab: I do not agree with Ben the Practicalist saying that Lessig did a good job when he was talking with Stephen Colbert about the hybrid information economy, aka “read/write culture” or “remix culture”. I personally think that Lessig’s strategy of handling Colbert’s musing was not effective; the major points did not come through. Still, maybe more people will actually buy the book (I will).

So, which other tabs were open on my browser? As is often the case, tabs that reflected the move to the new information economy:

  • A 1989 article from the New York Times about Ehud Banai was referenced in a documentary I was watching last week. Ehud is my favorite Israeli artist, and I looked up the article (a review of one of Ehud’s early shows on his first US tour) when I got back home. The amazing thing here is the Times opening their full archive for free access. Probably one of the most comprehensive collections of information, available for remix (and linking) for free. The Times gets it (still at the Times, Nick dug up a couple of really old articles – I mean really old – for a story about the death of print for the Radar).
  • The excellent Hype Machine, already aggregating mp3 content from blogs around the world, has an also-excellent Top 10 album list – including a full album listen for each. Whoever thinks this is bad for music and for the bands didn’t read Lessig recently.

What am I saying? That we are already in a Lessig economy. More socialist, as Ben hopes? I don’t know yet.

Naaman Editing Wikipedia

As part of my class I am going to have my students edit a Wikipedia page of their choice. In preparation, I decided to do so myself – for the first time ever.

Why did I deserve this, then?



It happened when I tried to preview my first edit ever (no, it was not Ayman’s Wikipedia entry). I am not sure how many of my students will have the nerve to handle this kind of error message! (OK, I did go back and click “preview” again, and then it worked).

Otherwise, editing was kind of ok. Took me a little bit to get the linking/link text model (simple) and understand the references format – both of which will not be trivial to a non-CS or inexperienced person, I am afraid. Well, let’s see how my students end up doing (and which edits they choose to do!)[1].

[1] I would have told you mine but I’d like to keep my editing persona private for now. Or can someone figure it out otherwise?

Teaching retrieval of information: What do you leave out?

Hello, academia! One of Naaman’s new responsibilities is, of course, teaching. In the long run, the teaching load will be comprised of courses that are driven by my own research interests (Social Media class? Mobile Information?), as well as core courses in information science. I had tons of fun teaching Research Methods to a great group of MLIS students last semester. This spring, I will be teaching Retrieving And Evaluating Electronic Information to undergrads:

“In this course, students examine and analyze the information retrieval process in order to more effectively conduct electronic searches, assess search results, and use information for informed decision making. Major topics include search engine technology, human information behavior, evaluation of information quality, and economic and cultural factors that affect the availability and reliability of electronic information.”

Now there’s a topic that could have launched a thousand PhD theses… how do I pack it into one semester? As I see it, the class should be a combination of “how to” and “how it works”: both understanding how the technologies work and how to use them best (these are of course interrelated).

For now, I have the class set up in the following way (with thanks to Marie, Nina and Nick who taught this class before me):

First I will spend time discussing the basics of how to search. Starting from the very basic how to choose/iterate on keywords, through Boolean operators and advanced search functions. Then, I will spend a few sessions talking about search technology, or how search engines work (you know, crawling/indexing/ranking). I believe that everyone should have an understanding of how search works in order to realize the bias and limitations inherent in the process. In the middle I will discuss the presentation of search results as well as the topic of browsers.
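As a taste of the "how it works" side, here is a tiny sketch of the indexing step and of Boolean AND/OR/NOT queries on top of it. It is a classroom-sized toy, not how any real search engine is built: the three documents are made up, the function names are hypothetical, and tokenization is just lowercasing and splitting on whitespace.

```python
def build_index(docs):
    """Build an inverted index: each word maps to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search_and(index, *terms):
    """Documents that contain every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Documents that contain at least one term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


def search_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


# Made-up mini collection.
docs = {
    1: "google pagerank bias",
    2: "mobile digital divide",
    3: "digital bias in search",
}
index = build_index(docs)

print(search_and(index, "digital", "bias"))
print(search_or(index, "google", "mobile"))
print(search_and(index, "digital") & search_not(index, docs, "bias"))
```

Even at this scale the limitations show: a document that says "biased" instead of "bias" is invisible to the query, which is one way to start a discussion about what a search engine does and does not find.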

All this will take me 8-9 sessions (1.15 hours each) out of a total of about 25.

Then, beyond the generic search, there are other sources for retrieving information, even on the web. Directories, reference sites (e.g., dictionaries but many more) and business databases. Of course, specialized databases like, say, academic libraries and other digital libraries play a major role in this world, especially for university undergrads.

This concludes the very basic “what you need to know” about retrieving information. And about half the class sessions. But we’re only getting warmed up. Here is a dump of additional topics I am planning to cover: news, breaking information and tracking topics (alerts and feeds); Web reference tools (from Wikipedia to Yahoo Answers), which of course leads to the topic of information reliability; publishing information on the Web; economics of information (here’s another topic that can last a few semesters); legal aspects of information use (e.g., copyright issues, Creative Commons); bookmarking and knowledge collections; social media and blogs; and multimedia search (of course).

This, together with student presentations and exams, will pretty much conclude the class. But there’s so much else one can cover… here’s what I left out for now: ethical and cultural aspects of information; information overload; mobile information retrieval; the semantic web (ok, “semantics on the web” may be a better title); personal information management (e.g., Stuff I’ve Seen); non-text retrieval (e.g. location-driven information); the hidden web; Web of Data; phew!

Now, our undergraduate program at SCILS offers classes that touch in depth on many of these issues. But can I possibly leave these topics out from any basic retrieval of information class? Doesn’t everyone need to know about these? Is there anything else I left out that must be covered?

Whatever form the class takes, I am excited. Mostly, I am curious to see what undergrads these days know about search, and how their perceptions can change in 14 short weeks.

Thanksgiving Post: Naaman’s Academic Ancestors

I am told that Thanksgiving is not just about eating (I’ve grown to really like Thanksgiving food) but also football. And family competitions. Oh, and (at least American) ancestry. Which is funny, because this week I discovered that my academic ancestry can be traced back to the late 1600s or even earlier.

Yes, I am an academic now, I am allowed to be interested in these things.

My academic grandfather, Gio Wiederhold, has been maintaining an (up-and-down) tree on his page, meaning we can examine his academic descendants as well as his ancestors. Recently, Panos (who is behind the enemy lines) also published a slightly better-formatted version of the tree, going back in time (no descendants yet). Panos is an academic “nephew” of mine – did I just make up a new concept? – his academic grandfather was my advisor, which means we share most of our tree…

What do we learn from the tree? Horrible truths. Well, at least one: turns out my lineage includes a relatively recent component of Ayman’s own school! Carl Porter Duncan (4 generations back) was a psychology professor at Northwestern. Duncan was the PhD advisor of John Amsden Starkweather (three generations), who was Gio’s advisor and the first in my lineage to work on “computer science” topics (at the Department of Psychiatry in UCSF).

It’s not surprising that my roots are in psychology given that computer science did not exist as a field before, say, the 1950s (information science in some forms goes way back, but modern information departments are relatively new). My psychology roots go far back, and include the first psychology professor in the US (8 generations) and one of the “fathers of modern psychology” (9 generations). It gets murkier from there, but basically a bunch of physicians show up, mostly in Leipzig, and finally we run into Otto Mencke (18 generations).

And, other than my esteemed direct ancestors, in the expanded family tree you will also find Carl Friedrich Gauss and David Hilbert, if you are willing to go way back. Which does not reflect on Naaman at all, unfortunately, but I still think it is cool.

Who’s in your tree, Ayman?

Otto Mencke – any similarity?

On Persuasion

Naaman is being persuaded – and even more excitingly, serving as a bridge for persuasion! BJ Fogg and gang have long pointed out the potential of Facebook (and social media) as a persuasive platform. The Causes application on Facebook has been using persuasive elements from the start, and has a significant following (20,000,000 monthly active users according to Facebook stats). Causes is cleverly using the social influence potential of Facebook to draw people into supporting various nonprofits and similar efforts (critics and cynics would say “to satisfy one’s conscience while sitting comfortably at a computer screen”). I don’t know when they started to use the birthday information on Facebook, but that’s smart, too: the birthday is arguably the one day a year I have any kind of influence over my friends, if any… right Ayman? Here’s what I got:

Happy (Almost) Birthday!

Thanks to Facebook, in two weeks all of your friends will see that it’s your birthday. Instead of just writing on your wall, or giving you something you don’t need, what if they had a chance to help a cause you believe in? Whether you want to raise money for clean water in Ethiopia, vaccinations for children in Haiti, or a safe home for a puppy in Mississippi, with a Birthday Cause your friends can give in honor of your special day.

Select your Birthday Cause today: Get Started | Learn More

Have a very happy birthday,
The Causes Team

That’s pretty smart. What should I choose? And yes, my birthday is coming up!

In other persuasion news, using YouTube this time (thanks Sagee), Monty Python wants to both free their content and get you to pay for it. Pretty cool, assuming they will make all content available and will not fight fans who upload content that the MPs do not want on the site.