
Online again

This blog, this Ayman Naaman Show, was started by two young researchers as a way to collaborate and post in a shared thought space. Over the years, our activity waned and moved to likes and retweets on various other social media platforms. Eventually, a bad config halted the WordPress activity till one of us had the time to fix it. Well, it’s fixed now (thank you Dreamhost support). Feel relieved, Naaman? There’s some backlog of content…maybe we can uncage it?

Radio Silence Over: Updates, Mahaya, TimeSpace, Moscow

Ayman has been on my case, and for a good reason this time. We kind of neglected you, good readers of our blog. It’s been a long and winding few years. We both fully intend to write more but for now, here’s a quick update from the Naaman half. And it’s exciting (at least for me).

The quick update, for those who don’t know, is that I have co-founded a company called Mahaya which aims to organize the world’s memories: make sense of the world’s stories and events as they are shared on social media. We are currently beta testing a new product called Seen, which automatically makes it fun and simple to see what happened anywhere. This week, The New York Times announced that Mahaya will be one of the three companies in the inaugural run of the TimeSpace program (whoever named it should receive the Pulitzer).

In related news, next week in Moscow, I will be giving a keynote at ECIR 2013, talking about how the work we have done in the last 8 or so years has informed the vision (and technology) for Mahaya. Here’s the motivation for the talk below. I will try to post the full notes after I give the talk (Ayman, keep me honest here).

Time for Events.

In the last 8 years, my work and research have focused on the ways in which social media reflects and interacts with “the real world”, by which I mean actual occurrences, atoms clashing, people performing acts that are tied to a specific location and, often, to a time.

2005 was the onset of location-based social media as we know it. Flickr got popular (and got acquired by Yahoo). In 2006, Flickr formally introduced geotagging by supporting geo-metadata and providing a map interface; they thus created an easy way for people to associate location data with content, at scale. Almost immediately, we had… lots of dots on a map! Surely, we thought, these dots can tell us more about the world than where photos were taken. Can they tell us *what* are the most interesting places/landmarks, instead?

Tag Maps was our attempt at Yahoo! Research Berkeley to do that. For any world region, any zoom level, we extracted (using fairly simple IR tools), the most salient and important topics for that area; we built an interactive prototype that exposed this information, a video of which you can find here (see if you can spot Yoda!). We realized (read more here) that one could extract a strong signal about the real world, about people, their geographic activities and interests from social media data.

Tag Maps / World Explorer Demo from Mor on Vimeo.
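
The post only says that "fairly simple IR tools" were used, so the following is a minimal sketch under assumptions, not the actual Tag Maps algorithm. It assumes a fixed latitude/longitude grid and a TF-IDF-style score: a tag is salient for a cell when many photos in that cell carry it and few other cells do. The function name `salient_tags`, the grid, and the sample photos are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def salient_tags(photos, cell_size=1.0, top_k=3):
    """Rank tags per grid cell with a TF-IDF-style score.

    photos: iterable of (lat, lon, [tags]); cell_size is in degrees.
    """
    # Count, per cell, how many photos carry each tag.
    cell_counts = defaultdict(Counter)
    for lat, lon, tags in photos:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        cell_counts[cell].update(set(tags))

    # Number of cells in which each tag appears at all.
    cells_with_tag = Counter()
    for counts in cell_counts.values():
        cells_with_tag.update(counts.keys())

    n_cells = len(cell_counts)
    result = {}
    for cell, counts in cell_counts.items():
        scores = {
            tag: freq * math.log((1 + n_cells) / cells_with_tag[tag])
            for tag, freq in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        result[cell] = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
    return result


# Made-up sample: three photos around Paris, two around San Francisco.
photos = [
    (48.858, 2.294, ["eiffel", "paris", "photo"]),
    (48.860, 2.337, ["louvre", "paris", "photo"]),
    (48.853, 2.349, ["paris", "photo"]),
    (37.819, -122.478, ["goldengate", "photo"]),
    (37.802, -122.405, ["sanfrancisco", "goldengate", "photo"]),
]
top = salient_tags(photos, top_k=1)
```

A generic tag like "photo" appears in every cell and is pushed down, while a tag concentrated in one area rises to the top. Changing `cell_size` is a rough stand-in for changing the zoom level.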

We then noticed a funny entry on the Paris Tag Map. It read “Les Blogs”; an explanation can be found here: a bunch of bloggers at a conference, posting Flickr photos until our algorithm thought this was the main descriptor for that area (and Paris). In other words, events started showing up on our map. That got us thinking: can we do a better job modeling, identifying and presenting the data that is specifically associated with events?

tagmaps paris

At SIGIR 2007 we showed that the answer is yes. With Tye and Nathan, we described a system that discovers real-world events from Flickr geotagged data, including hyper-local events such as BYOBW (an old favorite for me to show in talks, and an event I literally learned about from our results). The takeaway? Social media can reflect real-world events, via content created by a collective of mostly uncoordinated contributors.

After Tahrir Square these “discoveries” seem rather obvious, but that was not the case in 2007, before Facebook and Twitter gained any mainstream popularity, and well before the iPhone popularized media and location (iPhone 3G was released in July 2008).

In my talk, I am going to address the challenges in developing event technologies, show some of the solutions and technologies we developed in my research, complain that in 2013 that problem is not yet solved commercially (case in point: the link I had to use for BYOBW above), and give a demo of Mahaya’s recent product, Seen, where we start solving the “event problem”. I’ll also talk about social media as the next step in the evolution of information systems, and what it means for Information Retrieval.

Come and say hi if you are in Moscow next week!

Cheer Up! Some Holiday Hacking

With my star undergrads Ian and Abe, and backend support from Ziad, we put together this mashup for the holidays! We use the data from the Twitter streaming crawler we built (for our NSF-funded work) to get Instagram photos posted on Twitter that have the word Christmas in the tweet, and where the photo location is available on Twitter. We then add the Google Streetview of the photo location and, well, mash them all together.

Cheerbeat Screenshot

The result is an interesting juxtaposition (as one comment on my Facebook post captured well) of the “small instagram-style photos (typically close-up, indoors) against the backdrop of the (typically distant, outdoors) Google street views”. As such, the StreetView gives context to the Instagram photo and maybe provides the settings in which the activity in the photo is taking place, another dimension of understanding, often much stronger than the text of the tweet itself.

Cheerbeat Screenshot

The app is also an interesting (and mostly unintended) statement about privacy: I don’t know how these users would feel knowing their environment is exposed to all, and not just in a default bland zoomed-in map format.

Cheerbeat Screenshot

The Cheerbeat application (instacheer was our first name choice but, perhaps not amazingly, already taken with another Instagram Christmas mashup!) mostly runs as JavaScript in the browser. We continuously crawl Twitter data using the streaming API on our server. When the app loads, it grabs from our server a .json file with the latest 250 tweets that contain “Christmas” and an Instagram photo link, and that have non-empty geo coordinates. We then (in the browser) use the Google Streetview API to find which of these insta-tweets’ locations are available. The app then rotates through the tweets/photos showing the tweet, picture, location, time and Streetview of each.
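
The server-side filtering step can be sketched roughly as follows. This is an illustration under assumptions, not the Cheerbeat code: it assumes each crawled tweet is a JSON object with Twitter's `text`, `coordinates` (a GeoJSON point) and `entities.urls` fields, that Instagram photos are recognised by their link host, and that the input is ordered oldest to newest. The function name `select_tweets` and the `PHOTO_HOSTS` list are hypothetical.

```python
import json

PHOTO_HOSTS = ("instagr.am", "instagram.com")  # assumed Instagram link hosts


def select_tweets(raw_json, keyword="christmas", limit=250):
    """Keep tweets that mention the keyword, link to a photo, and carry coordinates."""
    selected = []
    for tweet in json.loads(raw_json):
        text = tweet.get("text", "")
        # Twitter's "coordinates" field is a GeoJSON point: [longitude, latitude].
        point = (tweet.get("coordinates") or {}).get("coordinates")
        urls = [
            u.get("expanded_url") or ""
            for u in tweet.get("entities", {}).get("urls", [])
        ]
        photo_urls = [u for u in urls if any(host in u for host in PHOTO_HOSTS)]
        if keyword in text.lower() and point and photo_urls:
            lon, lat = point
            selected.append(
                {"text": text, "lat": lat, "lon": lon, "photo": photo_urls[0]}
            )
    # Assumes oldest-to-newest input, so the tail holds the latest matches.
    return selected[-limit:]
```

The browser would then only need to load this short list and check each `lat`/`lon` pair against Street View.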

As a side note, after all this filtering, surprisingly little data satisfied all these criteria, mostly (I suspect) because Twitter requires specific user authorization for location information to be posted in tweets. In other words, even though many (most?) Instagrams will have location data, a lot of those will not have their data available when posted on Twitter.

There are extra features coming for this app (e.g., choosing your own keywords), but more on that later.

Happy holidays and enjoy the beat!


Putting on a SMILe (Plus: Winners!)

They say academia is the art of becoming world-renowned without appearing to be self-promoting. Sometimes, however, you gotta make some noise. In our case, we (that’s my team and me, don’t blame Ayman) have recently launched a new lab, the Social Media Information Lab. We thought we’d like to get the word out, especially as we are looking for new PhD students (and maybe postdocs) to join our ranks.

As the CHI 2011 conference is the most popular conference that matches our research area, we decided to do something for it. It also helps that CHI has traditionally been a very playful gathering, with people allowing their badges to be decorated with a host of badges (formal and informal), stickers, puppets, and various other household items. Love the CHI academics. We decided to have a little game.

With our convenient lab name acronym, SMIL (perhaps not accidental), we zeroed in on a Smile theme pretty quickly. We picked four exceptionally smiley CHI luminaries as our SMILe ambassadors: Ben Shneiderman, Judy Olson, Elizabeth Churchill, and Ed Chi. The fantastically talented Funda Kivran-Swaine has turned their regular smiley pictures into monochromatic images (Ed now carries his proudly on his Twitter profile), which we printed on some 1000 stickers using the wonderful-yet-pricey Zazzle service. Of course, the stickers included the URL of the SMIL website.

From left: Judy Olson, Ed Chi, Elizabeth Churchill, Ben Shneiderman

We devised a conference game, with very simple mechanics: collect all four heroes on your badge, post it on Flickr/Twitter (#chismil) and you have a chance to win a CHI-SMIL t-shirt. We also made it somewhat difficult: different team members (and friends) distributed different stickers, and Ed’s sticker was the rarest, with access to it tightly controlled by Funda alone.

Did it work? We think it did. Soon enough, people I didn’t know approached me begging for “A Judy Olson” (or some other sticker), and a rumor was started that there was a secret fifth member.

Second, the luminaries themselves were great sports, and seemed to enjoy the commotion and exchanges around the stickers. They each had a roll of their kind, except for Ed of course (access controlled to the end!).

In addition, people went to our website and commented on it to me (and perhaps to others).

And, finally, many people labored to collect all four stickers! (partial set of images). We put names in a hat, drew them out, and have five lucky winners. There you go, people. T-shirts are coming. You’re welcome.

Stay tuned for CHI 2012. Who knows what games will be played.

Talk with Me (a.k.a. Wake me Up)

If you are reading this and live in the same great city as my good friend Dr. Naaman, you should go to the opening of the Talk to Me show at the MOMA, July 24th 2011. From their blog, they say:

Talk to Me is an exhibition on the communication between people and objects…It will feature a wide range of objects from all over the world, from interfaces and products to diagrams, visualizations, perhaps even vehicles and furniture, by bona-fide designers, students, scientists, all designed in the past few years or currently under development.

A year ago, I had the good fortune of meeting Paola Antonelli, the curator of Architecture and Design at the NY MOMA. She was describing to me this show, which was in its infancy at the time. So I’m excited to see it actually open and terribly sad I won’t be able to make the opening. We chatted for a little bit about the semantic difference between “Talk to Me” and “Talk with Me” (my research is focused more so on the latter). Quite a few months later, someone told me this quote by Ben Shneiderman: “the old computing is about what computers can do, the new computing is about what people can do.”

Recently, thinking about technology that people talk with, my friend Jeffery Bennett and I entered a Web-of-things Hack-a-thon, part of Pervasive Computing. Our idea was simple. Can we enable an everyday object to reuse the asynchronous status update on Facebook and Twitter to connect with someone in a meaningful, real-time way? Enter The REAWAKENING.

We thought to call it 'Sleeper Cell' too.

Quite simply, The REAWAKENING is a socially connected alarm clock. We used an old skool Chumby (quite possibly one of the best prototyping tools ever made) to make our clock, which is tied into the Facebook and Twitter platforms. The REAWAKENING works like any other alarm clock. You set it and you go to sleep. When the alarm goes off, you can turn it off and wake up. But seriously, who does that? So, the alarm goes off, and you hit snooze and go back to bed. The snooze button gives you an extra 8.5 minutes of sleep; at the same time, The REAWAKENING posts your snooze to Facebook and Twitter:

If five (5) of your friends follow the link from the snooze post, the alarm will fire again on the clock, preempting your 8.5 minute snooze. And this cycle can continue if you hit snooze again. When you do finally wake up and turn off the alarm, your friends are notified:

There are plenty of places for The REAWAKENING to go: shaking the clock could message your friends back to stop, or you could ‘auto alarm’ to wake up when your friends nearby are going to wake up; don’t be surprised if you see it in an app-store near you. More importantly, as we continue to invent and build out a connected world, let’s continue to expand the people and things we talk to and who we talk with.
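
The snooze cycle described above is essentially a small state machine, sketched here under assumptions rather than taken from the Chumby app. The 8.5-minute snooze and the five-click threshold come from the post; the class name `SocialAlarm`, its methods, and the returned strings standing in for the Facebook/Twitter posts are hypothetical.

```python
from dataclasses import dataclass

SNOOZE_SECONDS = 8.5 * 60  # snooze length described in the post
CLICKS_TO_WAKE = 5         # friend clicks that cut the snooze short


@dataclass
class SocialAlarm:
    state: str = "idle"          # "idle", "ringing", or "snoozing"
    snooze_ends_at: float = 0.0  # timestamp in seconds
    clicks: int = 0

    def ring(self):
        self.state = "ringing"

    def snooze(self, now):
        """Start a snooze; the return value stands in for the social post."""
        if self.state != "ringing":
            return None
        self.state = "snoozing"
        self.snooze_ends_at = now + SNOOZE_SECONDS
        self.clicks = 0
        return "snooze posted"

    def friend_click(self):
        """A friend followed the link; enough clicks fire the alarm early."""
        if self.state != "snoozing":
            return
        self.clicks += 1
        if self.clicks >= CLICKS_TO_WAKE:
            self.state = "ringing"

    def tick(self, now):
        """Called periodically; fires the alarm when the snooze runs out."""
        if self.state == "snoozing" and now >= self.snooze_ends_at:
            self.state = "ringing"

    def turn_off(self):
        self.state = "idle"
        return "awake posted"
```

Because `snooze` resets the click counter, the cycle can repeat as many times as the sleeper hits the button.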

Using Sociology(!) to Explain Unfollows on Twitter

What gives, @ayman is no longer following me on Twitter!

Well, he still does, not least because he knows I will send roadkill to his office address if he stops. But surely, people stop following one another on Twitter all the time. Right? Right? Yes, right, as we show in our recent paper (caution, PDF), with my PhD students Funda Kivran-Swaine and Priya Govindan, to be published at CHI 2011.

Many studies, in academia and industry, in computer science and sociology (this one too), examine creation of new ties in social networks, but very few examine tie breaks and persistence. Why? One reason is that, in computer science, models of tie creation have immediate consequences for systems (e.g., recommending new contacts). Another reason is that tie breaks are rare, or hard to detect/define in many social networks, especially those networks studied by sociologists (when does Naaman’s tie with Ayman break? after 3 years of not communicating? 20?). Ron Burt’s work is an exception, but Ron is always an exception, isn’t he.

Enter Twitter, where we can witness a dynamic social system, and where ties are created and broken for all to observe. Op-por-tu-ni-ty! Can we shed some light on the tie break phenomenon in Twitter? How widespread is this phenomenon, and what are the factors that can help predict tie breaks?

We started with a random set of 715 Twitter users, and the 245,586 Twitter users that “follow” them at Time 1 (July 2009). We looked at these users and followers again after nine months (April 2010, Time 2). Did these follow edges still exist? How many dropped over that period? The image below captures one of our 715 users and the network around them at Time 1. Those users that stopped following our user at Time 2 (the “unfollowing” users) and their connections are marked in blue. Now it’s time to pause and see what you think the overall “unfollow” rate is in our data: 5%? 15%? 25%? 75%? OK, scroll down.
Unfollowing on Twitter.
Turns out, over nine months, 30% of the follow edges disappeared. On average, a single user lost about 39% of their followers over that period. How come it’s not 30%? Because the 39% is an average of averages; probably due to the fact that people with a large number of followers (of which there are fewer) lost a smaller portion of their followers, but still a large number. Does more followers mean relatively fewer unfollowers? I’ll come back to that in a second.
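
The gap between the two percentages is easy to reproduce with a toy example. The sketch below uses made-up numbers (not the study's data) for three seeds: one with many followers and a modest loss rate, and two small accounts that each lost half their followers. The function name `unfollow_rates` is hypothetical.

```python
def unfollow_rates(seeds):
    """seeds: list of (followers_at_time_1, followers_lost) per seed user."""
    total_followers = sum(followers for followers, _ in seeds)
    total_lost = sum(lost for _, lost in seeds)
    # Edge-level rate: every follow edge counts once.
    edge_rate = total_lost / total_followers
    # User-level rate: every seed counts once, whatever its size.
    mean_user_rate = sum(lost / followers for followers, lost in seeds) / len(seeds)
    return edge_rate, mean_user_rate


# Made-up numbers for illustration only.
toy_seeds = [(1000, 250), (20, 10), (10, 5)]
edge_rate, mean_user_rate = unfollow_rates(toy_seeds)
print(f"edge-level: {edge_rate:.1%}, mean per user: {mean_user_rate:.1%}")
```

The large account dominates the edge-level figure, while the small accounts pull the per-user average up, which is the same pattern as the 30% versus 39% in the data.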

For this work, we were mainly interested in looking at whether well-known sociological processes are in play on Twitter in respect to unfollowing activity. So we did our lit review, and discovered that strength of ties, embeddedness within networks, and power/status are some of the key related sociological concepts (the paper explains those in detail, of course). The question then was: can we look at the network structure alone, and based on these theories, see if there are network factors that are highly correlated with unfollows?

The details of the dataset are in the paper, but for now, just imagine that for each “follow” relationship, we had the complete network graph of both nodes. So if “@ayman following @informor” was one of the edges we looked at, we could get the entire network neighborhood of @informor, and @ayman. (This network data is presented to you courtesy of Kwak et al.). What properties of @informor’s network, and of the network around @informor and @ayman, correlate with higher probability that @ayman would stop following me?

We calculated a bunch of variables, including for example, for each of our 715 initial users (let’s call them “seeds”):

  • The seed’s number of followers.
  • The seed’s clustering coefficient: how connected their followers are.
  • The seed’s reciprocity rate: what portion of the people following them do they follow back?
  • The seed’s follow-back rate: what portion of the people they follow, follow them back?
  • The seed’s follower-to-friends ratio.

And for each seed and follower pair in our data, we computed aspects of their relationship:

  • How many connections they have in common (i.e., users the seed and follower both connect to)?
  • What is the difference in prestige between the two (in terms of number of followers)?
  • Does the seed reciprocate the connection to the follower?
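
Most of the variables listed above can be computed directly from a follow graph. The following is a minimal sketch, not the paper's code: it assumes the graph is a dictionary mapping each user to the set of users they follow, counts a "common connection" as any user linked in either direction to both the seed and the follower, and leaves out the clustering coefficient for brevity. All function names and the toy graph are hypothetical.

```python
def followers_of(follows, user):
    """Users who follow `user`, given follows[u] = set of users u follows."""
    return {u for u, friends in follows.items() if user in friends}


def seed_metrics(follows, seed):
    friends = follows.get(seed, set())
    followers = followers_of(follows, seed)
    mutual = friends & followers
    return {
        "num_followers": len(followers),
        # Portion of the seed's followers that the seed follows back.
        "reciprocity_rate": len(mutual) / len(followers) if followers else 0.0,
        # Portion of the people the seed follows who follow the seed back.
        "follow_back_rate": len(mutual) / len(friends) if friends else 0.0,
        "follower_to_friend_ratio": (
            len(followers) / len(friends) if friends else float("inf")
        ),
    }


def pair_metrics(follows, seed, follower):
    def neighbors(user):
        return follows.get(user, set()) | followers_of(follows, user)

    common = (neighbors(seed) & neighbors(follower)) - {seed, follower}
    return {
        "common_neighbors": len(common),
        "prestige_difference": (
            len(followers_of(follows, seed)) - len(followers_of(follows, follower))
        ),
        "reciprocated": follower in follows.get(seed, set()),
    }


# Toy graph: a, b and c follow the seed; the seed follows a and b.
toy_follows = {
    "seed": {"a", "b"},
    "a": {"seed", "b"},
    "b": {"seed"},
    "c": {"seed", "a"},
}
seed_stats = seed_metrics(toy_follows, "seed")
pair_stats = pair_metrics(toy_follows, "seed", "c")
```

Scanning the whole dictionary for followers is fine for a toy graph; a dataset of this size would call for a precomputed reverse index.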

So, which factors correlated most with unfollow activity? We ran quite a sophisticated analysis (multi-level logistic regression), but I’ll keep it simple here with a basic look at the factors that our analysis had shown to contribute to the probability that a follower will unfollow a seed. For the more “scientific” study, check out the paper.

First, what did *NOT* have an impact: the number of followers a seed had at Time 1 had very limited impact on the probability of unfollows for that seed, and that impact was mitigated by other factors. A figure (limited to seeds who had fewer than 500 followers) demonstrates this.
num followers

So what played a major role? Reciprocity, for one, did. Do you follow someone that follows you? If you do, they are much less likely to unfollow you. Remember our 245,586 connections? Half of them were reciprocated (the seed also followed their follower). When the relationship was reciprocated, 16% of the followers unfollowed. When it wasn’t, a whopping 45% did. Before I throw a figure in, an important note about causality: we don’t know the causality. For example, pairs of users who are closer in real life (“strong ties”) are likely to have a reciprocated relationship and of course, their connection is not likely to break (because they are close). A deeper examination is needed to show whether the reciprocity act *alone* helps in maintaining the tie, although the analysis in the paper suggests that it contributes more than other factors that typically signify strong relationships.


We can even look at the user’s tendency to reciprocate follow relationships, and its effect on the percent of followers they lose:

Here’s one more thing to think about: a user’s follow-back rate was highly correlated with a lower ratio of unfollows, but the ratio of followers to followees wasn’t. The follow-back rate is the portion of the people a user follows that follow them back. For example, I may have 15 followers and 10 followees (people I follow) on Twitter. Out of the people I follow, 8 follow me back. So my follow-back rate is 80%, and my follower-to-followee ratio is 1.5. Both these metrics are potential measures of “importance” on Twitter, but the fact that only one (the follow-back rate) impacts the rate at which people stop following me hints that the follow-back rate might be a better measure of importance and success on Twitter. Makes sense, Ayman? What’s your follow-back rate?

Unfollowing on Twitter: followback rate.

What else? Embeddedness is the last thing I will touch on; you can read the paper for more (it’s only a 4 pager, don’t be too easy on yourself). And by embeddedness I do not mean the number of YouTube videos you post on your Twitter stream, but the sociological concept that captures the set of relationships that exist between the individuals in a relationship through third parties (i.e., common friends). More common friends? Your relationship is presumed to be stronger. It is not a surprise, then, that the larger the number of common neighbors two Twitter users have, the less likely one is to unfollow the other. From our data, this figure shows, for each level of common neighbors a “follow” relationship had, what percent of these follows became “unfollows”. For example, from all follow relationships that had no common neighbors at Time 1, 78% did not exist at Time 2; one common neighbor was enough to drop that number to 46% (and it keeps dropping; I stopped at 15 because you get the idea).

common neighbors
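
A figure like this one boils down to grouping follow edges by their common-neighbor count and taking the share that broke in each group. The sketch below assumes each edge is given as a `(common_neighbors, unfollowed)` pair and pools every count above a cap into the last bucket; the pooling and the function name `unfollow_rate_by_common_neighbors` are assumptions for illustration, not details from the paper.

```python
from collections import defaultdict


def unfollow_rate_by_common_neighbors(edges, cap=15):
    """edges: iterable of (common_neighbors, unfollowed) pairs.

    Returns {bucket: share of edges in that bucket that were unfollowed}.
    Counts above `cap` are pooled into the `cap` bucket (an assumption).
    """
    totals = defaultdict(int)
    dropped = defaultdict(int)
    for common, unfollowed in edges:
        bucket = min(common, cap)
        totals[bucket] += 1
        dropped[bucket] += bool(unfollowed)
    return {bucket: dropped[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up edges for illustration only.
toy_edges = [(0, True), (0, True), (0, False), (1, True), (1, False), (20, False)]
rates = unfollow_rate_by_common_neighbors(toy_edges)
```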

What didn’t we look at? Pretty much everything else! We relied on network structure alone to investigate these unfollows, as a first step. But there’s a lot more: how often do you tweet (or not)? How interesting are your posts? How similar are your topics to those of the people following you? We are now exploring all these factors and additional variables. Stay tuned.

[update: slideshare presentation here].


Last month, John Forsythe and I made a Chumby app called ShakeMe. The basic idea is like the Folding@Home or Seti@Home projects, where people lend their CPU cycles for some scientific research. The major difference is that we don’t want CPU cycles; we collect sensor data from accelerometers to make a sensor mesh of seismographic activity. We submitted the idea to the Freescale Electronics “Sense the world” contest.

If you dig this idea, vote for us! This is a two step process on Facebook.

  1. Like Freescale on Facebook here.
  2. Like our video on Facebook here.

I hear Naaman voted three times. Voting closes December 10th, so vote soon!

Twitter Sentiment Dataset Online

Late last year, Nick Diakopoulos and I analysed the sentiment of tweets to characterize the Presidential debates. You can read about it in this paper. For this work, we collected sentiment judgements on 3,238 tweets from the first 2008 Presidential debate.

Today, we’ve decided to post the data online for everyone. Just a few notes before we do:

  1. Twitter users own their tweets.
  2. The sentiment judgements are free for non-commercial, educational, artistic, and academic usage.
  3. The tweets were all publicly posted.
  4. This data was collected via their search API in 2008; read this paper for details on how.
  5. Sentiment judgements were fetched from Mechanical Turkers; read this other paper for details.
  6. Be responsible in your work using this data.

We are releasing this under a Creative Commons license. Dataset for Characterizing Debate Performance via Aggregated Twitter Sentiment by Nicholas Diakopoulos and David A. Shamma is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The data set is available as a compressed tab-separated file [here’s the ZIP download link]; give us a shout here as a comment if you use it somewhere. Enjoy!


This dataset is now on InfoChimps.

Speaking in ML

Back to school, Naaman? It has been a long summer. I had the pleasure of working with Jude Yew (you will enjoy the stylish cartoon drawing of himself) from the School of Information in Ann Arbor, Michigan. We began the summer thinking about social networks and media sharing. We decided not to look at Twitter. Instead we looked back at Yahoo! Zync. We began to examine videos that were shared over IM in sync, how they were watched, and when people scrubbed. This became rather interesting and led us to ask questions about how we watch and consume and perceive videos.

To back up some, we started to look at videos just from YouTube. How they were classified. And how we could predict classification based on the video’s metadata. It turns out…it’s hard. We had a small dataset (under 2,000 videos) and getting a bigger crawl and throwing the data in the cloud was…well…just gonna take a little time. I get a little impatient.

We were using Naive Bayes to predict if a video was: Comedy, Music, Entertainment, Film, or News. The YouTube metadata had three features: the video length, the number of views, and the 5-star rating. We wondered about how people rate movies. Some B and even C movies are cult classics. They belong to a class of like media. It doesn’t say that a particular B movie isn’t as good as a particular A movie. If this is in fact the case, the set of 4.2 rated YouTube videos could be fit to a polynomial anywhere. In effect, they do not need to be before 4.5 and after 4.0. Technically put, the ratings of 0.0 to 5.0 could be transformed from interval to factors. With factorization, Naive Bayes has more freedom to fit polynomials to probabilistic distributions.

Only when we nominally factor the ratings can we classify videos on YouTube using only three features. Compared to random predictions with the YouTube data (21% accurate), we attained a mediocre 33% accuracy in predicting video genres using a conventional Naive Bayes approach. However, the accuracy significantly improves by nominal factoring of the data features. By factoring the ratings of the videos in the dataset, the classifier was able to accurately predict the genres of 75% of the videos.
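
To make the idea of nominal factoring concrete, here is a minimal sketch, not the classifier from the paper. It assumes each rating is turned into a string label so that 4.0, 4.2 and 4.5 become unordered categories, and it implements a small categorical Naive Bayes with add-one smoothing. The class `NominalNaiveBayes`, the helper `as_factor`, and the toy ratings are hypothetical; only the rating feature is used here, and length and views would need some binning, which is a further assumption.

```python
import math
from collections import Counter


class NominalNaiveBayes:
    """Naive Bayes over nominal (categorical) features with add-one smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        self.values_seen = [set() for _ in range(n_features)]
        # counts[label][feature index][value] -> number of training rows
        self.counts = {
            label: [Counter() for _ in range(n_features)]
            for label in self.class_counts
        }
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[label][i][value] += 1
                self.values_seen[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, value in enumerate(row):
                # The extra +1 in the denominator reserves mass for unseen values.
                smoothed = (self.counts[label][i][value] + 1) / (
                    n + len(self.values_seen[i]) + 1
                )
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def as_factor(rating):
    """Treat a numeric rating as an unordered category label."""
    return f"{rating:.1f}"


# Made-up toy data: Music ratings cluster at 3.0 and 4.8, Comedy at 4.0.
ratings = [4.8, 3.0, 4.8, 3.0, 4.0, 4.0, 4.0]
genres = ["Music"] * 4 + ["Comedy"] * 3
model = NominalNaiveBayes().fit([(as_factor(r),) for r in ratings], genres)
```

In this toy set the Music ratings average 3.9, so a single bell curve over the numeric rating would center Music right next to Comedy's 4.0. Treated as factors, each rating value gets its own per-genre probability, and the two clusters of Music ratings no longer blur into the Comedy region.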

The patterns of social activity found in the metadata are not just meaningful in their own right, but are indicative of the meaning of the shared video content. This was our first step this summer in investigating the potential meaning and significance of social metadata and its relation to the media experience. We’ll be presenting the paper Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification (pdf) at IS&T/SPIE in January. Stop by and say hi if you see one of us there!

Who, What, When, Where: The Semantic Web is Alive and Well (and on Facebook)

I have killed the semantic web before (at least in my provocative title), but pointed out that the future of semantics is light-weight semantics created by programmers, users or individual companies. And here it comes: the future of the Semantic Web (and by that I also mean the Web, the life and the Universe) is now owned by Facebook.

A recent Yahoo! patent, dug up by SEO by the Sea, reminded me of the work I’ve been involved with at Yahoo!, driven by the vision of Marc Davis: being able to semantically connect the four most important dimensions of Web objects, Who, What, When and Where, directly to the user experience on Yahoo!. But while Yahoo! dragged its feet, Facebook is making real steps to becoming the true W4 platform for the Web. The identity (Who) war seems to have been won, at least for the time being; for most people, the real identity on the Web is the one they expose on Facebook. Controlling the Who has immediate consequences (e.g., de-facto communication platform for people trying to reach contacts), but has also allowed Facebook to expand into the When (Events), What (Pages) and now Where (Places). And as I am doing the linking here, I notice the Facebook title for the Places page: interesting…

Facebook W4

In other words, the Facebook W4 network allows people to connect their experiences to well-defined concepts that “live” in the Facebook objectverse. This is one of Facebook’s greatest successes, and greatest leverage going forward.

Going forward means allowing other developers and companies to build on the Facebook W4 semantics. Yahoo! only partially succeeded in doing that with “Where”, using the Yahoo! Geo platform. Facebook now allows Websites and applications to connect via the Who (Facebook identity). Increasingly, Facebook will expand the usefulness of their “What” and “When” for other applications. The Places feature, cleverly, was already launched with integration of various companies (e.g., FourSquare) that can use the Facebook Places platform. There is no reason why this platform will not soon be open to (and used by) many other developers, giving Facebook ownership of Who and Where on the Web.

Going forward also means improving the capabilities of the Facebook platform in connecting and mashing the various entities. For example, to be able to record the fact that “this picture was taken at the event Elvis Perkins in Dearland at Governor’s Island with my friend Kathleen”. Seems like that may be coming! Many other applications are of course possible (e.g., “all the Statuses ever posted from this classroom”).

And where is Twitter? With the less specific “annotation” feature, and lagging behind in the Who space, Twitter is struggling in the objectverse, despite a strong geo-bent and a major push last year.