Using Sociology(!) to Explain Unfollows on Twitter

What gives, @ayman is no longer following me on Twitter!

Well, he still does, not least because he knows I will send roadkill to his office address if he stops. But surely, people stop following one another on Twitter all the time. Right? Right? Yes, right, as we show in our recent paper (caution, PDF), with my PhD students Funda Kivran-Swaine and Priya Govindan, to be published at CHI 2011.

Many studies, in academia and industry, in computer science and sociology (this one too), examine the creation of new ties in social networks, but very few examine tie breaks and persistence. Why? One reason is that, in computer science, models of tie creation have immediate consequences for systems (e.g., recommending new contacts). Another reason is that tie breaks are rare, or hard to detect/define in many social networks, especially those networks studied by sociologists (when does Naaman’s tie with Ayman break? after 3 years of not communicating? 20?). Ron Burt’s work is an exception, but Ron is always an exception, isn’t he.

Enter Twitter, where we can witness a dynamic social system, and where ties are created and broken for all to observe. Op-por-tu-ni-ty! Can we shed some light on the tie break phenomenon in Twitter? How widespread is this phenomenon, and what are the factors that can help predict tie breaks?

We started with a random set of 715 Twitter users, and the 245,586 Twitter users that “follow” them at Time 1 (July 2009). We looked at these users and followers again after nine months (April 2010, Time 2). Did these follow edges still exist? How many dropped over that period? The image below captures one of our 715 users and the network around them at Time 1. Those users that stopped following our user at Time 2 (the “unfollowing” users) and their connections are marked in blue. Now it’s time to pause and see what you think the overall “unfollow” rate is in our data: 5%? 15%? 25%? 75%? OK, scroll down.
Unfollowing on Twitter.
Turns out, over nine months, 30% of the follow edges disappeared. On average, a single user lost about 39% of their followers over that period. How come it’s not 30%? Because the 39% is an average of averages; the gap is probably due to the fact that people with a large number of followers (of which there are fewer) lost a smaller portion of their followers, but still a large number. Does more followers mean relatively fewer unfollowers? I’ll come back to that in a second.
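The gap between the two numbers is easy to reproduce with made-up data. Here is a minimal sketch, using invented follower counts (not the study's data), of how the pooled edge-level rate and the per-user average drift apart when a large account loses a smaller share of its followers:

```python
# Hypothetical (followers_at_time_1, followers_lost) pairs; NOT the study's data.
seeds = [
    (1000, 200),  # large account: loses 20% of its followers
    (10, 5),      # small account: loses 50%
    (20, 10),     # small account: loses 50%
]

def pooled_unfollow_rate(seeds):
    """Share of all follow edges that disappeared (edge-level rate)."""
    total_followers = sum(followers for followers, _ in seeds)
    total_lost = sum(lost for _, lost in seeds)
    return total_lost / total_followers

def mean_per_user_rate(seeds):
    """Average of each user's own loss rate (an average of averages)."""
    rates = [lost / followers for followers, lost in seeds]
    return sum(rates) / len(rates)

print(f"pooled: {pooled_unfollow_rate(seeds):.1%}")
print(f"per-user mean: {mean_per_user_rate(seeds):.1%}")
```

The large account dominates the pooled number, while every account counts equally in the per-user mean, so the second number comes out higher.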

For this work, we were mainly interested in looking at whether well-known sociological processes are in play on Twitter with respect to unfollowing activity. So we did our lit review, and discovered that strength of ties, embeddedness within networks, and power/status are some of the key related sociological concepts (the paper explains those in detail, of course). The question then was: can we look at the network structure alone, and based on these theories, see if there are network factors that are highly correlated with unfollows?

The details of the dataset are in the paper, but for now, just imagine that for each “follow” relationship, we had the complete network graph of both nodes. So if “@ayman following @informor” was one of the edges we looked at, we could get the entire network neighborhood of @informor, and @ayman. (This network data is presented to you courtesy of Kwak et al.). What properties of @informor’s network, and of the network around @informor and @ayman, correlate with higher probability that @ayman would stop following me?

We calculated a bunch of variables, including for example, for each of our 715 initial users (let’s call them “seeds”):

  • The seed’s number of followers.
  • The seed’s clustering coefficient: how connected their followers are.
  • The seed’s reciprocity rate: what portion of the people following them, they follow back?
  • The seed’s follow-back rate: what portion of the people they follow, follow them back?
  • The seed’s follower-to-friends ratio.

And for each seed and follower pair in our data, we computed aspects of their relationship:

  • How many connections they have in common (i.e., users the seed and follower both connect to)?
  • What is the difference in prestige between the two (in terms of number of followers)?
  • Does the seed reciprocate the connection to the follower?
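To make the pair-level variables concrete, here is a rough sketch of how they could be computed from a follow graph. It is not the code used for the paper: the function name, the toy graph, and the choice to count a "connection" in either direction are all assumptions made for illustration.

```python
def pair_features(seed, follower, following):
    """Compute relationship features for one (seed, follower) pair.

    `following` maps each user to the set of users they follow.
    """
    # Build the reverse map: user -> set of users who follow them.
    followers = {}
    for user, followees in following.items():
        for followee in followees:
            followers.setdefault(followee, set()).add(user)

    def neighbors(user):
        # Assumption: a "connection" counts in either direction.
        return following.get(user, set()) | followers.get(user, set())

    common = (neighbors(seed) & neighbors(follower)) - {seed, follower}
    return {
        "common_neighbors": len(common),
        # Prestige difference measured as a difference in follower counts.
        "prestige_difference": len(followers.get(seed, set()))
        - len(followers.get(follower, set())),
        # Does the seed follow the follower back?
        "reciprocated": follower in following.get(seed, set()),
    }

# Toy graph, invented for illustration only.
toy_following = {
    "ayman": {"informor", "nick"},
    "informor": {"ayman", "nick"},
    "nick": {"informor"},
    "funda": {"informor"},
}
features = pair_features("informor", "ayman", toy_following)
print(features)
```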

So, which factors correlated most with unfollow activity? We ran quite a sophisticated analysis (multi-level logistic regression), but I’ll keep it simple here, with a basic look at the factors that our analysis showed to contribute to the probability that a follower will unfollow a seed. For the more “scientific” study, check out the paper.

First, what did *NOT* have impact: the number of followers a seed had at Time 1 had very limited impact on the probability of unfollows for that seed, and that impact was mitigated by other factors. A figure (limited to seeds who had fewer than 500 followers) demonstrates this.
num followers

So what played a major role? Reciprocity, for one, did. Do you follow someone that follows you? If you do, they are much less likely to unfollow you. Remember our 245,586 connections? Half of them were reciprocated (the seed also followed their follower). When the relationship was reciprocated, 16% of the followers unfollowed. When it wasn’t, a whopping 45% did. Before I throw a figure in, an important note about causality: we don’t know the causality. For example, pairs of users who are closer in real life (“strong ties”) are likely to have a reciprocated relationship and of course, their connection is not likely to break (because they are close). A deeper examination is needed to show whether the reciprocity act *alone* helps in maintaining the tie, although the analysis in the paper suggests that it contributes more than other factors that typically signify strong relationships.
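As a quick sanity check, the two reciprocity numbers should add back up to the overall unfollow rate. A small sketch of that arithmetic, treating “half” as exactly 0.5 (an approximation on my part):

```python
# Figures quoted in the post; "half" is treated as exactly 0.5 here,
# which is an approximation.
share_reciprocated = 0.5
unfollow_rate_reciprocated = 0.16
unfollow_rate_one_way = 0.45

# Law of total probability: weight each group's rate by the group's share.
overall_rate = (share_reciprocated * unfollow_rate_reciprocated
                + (1 - share_reciprocated) * unfollow_rate_one_way)

# How many times more likely a one-way follower was to unfollow.
relative_risk = unfollow_rate_one_way / unfollow_rate_reciprocated

print(f"overall unfollow rate: {overall_rate:.1%}")
print(f"one-way vs reciprocated: {relative_risk:.1f}x")
```

The weighted rate comes out at about 30%, in line with the share of follow edges that disappeared overall.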


We can even look at the user’s tendency to reciprocate follow relationships, and its effect on the percent of followers they lose:

Here’s one more thing to think about: a user’s follow-back rate was highly correlated with a lower ratio of unfollows, but the ratio of followers to followees wasn’t. The follow-back rate is the portion of the people a user follows that follow them back. For example, I may have 15 followers and 10 followees (people I follow) on Twitter. Out of the people I follow, 8 follow me back. So my follow-back rate is 80%, and my follower-to-followee ratio is 1.5. Both these metrics are potential measures of “importance” on Twitter, but the fact that only one, the follow-back rate, impacts the rate at which people stop following me, hints that the follow-back rate might be a better measure of importance and success on Twitter. Makes sense, Ayman? What’s your follow-back rate?
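The two metrics are simple enough to write down. A minimal sketch, with made-up user names arranged to match the example above (10 followees, 15 followers, 8 of them mutual); the function names are mine, not from the paper:

```python
def follow_back_rate(followees, followers):
    """Portion of the people a user follows who follow them back."""
    if not followees:
        return 0.0
    return len(followees & followers) / len(followees)

def follower_to_followee_ratio(followees, followers):
    """Number of followers divided by number of followees."""
    if not followees:
        return float("inf")
    return len(followers) / len(followees)

# Hypothetical account: follows user0..user9, is followed by user2..user16.
followees = {f"user{i}" for i in range(10)}
followers = {f"user{i}" for i in range(2, 17)}

print(follow_back_rate(followees, followers))
print(follower_to_followee_ratio(followees, followers))
```

Note that the two numbers can move independently: gaining followers you do not follow raises the ratio but leaves the follow-back rate untouched.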

Unfollowing on Twitter: followback rate.

What else? Embeddedness is the last thing I will touch on; you can read the paper for more (it’s only a 4-pager, don’t be too easy on yourself). And by embeddedness I do not mean the number of YouTube videos you post on your Twitter stream, but the sociological concept that captures the set of relationships that exist between the individuals in a relationship through third parties (i.e., common friends). More common friends? Your relationship is presumed to be stronger. It is not a surprise, then, that the larger the number of common neighbors two Twitter users have, the less likely one is to unfollow the other. From our data, this figure shows, for each level of common neighbors a “follow” relationship had, what percent of these follows became “unfollows”. For example, from all follow relationships that had no common neighbors at Time 1, 78% did not exist at Time 2; one common neighbor was enough to drop that number to 46% (and it keeps dropping; I stopped at 15 because you get the idea).
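The numbers behind a figure like this are just a group-by. Here is a hedged sketch of that computation; the input format and the toy edges are invented for illustration and are not the paper's data or code:

```python
from collections import defaultdict

def unfollow_rate_by_common_neighbors(edges, cap=15):
    """Return {common_neighbor_count: share of edges that were unfollowed}.

    `edges` is an iterable of (common_neighbors, unfollowed) pairs, one per
    follow relationship at Time 1. Counts above `cap` are ignored.
    """
    totals = defaultdict(int)
    dropped = defaultdict(int)
    for common_neighbors, unfollowed in edges:
        if common_neighbors > cap:
            continue
        totals[common_neighbors] += 1
        if unfollowed:
            dropped[common_neighbors] += 1
    return {k: dropped[k] / totals[k] for k in sorted(totals)}

# Toy edges, invented for illustration only.
toy_edges = [(0, True), (0, True), (0, True), (0, False),
             (1, True), (1, False), (2, False), (20, True)]
rates = unfollow_rate_by_common_neighbors(toy_edges)
print(rates)
```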

common neighbors

What didn’t we look at? Pretty much everything else! We relied on network structure alone to investigate these unfollows, as a first step. But there’s a lot more: how often do you tweet (or not)? How interesting are your posts? How similar are your topics to those of the people following you? We are now exploring all these factors and additional variables. Stay tuned.

[update: slideshare presentation here].


Last month, John Forsythe and I made a Chumby app called ShakeMe. The basic idea is like the Folding@Home or Seti@Home projects, where people lend their CPU cycles for some scientific research. The major difference is that we don’t want CPU cycles; we collect sensor data from accelerometers to make a sensor mesh of seismographic activity. We submitted the idea to Freescale Electronics’ “Sense the world” contest.

If you dig this idea, vote for us! This is a two-step process on Facebook.

  1. Like Freescale on Facebook here.
  2. Like our video on Facebook here.

I hear Naaman voted three times. Voting closes December 10th, so vote soon!

Twitter Sentiment Dataset Online

Late last year, Nick Diakopoulos and I analysed the sentiment of tweets to characterize the Presidential debates. You can read about it in this paper. For this work, we collected sentiment judgements on 3,238 tweets from the first 2008 Presidential debate.

Today, we’ve decided to post the data online for everyone. Just a few notes before we do:

  1. Twitter users own their tweets.
  2. The sentiment judgements are free for non-commercial, educational, artistic, and academic usage.
  3. The tweets were all publicly posted.
  4. This data was collected via their search API in 2008; read this paper for details on how.
  5. Sentiment judgements were fetched from Mechanical Turkers; read this other paper for details.
  6. Be responsible in your work using this data.

We are releasing this under a Creative Commons license. Dataset for Characterizing Debate Performance via Aggregated Twitter Sentiment by Nicholas Diakopoulos and David A. Shamma is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The data set is available as a compressed tab-separated file [here’s the ZIP download link]; give us a shout here as a comment if you use it somewhere. Enjoy!
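If you want a starting point for poking at a tab-separated file like this one, here is a small sketch. The column layout is not described in this post, so the column index and the sample rows below are purely made up; check the actual file before relying on any of it.

```python
import csv
import io
from collections import Counter

def tally_column(tsv_text, column):
    """Count how often each value appears in one column of tab-separated text."""
    # QUOTE_NONE keeps quotation marks inside tweets from confusing the parser.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader if len(row) > column)

# Made-up rows in a made-up layout (id, text, label); NOT the real dataset.
sample = ('1\tgreat answer\tpositive\n'
          '2\the said "no" again\tnegative\n'
          '3\tnice line\tpositive\n')
counts = tally_column(sample, 2)
print(counts)
```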


This dataset is now on InfoChimps.

Speaking in ML

Back to school, Naaman? It has been a long summer. I had the pleasure of working with Jude Yew (you will enjoy the stylish cartoon drawing of himself) from the School of Information in Ann Arbor, Michigan. We began the summer thinking about social networks and media sharing. We decided not to look at Twitter. Instead we looked back at Yahoo! Zync. We began to examine videos that were shared over IM in sync, how they were watched, and when people scrubbed. This became rather interesting and led us to ask questions about how we watch and consume and perceive videos.

To back up some, we started to look at videos just from YouTube. How they were classified. And how we could predict classification based on the video’s metadata. It turns out…it’s hard. We had a small dataset (under 2,000 videos) and getting a bigger crawl and throwing the data in the cloud was…well…just gonna take a little time. I get a little impatient.

We were using Naive Bayes to predict if a video was: Comedy, Music, Entertainment, Film, or News. The YouTube metadata had three features: the video length, the number of views, and the 5-star rating. We wondered about how people rate movies. Some B and even C movies are cult classics. They belong to a class of like media. It doesn’t say that a particular B movie isn’t as good as a particular A movie. If this is in fact the case, the set of 4.2 rated YouTube videos could be fit to a polynomial anywhere. In effect, they do not need to be before 4.5 and after 4.0. Technically put, the ratings of 0.0 to 5.0 could be transformed from interval to factors. With factorization, Naive Bayes has more freedom to fit polynomials to probabilistic distributions.

Only when we nominally factor the ratings can we classify videos on YouTube using only three features. Compared to random predictions with the YouTube data (21% accurate), we attained a mediocre 33% accuracy in predicting video genres using a conventional Naive Bayes approach. However, the accuracy significantly improves by nominal factoring of the data features. By factoring the ratings of the videos in the dataset, the classifier was able to accurately predict the genres of 75% of the videos.
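To illustrate what treating ratings as nominal factors means in practice, here is a toy sketch of a Naive Bayes classifier over categorical features. This is not the classifier from the paper; the class, the helper `rating_as_factor`, and the four training videos are invented, and only the rating feature is used (video length and view counts would need their own binning).

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes over nominal features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[label][feature_index][value] -> count
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        self.feature_values = defaultdict(set)
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[label][i][value] += 1
                self.feature_values[i].add(value)
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)  # log prior
            for i, value in enumerate(row):
                seen = self.value_counts[label][i][value]
                n_values = len(self.feature_values[i])
                # Smoothed log likelihood of this value given the class.
                score += math.log((seen + 1) / (count + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def rating_as_factor(rating):
    """Turn a 0.0-5.0 rating into an unordered label such as '4.2'."""
    return f"{rating:.1f}"

# Invented training data: 4.2-rated videos are Music, 4.0 and 4.5 are Comedy.
rows = [(rating_as_factor(r),) for r in (4.2, 4.2, 4.5, 4.0)]
labels = ["Music", "Music", "Comedy", "Comedy"]
model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict((rating_as_factor(4.2),)))
print(model.predict((rating_as_factor(4.5),)))
```

Because each rating is its own category, the 4.2 bucket can point to a different genre than its numeric neighbors 4.0 and 4.5, which a model that treats the rating as a smooth interval value would struggle to express.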

The patterns of social activity found in the metadata are not just meaningful in their own right, but are indicative of the meaning of the shared video content. This was our first step this summer in investigating the potential meaning and significance of social metadata and its relation to the media experience. We’ll be presenting the paper Know Your Data: Understanding Implicit Usage versus Explicit Action in Video Content Classification (pdf) at IS&T/SPIE in January. Stop by and say hi if you see one of us there!

Who, What, When, Where: The Semantic Web is Alive and Well (and on Facebook)

I have killed the semantic web before (at least in my provocative title), but pointed out that the future of semantics is light-weight semantics created by programmers, users or individual companies. And here it comes: the future of the Semantic Web (and by that I also mean the Web, the life and the Universe) is now owned by Facebook.

A recent Yahoo! patent, dug up by SEO by the Sea, reminded me of the work I’ve been involved with at Yahoo!, driven by the vision of Marc Davis: being able to semantically connect the four most important dimensions of Web objects, Who, What, When and Where, directly to the user experience on Yahoo!. But while Yahoo! dragged its feet, Facebook is making real steps to becoming the true W4 platform for the Web. The identity (Who) war seems to have been won, at least for the time being; for most people, the real identity on the Web is the one they expose on Facebook. Controlling the Who has immediate consequences (e.g., de-facto communication platform for people trying to reach contacts), but has also allowed Facebook to expand into the When (Events), What (Pages) and now Where (Places). And as I am doing the linking here, I notice the Facebook title for the Places page: interesting…

Facebook W4

In other words, the Facebook W4 network allows people to connect their experiences to well-defined concepts that “live” in the Facebook objectverse. This is one of Facebook’s greatest successes, and greatest leverage going forward.

Going forward means allowing other developers and companies to build on the Facebook W4 semantics. Yahoo! only partially succeeded in doing that with “Where”, using the Yahoo! Geo platform. Facebook now allows Websites and applications to connect via the Who (Facebook identity). Facebook will increasingly make its “What” and “When” useful for other applications. The Places feature, cleverly, was already launched with integration of various companies (e.g., FourSquare) that can use the Facebook Places platform. There is no reason why this platform will not soon be open to (and used by) many other developers, giving Facebook ownership of Who and Where on the Web.

Going forward also means improving the capabilities of the Facebook platform in connecting and mashing the various entities. For example, to be able to record the fact that “this picture was taken at the event Elvis Perkins in Dearland at Governor’s Island with my friend Kathleen”. Seems like that may be coming! Many other applications are of course possible (e.g., “all the Statuses ever posted from this classroom”).

And where is Twitter? With the less specific “annotation” feature, and lagging behind in the Who space, Twitter is struggling in the objectverse, despite a strong geo-bent and a major push last year.

Interaction, movement, and dance at DIS 2010

Denmark. Århus. DIS 2010. I was particularly excited to be presenting the first detailed paper on Graffiti Dance (an art performance I co-organized last year with Renata Sheppard and Jürgen Schible). Unfortunately, Naaman wasn’t there; it’s fun for the two of us to storm into a distant country…hilarity ensues. The conference itself was spectacular. With all-time lows for acceptance rates (I believe full papers were at 15% and short papers somewhere north of 21%; 2008 had about a 34% acceptance rate), the talks covered everything from prototypes to rich qualitative studies. Aaron Houssian liveblogged all three days in case you need to catch up: [Day 1, Day 2, Day 3]. I spoke on Day 3, the morning after we built a nail gun sculpture.

Now with any good talk you present, you should have some new insight into your work. In this case, I decided not to present what’s in the published article, which covers some theory, design process, and system, concluding with an informal exit interview with the audience and the dancers. You should check out the video describing the performance on Vimeo. Instead, I presented the provenance of the idea: how three artists far apart from each other made this happen.

First, as it was pointed out to me, nothing new was really created to make this installation happen. There were system components from other performances that we reused to make something completely unique. The Computer Scientist in me appreciated this deeply. Sometimes, in particular with art, we fight for novelty. Henri de Toulouse-Lautrec put it best:

In our time there are many artists who do something because it is new.. they see their value and their justification in this newness. They are deceiving themselves.. novelty is seldom the essential. This has to do with one thing only.. making a subject better from its intrinsic nature.

Second, this takes a group painting and stencil image session and maps the on-screen movement (created by the scurry of 4 mouse cursors and brushes scrambling to create an image) to movement in the audience (facilitated through dancers). Why not map the dancers to the drawn image, rather than the movement of the cursors? It occurs to me (after a few discussions with Renata) that most approaches proxy movement through audio cues, drawn images, or time of day. Our performances think about connected action between people. Motion tied to motion is a much stronger link than an image tied to motion. Movement is not a proxy. This relates to a responsive dress Renata and I made last year; the lights in the dress respond to the dancer’s movements.

Light Dress

Finally, this performance carries the larger research agenda of mine: how do we build for connected synchronized action? For this embodiment that is this performance, that’s worth a longer journal paper.

[Note: once the ACM Digital Library hosts the proceedings, I’ll add a link to the published paper here]

iSticks iSteelpan iTaiko and iMan

Hey Naaman? You get one of those shiny new iPad things? Ever since I saw them…I thought there was something there. Such a nice big screen. So many colors. It’s stunning. Makes me want to hit it with something.

Apple, well Mr. Steve, seems to dislike the idea of input devices aside from your hand. No pens. No stylus. Use it naturally. I think there’s something to that mantra, but then again, we do a lot as humans with tools and instruments. The exacto knife, a spatula, a paint brush…all of these things let us manipulate and create things around us. Touching is great for interacting, but we tend to create with instruments.

So, when I thought to myself that I wanted to poke and hit an iPad, I had a problem. I had no iPad. As fortune would have it, I borrowed one for one month from a friend in exchange for a box of fancy chocolates.

The second issue arose when I remembered the touch screen is capacitive. Hit it all day long with a stick; nothing. It needs to carry a charge and feel like a relatively fatty finger. I immediately thought of modern conductive fabric; much less greasy than a Korean sausage though not as tasty.

Armed with a metal dowel, conductive fabric, textured cotton, and some string, I showed up at Music Hackday in SF one Saturday morning and made some drumsticks. You can see how I built the sticks on Instructables:

iSticks: How to make a drumstick for an iPad.

Now…with sticks in hand, I built my second ever iPhone app. A Taiko drum. Just to test the idea out. Not wanting to make another R8 drum kit on my borrowed iPad, I thought of a more esoteric instrument. A Steel Pan drum! Once I built the steel drum, I realized I didn’t know how to play it. So I made a tutorial that acts like a whack-a-mole game and teaches you how to play Twinkle Twinkle Little Star. The app won two awards at the San Francisco Music Hack Day.

Currently, iSteelPan and iTaiko are free in the App Store, which took some doing (initially Apple said I had some trademark infringements around the tutorial). Distribution of apps…someone should run a workshop on that. Oh right, Henriette Cramer is; the deadline’s in two days…good luck!

The Secret Life of (One) Professor: Two Years In

Matt Welsh of Harvard recently wrote on the Secret Lives of Professors, a post that stirred a lot of discussion and struck a chord with a somewhat less experienced professor (that would be me; two years on the job vs. Matt’s seven). I found myself nodding at many of Matt’s well-framed observations.

Matt’s main “surprises” and lessons that he offers to grad students in his post include:

Lots of time spent on funding requests. I have had a similar experience, because (like Matt) I enjoy working with, and leading, a large group of researchers. Of course, the batting averages are low for funding requests (Matt downplays his success rate but I bet it’s better than average). In my first two years, I submitted 3 NSF proposals, 2 of which were declined and one of which is still outstanding (a good sign); I am currently working on two more. Each of these took significant effort, in one case at least (an estimated) two full months of my time. In addition, I submitted a number of smaller-scale proposals, most of them quick and easy to write, and was fortunate enough to get a Google Research Award (thanks again Goog!), and to be assigned as a faculty mentor to a superstar two-year postdoc, Nick Diakopoulos. Together with some other odds and ends (thanks SC&I!) I feel pretty happy after two years regarding the group and resources I amassed; but the cost on my time is still substantial. On the bright side, as Sam Madden points out in the comments to Matt’s article, some of the grant proposal process is actually useful in helping me think about future work and research agendas, even if the specific proposal does not get funded.

The job is never done. Even as I write this, I could (and feel that I should!) be editing a paper, or looking at some data, or catching up on email, or working on one of two said proposals. Matt admits:

For years I would leave the office in the evening and sit down at my laptop to keep working as soon as I got home.

I can’t say my experience is far from that, although I still insist on taking good vacations. And a 2-year-old kid certainly makes for a compelling reason to stop working at any time.

Can’t get to “hack”. True enough, most of the interesting work is delegated to students, as Matt complains that he doesn’t find time to write code. However, that is partially the decision that Matt (and I) knowingly take when we decide to work with (and try to fund) a large group of students. Managing fewer or no students might allow more individual research work, which is certainly a path taken by some faculty who skip the funding requests and the resultant student meetings. However, I am no Ayman, do not miss writing code, and am happy to farm that out to students. I do enjoy thinking about the intellectual and research issues, and often get to do that with the students. I would like to have fewer meetings and less email, but unlike Matt I feel involved enough in the intellectual work, at least so far. Nevertheless, I can’t dive into it like the grad students who indeed “have it good”.

Working with students. Matt writes:

The main reason to be an academic is… to train the next generation.

I see it the same way (the intellectual pursuit is also up there, but it could be claimed that you can perform similar intellectual pursuits in other settings like research labs). The students are why I am in academia, and the advising is by far my favorite activity. From solving someone else’s problems (e.g. a student not sure how to approach X or Y) to, more substantially, showing students a path from first-year confusion to being an experienced researcher who understands how to ask (and answer) research questions, and communicate them effectively. Well, I am clearly not quite there yet, having just recently started doing it (and just started funding my first PhD student). But I am enjoying it already. Like Matt, for me it is not just working with the PhDs and Masters students; the undergrads play a big role. I started working with several star undergrads; some of them have never SSH’ed into a server before, and most of them have never seen how research is done. Their wide-eyed excitement is an energy source, an inspiration and a cause of constant enjoyment.

So, the bottom line?

It is certainly not for everybody. It remains to be seen if it is even for me.

I will buy that, Matt. At the end of the day, for me, it’s the students, and the freedom to carve my own path. This summer I am lucky enough to be working with my group at SC&I consisting of one postdoc, 2-3 PhD students, 3 Masters students, and 1-3 undergrads (at any given time). With teaching (more on this topic later) out of the way, I spend two full days a week with this gang talking about research, writing papers or grants, having other “good” meetings, or playing Rock Band on our Wii. It’s definitely one of the best work summers I have had, much like my summers at Yahoo! Research Berkeley where we had most of our fantastic interns join in on the fun.

Speaking of the defunct Y!RB, and regarding that path-carving freedom, I feel a lot less constrained in academia compared to industry research. I have had a fantastic experience at Yahoo!, and was lucky to have a great team at the Berkeley lab. However, to start my own project at Yahoo!, one that follows my own personal vision and involves multiple people, would have taken a lot of convincing (and would need to be ultimately tied to the corporate agenda). I know Ayman does not agree, so maybe this is just a false sense that I have, that moving a bunch of people towards a vision that I choose and craft is easier in academia. To do that with the students might be, as Matt put it, “the coin of the realm”.

Apple Does Migrations (Almost) Perfectly

Just got a new MacBook Pro. I’ve been on Mac for about 5 years now, and the number one most impressive feature to me is the migration. As someone lucky enough to be in a place with a fantastic IT department (yes, I know that’s unlikely, but our IT people are superstars) it means just dropping off my old Mac, and, voila! A few hours later I have all the setup I had before (down to the browser history items), reproduced on a lovely new machine.

Just a few things went wrong, most of which are Apple’s fault, and some of which are quite annoying.

First, the Mac didn’t recognize the iPhone. Luckily I was clever enough to think of checking for a Mac software update, and sure enough, the only update available was a fix to this bug. +1 point, Apple.

But it got worse once the iPhone was recognized. Soon enough I got this notice right here:

OK, a little scary, and totally wrong (not getting into DRM discussion here) but not so bad as a user experience: the dialog allowed me to continue and gave me options, I can live with that (but why didn’t the migration carry forward my authorization?). Anyway, I asked to authorize, only to get another prompt: Something like “sorry, you already have 5 authorized computers”. This time, I was offered no way out other than acknowledging that lovely, yet curious fact (which 5 machines had I authorized? Ayman certainly didn’t get my permission for any content!). I was too shocked to take a screen grab of that pesky dialog. Still, this wasn’t a big deal, because I knew what to do: de-authorize all my computers (the only one I knew I had authorized was not with me, since I migrated from it, see, so I couldn’t just de-authorize it). But that’s wrong, Mr. Jobs. Why would a “normal” (i.e., not 6’8″) user know how to de-authorize their other computers? Instead, I would like to have seen this process:

1. “Hey, it seems like you already reached the maximum number of computers allowed to access your licensed content! Would you like to fix that?”

Options: But of course! / No, I’ll just curl up in the corner and cry

2. “Here are the details of your 5 authorized computers. Which one(s) would you like to de-authorize?”

Options: Select any number of computers to de-authorize.

3. Done!

Easy, Steve? -gazillion points, Apple!

Another thing that didn’t migrate properly was my Screensaver (although my desktop picture preferences were kept). I guess that’s because in Snow Leopard you need to use iPhoto albums to choose screensaver photos. But why would Desktop background work and screensaver break? Slightly bizarre.

The wifi was also a mild annoyance, forgetting all my preferences (but at least remembering the networks’ credentials for secure networks).

Finally (geek/grad student topic alert), I lost my LaTeX (MacTeX) installation in the migration to the new Mac. I mean, the files were still there but the migration broke a few symbolic links and tampered with the folder structure just enough to make my various LaTeX editors not find the MacTeX installation. MacTeX has a several-step solution, but you know me, I take my shortcuts (I just upgraded to MacTeX 2009), which fixed all these issues.

So, Apple could have made this really close to a perfect game, but allowed a couple of walks in there late in the innings, just to have Naaman complain. Well, what would I do without them.