Taking R to the Zoo

Feeling pretty good about now? So far we’ve just played in the garden; things get messier when you enter the real world. Let’s start by looking at this dataset of Taco Bell tweets. It’s about 10,000 tweets, so still pretty small, but the 3.1MB of deliciousness can cause us some problems. First, let’s read it in, straight from the URL.

> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> system.time(f <- read.csv(u))
   user  system elapsed
  1.331   0.137   9.010
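
If you want to poke at this pattern without hitting the network, here is a minimal sketch that swaps the URL for a throwaway local file. The two-row CSV and the names tmp, con, and d are made up for illustration; they are not part of the Taco Bell dataset.

```r
# Write a tiny made-up CSV to a temporary file (a stand-in for the real download).
tmp <- tempfile(fileext = ".csv")
writeLines(c("s,Body",
             "01/31/11 04:21 AM,I really want Taco Bell",
             "01/31/11 04:20 AM,@friend tacos?"), tmp)

# file(tmp) creates a connection without opening it, much like url(...).
con <- file(tmp)

# system.time() runs the expression and reports how long it took;
# the assignment to d still happens as a side effect.
timing <- system.time(d <- read.csv(con))

dim(d)             # 2 rows, 2 columns
timing["elapsed"]  # wall-clock seconds
```

Because read.csv opened con itself, it also closes it when it is done, so there is nothing left to clean up.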

The url(…) call creates a connection object, which we then pass to read.csv. You won’t notice it, but read.csv opens the connection and closes it again when it finishes. The file takes a few seconds to load; you can wrap any command in system.time(…) to see how long it takes. Now let’s look at what we have:

> dim(f)  # how many rows and columns?
[1] 9413    9
> class(f)
[1] "data.frame"
> class(f$s)
[1] "factor"
> class(f$Source)
[1] "factor"
> class(f$Title)
[1] "factor"

A ‘factor’ is R’s class for nominal (categorical) variables, and R loves it. It’s good when a column holds a small set of distinct categories, but not so much for dates or tweets.

> f$s[1]
[1] 01/31/11 04:21 AM
2011 Levels:        01/24/11 01:04 PM 01/24/11 01:06 PM ... 01/31/11 12:16 AM

The “2011 Levels” line tells us there are 2011 distinct timestamps in the dataset. We need dates to be, well, dates, and tweets to be text. We can convert a vector or column by wrapping it in a conversion function like as.character(…):

> f$Body[1]
[1] I really want Taco Bell. I dont care if its fake meat!
8612 Levels:  ... ????Taco Bell???????????????????????????????????????????????????????????
> as.character(f$Body[1])
[1] "I really want Taco Bell. I dont care if its fake meat!"
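
To see what a factor actually stores, here is a small sketch on a made-up vector of timestamps (stamps is a hypothetical name, not a column from the dataset). A factor keeps one copy of each distinct string as a level and stores integer codes pointing at those levels.

```r
# Three made-up timestamps, two of them identical.
stamps <- factor(c("01/31/11 04:21 AM", "01/24/11 01:04 PM", "01/31/11 04:21 AM"))

nlevels(stamps)       # 2 distinct values
levels(stamps)        # the distinct strings, sorted
as.integer(stamps)    # the integer codes behind the scenes: 2 1 2
as.character(stamps)  # back to plain strings, one per element
```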

But really, the best way to do this is to make sure the reader pulls in the right data class when the file loads. This is specified with the colClasses argument in the read.csv call.

> types <- c("character", "factor", "factor", "character", "character", "character", "character", "character", "character")
> u <- url("http://blog.looxii.com/wp-content/uploads/2011/01/tb-tweats-jan24-jan31.csv")
> f <- read.csv(u, colClasses=types)
> class(f$s)
[1] "character"
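
Here is the same idea as a sketch you can run offline, assuming a made-up three-column CSV passed in through the text argument. The column names mirror the dataset but the rows are invented. One thing worth knowing: from R 4.0.0 on, strings are read as character by default, so you may only see factors when you ask for them.

```r
# A tiny invented CSV held in a string instead of a file.
csv_text <- "s,Source,Body
01/31/11 04:21 AM,web,I really want Taco Bell
01/31/11 04:20 AM,web,@friend tacos?"

# One class name per column, in column order.
types <- c("character", "factor", "character")

d <- read.csv(text = csv_text, colClasses = types)
sapply(d, class)  # the class of each column after loading
```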

Great!  The c(…) function made a vector of strings, one for each column in the file; each entry is the name of the class for that column and we pass it in to the read.csv(…) function.  Next, we want just the tweets with @ symbols.  In R, we can grep in a string like so:

> grep("@", "this is a test")
integer(0)
> grep("@", "this is @ test")
[1] 1
> grep("@", c("this is not a test", "this is @ test"))
[1] 2

That 1 is an index into the input vector, not a truth value. Watch, let’s check for an @ symbol in the first 5 rows of our dataset.

> grep("@", f$Body[1:5])
[1] 3 4 5

So, rows 3, 4, and 5 have an @ symbol. Oh hey, that’s a nice little index vector into the data frame! So, if we want a new variable holding just the tweets with @ symbols, it’s easy: ask for those rows by passing that vector in as the row indices.

> dim(f)
[1] 9413    9
> ats <- f[grep("@", f$Body), ]
> dim(ats)
[1] 4031    9
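
As a side note, here is a small sketch of the index-versus-truth-value distinction on a made-up data frame (demo and its five rows are invented). grep returns positions; its sibling grepl returns one TRUE or FALSE per element. Either works for picking rows.

```r
# Five invented tweets; rows 3 to 5 contain an @ symbol.
demo <- data.frame(
  Body = c("I really want Taco Bell", "no mention here",
           "@friend tacos?", "RT @tacobell yum", "thanks @you"),
  stringsAsFactors = FALSE
)

idx  <- grep("@", demo$Body, fixed = TRUE)   # positions of matches: 3 4 5
hits <- grepl("@", demo$Body, fixed = TRUE)  # FALSE FALSE TRUE TRUE TRUE

# Both forms select the same rows; drop = FALSE keeps a one-column data frame.
by_index   <- demo[idx, , drop = FALSE]
by_logical <- demo[hits, , drop = FALSE]

# No match gives an empty integer vector, not FALSE.
none <- grep("@", "this is a test", fixed = TRUE)
length(none)  # 0
```

fixed = TRUE treats the pattern as a literal string rather than a regular expression, which is all we need for a single @.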

So roughly 43% of the dataset (4031 of 9413 tweets) has @ symbols. Now we’ll need the zoo package. Go get it.

> install.packages("zoo")
Installing package(s) into ‘/Users/shamma/Library/R/2.13/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.cnr.berkeley.edu/bin/macosx/leopard/contrib/2.13/zoo_1.7-6.tgz'
Content type 'application/x-gzip' length 1396545 bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb

The downloaded packages are in

Now, we still need the first column as timestamps, not character arrays.

> ?strptime
> strptime(ats[1,1], format="%m/%d/%y %I:%M %p")
[1] "2011-01-31 04:20:00"
> class(strptime(ats[1,1], format="%m/%d/%y %I:%M %p"))
[1] "POSIXlt" "POSIXt"
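
Here is a short sketch of how the format pieces map onto the string, using made-up timestamps in the dataset’s layout. Two assumptions to flag: tz = "UTC" is my addition so the result does not depend on your machine’s time zone, and %p (AM/PM) is locale-dependent, so this assumes an English-style locale.

```r
fmt <- "%m/%d/%y %I:%M %p"  # month/day/2-digit year, 12-hour clock, minutes, AM/PM

am <- strptime("01/31/11 04:20 AM", format = fmt, tz = "UTC")
pm <- strptime("01/24/11 01:04 PM", format = fmt, tz = "UTC")

class(am)  # "POSIXlt" "POSIXt"
pm$hour    # 13, since POSIXlt exposes its fields as list elements

# POSIXct is the compact "seconds since 1970" form, handy as a data frame column.
ct <- as.POSIXct(am)
format(ct, "%Y-%m-%d %H:%M")  # "2011-01-31 04:20"

# A string that does not fit the format comes back as NA rather than an error.
bad <- strptime("not a date", format = fmt, tz = "UTC")
is.na(bad)  # TRUE
```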

strptime(…) lets us convert strings to timestamps with a specified format. The ?strptime command will tell you what to use for formatting, as it’s different from other languages you might know. Great, we can do this against the whole column and make a “zoo” (Z’s Ordered Observations) object.

> library(zoo)

Attaching package: ‘zoo’

The following object(s) are masked from ‘package:base’:

    as.Date, as.Date.numeric

> ?zoo
> z <- zoo(ats$Title, order.by=strptime(ats[,1], format="%m/%d/%y %I:%M %p"))
Warning message:
In zoo(ats$Title, order.by = strptime(ats[, 1], format = "%m/%d/%y %I:%M %p")) :
  some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique

Just ignore the warning for now; it appears because several tweets share the same timestamp. What we are doing is ordering our dataset (in this case the Titles) by timestamp. The strptime(…) command is applied to the whole first column of the dataset (remember how R distributes a function across a vector?). Really we are going to use the zoo as an intermediate data structure. Now we aggregate, counting the number of tweets per minute.

> ats.length <- aggregate(z, format(index(z), "%m-%d %H:%M"), length)
> summary(ats.length)
         Index        ats.length
 01-24 05:00:   1   Min.   : 1.000
 01-24 05:01:   1   1st Qu.: 1.000
 01-24 05:03:   1   Median : 2.000
 01-24 05:05:   1   Mean   : 2.803
 01-24 05:06:   1   3rd Qu.: 3.000
 01-24 05:07:   1   Max.   :29.000
 (Other)    :1432

The aggregate(…) function takes the zoo, groups its observations by the given index, and applies the specified function to each group. In this case we chose length, so this is the total number of tweets per minute (the length of the vector in each time bucket, not the length of the tweets). We can easily aggregate by the hour by changing the time format:

> ats.length.H <- aggregate(z, format(index(z), "%m-%d %H"), length)
> summary(ats.length.H)
      Index      ats.length.H
 01-24 05:  1   Min.   : 1.00
 01-24 06:  1   1st Qu.:17.00
 01-24 07:  1   Median :31.00
 01-24 08:  1   Mean   :29.64
 01-24 09:  1   3rd Qu.:42.00
 01-24 10:  1   Max.   :70.00
 (Other) :130
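
If the grouping step feels like magic, here is a base-R sketch of the same idea with tapply instead of zoo. This is a swapped-in technique, not what the transcript above runs, and the six timestamps and titles are invented. The trick is identical though: format the timestamp down to the bucket you care about, then count per bucket.

```r
# Six made-up tweets: two at 05:00, one at 05:01, three at 06:15.
stamps <- c("01/24/11 05:00 AM", "01/24/11 05:00 AM", "01/24/11 05:01 AM",
            "01/24/11 06:15 AM", "01/24/11 06:15 AM", "01/24/11 06:15 AM")
when   <- as.POSIXct(strptime(stamps, format = "%m/%d/%y %I:%M %p", tz = "UTC"))
titles <- paste("tweet", seq_along(when))

# Repeated timestamps like these are what trigger zoo's uniqueness warning.
anyDuplicated(when) > 0  # TRUE

# Bucket by minute, then by hour, counting the entries in each bucket.
by_minute <- tapply(titles, format(when, "%m-%d %H:%M"), length)
by_hour   <- tapply(titles, format(when, "%m-%d %H"), length)

by_minute  # 2, 1, 3 for 05:00, 05:01, 06:15
by_hour    # 3, 3 for hours 05 and 06
```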

You could even calculate the mean, if the zoo contained numeric data (like follower counts), by changing the function passed to aggregate. Plotting this is easy too… but instead of plotting to the screen, let’s save two PNGs.

> png("byminute.png")
> barplot(ats.length)
> dev.off()
null device
> png("byhour.png")
> barplot(ats.length.H)
> dev.off()
null device

The png(…) function opens a PNG file for writing. From then on, the output of any plotting command is written to disk (and not displayed) until you call dev.off(). Our two plots look like (minute on the left, hour on the right):

What’s great is, you can save a vector PDF too by using the pdf(…) function just like the png(…) one. Next time, we’ll talk about dealing with something really, really big, data-wise.
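
Here is a minimal sketch of the PDF version, assuming some invented hourly counts and a temporary file as the destination (counts and out are hypothetical names).

```r
# Invented tweet counts for three hours.
counts <- c("05" = 3, "06" = 7, "07" = 5)

out <- tempfile(fileext = ".pdf")  # swap in a real path such as "byhour.pdf"

pdf(out)                           # open the PDF device; nothing shows on screen
barplot(counts, xlab = "hour", ylab = "tweets")
invisible(dev.off())               # close the device so the file is flushed to disk

file.exists(out)  # TRUE
```

Forgetting dev.off() is the usual reason for an empty or unreadable file, since the plot is only finalized when the device closes.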

PS: years ago, I asked how to do this kind of aggregation on StackOverflow, which is a great resource for R help or just about any other programming language.

PPS: Bonus points for doing this but computing the average tweet length by minute.
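
One possible shape for the bonus, sketched in base R on invented data rather than the real tweets: measure each tweet with nchar, bucket by minute, and take the mean. The vectors stamps and bodies are made up, and tapply stands in for the zoo aggregate used in the post.

```r
# Three made-up tweets: two in the 05:00 minute, one at 05:01.
stamps <- c("01/24/11 05:00 AM", "01/24/11 05:00 AM", "01/24/11 05:01 AM")
bodies <- c("taco", "tacos!", "yum")

when   <- as.POSIXct(strptime(stamps, format = "%m/%d/%y %I:%M %p", tz = "UTC"))
minute <- format(when, "%m-%d %H:%M")

# nchar gives the length of each tweet; mean averages within each minute.
avg_len <- tapply(nchar(bodies), minute, mean)
avg_len  # 5 for 05:00 (lengths 4 and 6), 3 for 05:01
```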
