Over the past few years, my work has become rather quantastic. This is possibly due to the so-called big data world we live in: a result of storage becoming cheap and computing becoming ubiquitous. Naaman generally won’t geek out with me anymore as I’ve grown fond of the letter R. I’ve dragged an intern or two through it, and others have been asking me for a good ‘get started’ guide. In fact, there isn’t one; there are several. However, part of R’s difficulty comes from the packages you want to use, and part of it comes from just knowing its structure and how to select things. Years ago, while teaching studio art, I devised a Photoshop tutorial that was based entirely on selection with the marquee and magic wand tools. I told the students, “if you can select it, then you can do anything…try not to get so excited about the plugins.” It’s about time I shared this simple R tutorial, which is written with the same philosophy (oh, and go get R first and run it; you should be looking at a console window). We’ll start now with stupid R tricks, but after a few posts we’ll be knee deep in making stuff happen.
> x <- c(1,3,6,9,12)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     3.0     6.0     6.2     9.0    12.0 
> c(1.5,2,7,8,15) -> y
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.5     2.0     7.0     6.7     8.0    15.0 
Here we call the c(…) function to combine some numbers into a vector and store it. You can use the = operator for assignment, but you really want <- or ->. Think of them as funneling a value into the variable. The summary(…) function will try to give you a quick glimpse into whatever variable you have lying around. So, now we can call some simple stats stuff.
> mean(x)
[1] 6.2
> median(x)
[1] 6
> sd(x)
[1] 4.438468
> var(x)
[1] 19.7
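Since the whole philosophy here is selection, a quick aside before we move on: square brackets are R’s marquee and magic wand rolled into one. This little sketch isn’t part of the console session above, just a minimal illustration of how vector indexing works:

```r
x <- c(1, 3, 6, 9, 12)

x[2]          # the second element: 3 (R counts from 1, not 0)
x[c(1, 5)]    # the first and last elements: 1 and 12
x[-1]         # everything except the first: 3 6 9 12
x[x > 5]      # logical selection, the magic wand: 6 9 12
```

If you can select it, you can do anything to it: `x[x > 5] <- 0` would zero out those same elements in place.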
This gets good when you have more data than Excel would like to hold (you know, like over 10,000 rows). We’ll see later that reading data from disk is super easy, and also kinda tricky. So now that we have two vectors x and y, let’s find a correlation like so:
> cor(x, y)
[1] 0.965409
> ?cor
> cor.test(x, y)

	Pearson's product-moment correlation

data:  x and y
t = 6.413, df = 3, p-value = 0.007683
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5608185 0.9978007
sample estimates:
     cor 
0.965409 
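There’s nothing mystical going on in cor(…), by the way: Pearson’s r is just the covariance of x and y scaled by their standard deviations. This snippet isn’t from the session above, just a quick sketch to convince yourself the formula and the function agree:

```r
x <- c(1, 3, 6, 9, 12)
y <- c(1.5, 2, 7, 8, 15)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
r.byhand <- cov(x, y) / (sd(x) * sd(y))
r.byhand                        # 0.965409, same number cor(x, y) gave us
all.equal(r.byhand, cor(x, y))  # TRUE
```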
Pretty simple stuff. Notice that calling ?cor brings up documentation for the cor(…) function in a new window. So let’s go ahead and plot it.
> plot(x,y)
Lines…let’s fit a line to the plot. The function lm(…) fits a linear model. We need to express y as a function of x, which is done with the ~ operator, oddly enough. We’ll call the result myline.fit, which is a nicer variable name than a single non-expressive letter:
> myline.fit <- lm(y ~ x)
> summary(myline.fit)

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5 
 0.9898 -0.8909  0.5381 -2.0330  1.3959 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.6802     1.3665  -0.498  0.65285   
x             1.1904     0.1856   6.413  0.00768 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.648 on 3 degrees of freedom
Multiple R-squared: 0.932,	Adjusted R-squared: 0.9094 
F-statistic: 41.13 on 1 and 3 DF,  p-value: 0.007683
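One more aside that isn’t in the session above: you don’t have to squint at the summary(…) printout to get at the numbers. The fitted model is itself a selectable object — coef(…) hands you the intercept and slope as a named vector, and residuals(…) gives back the five residuals, like so:

```r
x <- c(1, 3, 6, 9, 12)
y <- c(1.5, 2, 7, 8, 15)
myline.fit <- lm(y ~ x)

coef(myline.fit)         # the -0.6802 intercept and 1.1904 slope from the summary
coef(myline.fit)[["x"]]  # just the slope, as a plain number
residuals(myline.fit)    # the five residuals printed above
```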
Then we make the plot, add the fitted line to it, and throw in some new points in green just for good measure.
> plot(x,y)
> abline(myline.fit)
> x2 <- c(0.5, 3, 5, 8, 12)
> y2 <- c(0.8, 1, 2, 4, 6)
> points(x2, y2, col="green")
Not too pretty, but a plot should be visible; we can worry about pretty later.
Perhaps we can make it pretty after we read some data in and start getting real. Next time!
I am so picking this up. Keep writing this series please!
While I haven’t played much with R yet, sadly, I definitely dragged a PhD student or two through it… (props to Funda). Keep these coming!
Pingback: R 2D data and simple Map Plotting › The Ayman and Naaman Show
I have been using R since I started my PhD three years ago, and I feel I could not have done it without it. I also dragged at least three colleagues through it. Hopefully, I can learn some stuff from your posts ;)