-
Notifications
You must be signed in to change notification settings - Fork 63
/
Tutorial8_AdvancedTopics.Rmd
363 lines (299 loc) · 11.9 KB
/
Tutorial8_AdvancedTopics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
% Tutorial 8: Advanced Topics
% DPI R Bootcamp
% Jared Knowles
```{r setuph, include=FALSE}
# set global chunk options
opts_knit$set(animation.fun=hook_r2swf)
opts_chunk$set(fig.path='figure/slides8-', cache.path='cache/slides8-',fig.width=12,fig.height=9,message=FALSE,error=FALSE,warning=FALSE,echo=TRUE,size='tiny',dev='png',out.width='700px',out.height='500px',cache=TRUE)
library(eeptools)
library(ggplot2)
load('data/smalldata.rda')
```
# Overview
In this lesson we hope to learn:
- Coding Style
- Write a `for' loop
- Write a basic function
- Optimize R with parallelization
- Mixed effect models
- Data mining with R
- Animations
- Git, GitHub, and add-on packages
<p align="center"><img src="img/dpilogo.png" height="81" width="138"></p>
# Basic Principles
- Good professional data analysis is computer programming
- An analysis project is computer code
- Computer code has bugs and to fight bugs the code has to be readable
-
# Coding Style
- An important part of making work reproducible is making your code readable and understandable
- Much of this involves looking into where to break lines and how to simplify expressions without making them obscure
- You can also take some cues about how to make your code consistent and clean from reading code of others on GitHub
- A few tips help though
# For Looping
- Don't do it for reasons we have discussed, R is very slow at these
- Sometimes it is inevitable, so here is how to do it:
```{r writeaforloop}
# Loop to calculate number of students per grade
nstudents<-rep(NA,6)
for (i in unique(df$grade)){
nstudents[[i-2]]<-length(df$stuid[df$grade==i])
}
nstudents
summary(factor(df$grade))
```
# Why is a loop slow?
- Run a simple test [from StackOverflow](http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r)
```{r timingloops}
A = matrix(as.numeric(1:100000))
system.time({
Sum = 0
for (i in seq_along(A)) {
Sum = Sum + A[[i]]
}
Sum
})
system.time({
sum(A)
})
rm(A)
```
# Write a simple function
- Functions are easy
- To view any function in R just type `print(myfunction)`
- For speed, some functions are not viewable because they are bytecompiled
```{r reviewingfunction}
print(mean) #bytecode, we can't see it
print(order)
```
# Still, we can write a number of simple functions very quickly
- Let's write a function to turn factors into characters
- Because typing `x<-as.character(x)` is really obnoxious
```{r writefunction,eval=FALSE}
defac<-function(x){ # assign function a name, and list its arguments
x<-as.character(x) # what does function do?
x # last line is output of function
}
a<-factor(letters)
summary(a)
summary(defac(a))
summary(as.character(a))
```
# Simple Data Cleaning Function
- What if we want to extract the numeric elements out of `foo` which has both a character and a missing value?
```{r declassfunction}
extractN<-function(x){
x<-suppressWarnings(as.numeric(x))
#ignore warnings because we don't care
x<-x[!is.na(x)]
x
}
foo<-c(1,4,3,NA,5,60,NA)
extractN(foo)
A<-extractN(foo)
```
# Complicating Functions
- Functions can have many arguments and be much more complex
- They can also be incredibly specialized--remember we are writing code for us!
# Debugging Functions
-
# Mixed Effect Models
- Mixed effect models, or random effect models, are a staple for analyzing data that exists in groups (like students in classrooms)
- The underlying belief is that observations in the same group, to some degree, look more like one another than would be randomly expected (or than members of the other group)
- Random effects help us measure and parse out this similarity to avoid it biasing our statistical model (fixed effects is another alternative)
- In R we use the `lme4` package to do this work, or in a Bayesian framework (*fancy*) we can use winBUGS or JAGS
- We'll stick with `lme4` by Doug Bates (a retired UW-Madison Professor of Statistics)
# The Basics of lme4
`mymod_me<-lmer(readSS~factor(grade)+factor(race)+female+disab+ell+(1|dist)+(1|stuid),data=df)`
- items in `(1|dist)` denote a random effect. In this case we are measuring a random effect for the variable `dist` in our data.
- Otherwise the formula is the same as a regular `lm` formula (by design
- The output is tricky to interpret, but a basic mixed model is more accurate than the non-mixed version of the model (at least in nested school/district student test data)
```{r lme4mod}
library(lme4)
mymod_me<-lmer(readSS~factor(grade)+factor(race)+female+disab+ell+(1|dist)+(1|stuid),data=df)
print(mymod_me,correlation=FALSE)
```
# LMER vs. LM
```{r modelcomp,fig.show='asis',dev='png',out.width='700px',out.height='500px'}
mymod<-lm(readSS~factor(grade)+factor(race)+female+disab+ell,data=df)
qplot(readSS,predict(mymod),data=df,alpha=I(.3),color=I('blue'))+geom_point(aes(x=df$readSS,y=fitted(mymod_me)),alpha=I(.4),color='dark green')+
theme_dpi()+xlab('Observed')+ylab('Predicted')+
geom_text(aes(x=370,y=700),label='Green is Results of the \n Mixed Model')
```
# Data Mining with R
- When prediction is all that matters and we do not need to understand the *what* of prediction, just to accurately classify groups, then `caret` provides best in class data mining tools to us
- More and more data miners in competitions, industry, and high stakes settings are using algorithms programmed and developed in R to do their work
- We don't have time to fully implement such an analysis here, but just show what a typical workflow for the `caret` package looks like
# Using the caret
```{r caretscript,echo=TRUE,eval=TRUE}
library(caret)
# Set aside test set
testset<-sample(unique(df$stuid),500)
df$case<-0
df$case[df$stuid %in% testset]<-1
# Draw a training set of data (random subset of students)
training<-subset(df,case==0)
testing<-subset(df,case==1)
training<-training[,c(3,6:16,21,22,28,29,30)] # subset vars
trainX<-training[,1:15]
refac<-function(x) as.factor(as.character(x))
trainX$stuid<-refac(trainX$stuid)
trainX$dist<-refac(trainX$dist)
trainX$year<-refac(trainX$year)
# Parameters
ctrl <- trainControl(method = "repeatedcv", number=7,repeats=3,
summaryFunction = defaultSummary)
# Search grid
grid<-expand.grid(.interaction.depth=seq(2,6,by=1),
.n.trees=seq(200,700,by=100),
.shrinkage=c(0.05,0.1))
# Boosted tree search
gbmTune<-train(x=trainX,
y=training$mathSS,
method="gbm",
metric="RMSE",
trControl=ctrl,
tuneGrid=grid,
verbose=FALSE)
#gbmPred<-predict(gbmTune,testing[,names(trainX)])
```
# Plot GBM
```{r gbmplot,out.width='700px',out.height='500px'}
plot(gbmTune)
```
# Optimizing R
- For **big** data problems, sometimes R can be a little slow
- There are two ways to speed R up--compiling code on the fly and running code in parallel
- Compiling code turns our function into **bytecode** which can be executed by the computer faster
- Running parallel uses the multiple *processor cores* in just about every modern PC to simultaneously calculate parts of a problem
- Unfortunately, when you write your code in parallel it really depends on what operating system you are using since different functions to do parallel work are set up differently on each OS
# A Quick Windows Parallel Example
```{r parallelwindows,eval=TRUE,include=FALSE}
xint <- c(-1,2)
yint <- c(-1,2)
func <- function(x,y) x^3-3*x + y^3-3*y
g <- expand.grid(x = seq(xint[1],xint[2],0.1), y = seq(yint[1],yint[2],0.1))
g$z <- func(g$x,g$y)
# Dumb for loop
integLoop <- function(func, xint, yint, n)
{
erg <- 0
# interval sizes
xincr <- ( xint[2]-xint[1] ) / n
yincr <- ( yint[2]-yint[1] ) / n
for(xi in seq(xint[1],xint[2], length.out=n)){
for(yi in seq(yint[1],yint[2], length.out=n)){
# Calculating one rectangular box
box <- func(xi, yi) * xincr * yincr
# Summarizing
erg <- erg + box
}
}
return(erg)
}
# Vectorized
integVec <- function(func, xint, yint, n)
{
xincr <- ( xint[2]-xint[1] ) / n
yincr <- ( yint[2]-yint[1] ) / n
# using vectors instead of loops
erg <- sum( func( seq(xint[1],xint[2], length.out=n),
seq(yint[1],yint[2], length.out=n) ) ) * xincr * yincr * n
return(erg)
}
# Using Apply
integApply <- function (func, xint, yint, n)
{
applyfunc <- function(xrange, xint, yint, n, func)
{
# calculates for every x interval the complete volume
yrange <- seq(yint[1],yint[2], length.out=n)
xincr <- ( xint[2]-xint[1] ) / n
yincr <- ( yint[2]-yint[1] ) / n
erg <- sum( sapply(xrange, function(x) sum( func(x,yrange) )) ) * xincr * yincr
return(erg)
}
xrange <- seq(xint[1],xint[2],length.out=n)
erg <- sapply(xrange, applyfunc, xint, yint, n, func)
return( sum(erg) )
}
integSnow <- function(cluster, func, xint, yint, n)
{
nslaves <- length(cluster)
erg <- clusterApplyLB(cluster, 1:nslaves, slavefunc, nslaves, xint, yint, n, func)
return( sum(unlist(erg)) )
}
# Parallel in Snow
integSnow <- function(cluster, func, xint, yint, n)
{
nslaves <- length(cluster)
erg <- clusterApplyLB(cluster, 1:nslaves, slavefunc, nslaves, xint, yint, n, func)
return( sum(unlist(erg)) )
}
slavefunc<- function(id, nslaves, xint, yint, n, func){
xrange <- seq(xint[1],xint[2],length.out=n)[seq(id,n,nslaves)]
yrange <- seq(yint[1],yint[2], length.out=n)
xincr <- ( xint[2]-xint[1] ) / n
yincr <- ( yint[2]-yint[1] ) / n
erg <- sapply(xrange, function(x)sum( func(x,yrange ) ))* xincr * yincr
return( sum(erg) )
}
```
```{r parallelperformancetest, echo=TRUE}
n<-10000
rep<-5
#tLoop <- replicate(rep, system.time( integLoop(func, xint, yint, n) ))
#summary(tLoop[3,])
tVec <- replicate(rep, system.time( integVec(func, xint, yint, n) ))
summary(tVec[3,])
tApply <- replicate(rep, system.time( integApply(func, xint, yint, n) ))
summary(tApply[3,])
# 2 Core Cluster
library(snow)
c1<-makeCluster(c("localhost","localhost"),type="SOCK")
tSnow1 <- replicate(rep, system.time( integSnow(c1, func, xint, yint, n) ))
summary(tSnow1[3])
stopCluster(c1)
```
# Animations
- R can produce animated plots as well
- Though animations get a bad rap in the data visualization community, they can be effective at illustrating things to users
- Requires GMCONVERT and FFMPEG and potentially other tools
```{r anisetup,include=FALSE}
x<-rnorm(10)
p<-ggplot()+geom_density(aes(x=x,color=length(x)))+geom_vline(aes(xintercept=mean(x)),color=I('red'),alpha=I(0.2))+scale_color_gradient(limits=c(10,30))
p<-p+theme_dpi()+xlim(c(-4,4))+ylim(c(0,.5))+coord_cartesian()
```
# Animating a Random Normal Distribution
```{r animationtest,echo=FALSE,fig.show='animate',dev='png',out.width='700px',out.height='500px'}
for (i in 10:30){
x<-rnorm(i)
print(
ggplot()+geom_density(aes(x=x,color=length(x)))+scale_color_gradient(limits=c(10,30))+
geom_vline(aes(xintercept=mean(x)),color=I('red'),alpha=I(0.3))+theme_dpi()
)
}
```
# Git and GitHub
- Working alone or in a group, `git` and GitHub can help
- [Learn Git and GitHub](http://try.github.com/)
- Just try GitHub for a project, it is a great way to organize code
- Of course, you can find all these tutorials [on GitHub](http://www.github.com/jknowles/r_tutorial_ed)
# Exercises
1. Enjoy R!!!!
# References
1. [Technical Report on Parallel Computing in R](http://dirk.eddelbuettel.com/papers/parallelR_techRep.pdf)
# Session Info
It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:
```{r session-info}
print(sessionInfo(), locale=FALSE)
```
# Attribution and License
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
style="border-style: none;" alt="Public Domain Mark" />
</a>
<br />
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
</p>