-
Notifications
You must be signed in to change notification settings - Fork 63
/
Tutorial2_DataImport.Rmd
352 lines (299 loc) · 13.3 KB
/
Tutorial2_DataImport.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
% Tutorial 2: Getting Data In
% DPI R Bootcamp
% Jared Knowles
# Overview
- Loading Packages
- Understand data types
- Load a CSV file
- Organize an analysis project
- Query a database
<p align="center"><img src="img/dpilogo.png" height="81" width="138"></p>
# A quick note on R `packages`
- `packages` are essentially free and open source add-ons for R
- There are over 3,000 packages available for R that add all sorts of functionality
- A few examples (from the mundane to the crazy)
* Additional graphics capabilities from the `ggplot2` package
* Advanced regression techniques from the `lme4` package for mixed effects models
* 3d graphics from the `scatterplot3d` package (also `webGL`)
* GIS analytics and mapping functionality with `sp`
* Text mining analytics with `tm`
* Predictive modeling frameworks with `caret`
* Interfaces to other programming languages like Python, Java, and C and C++
* A web server: `Rserve`
* And Minesweeper from the `fun` package
# I can haz packages?
```{r eval=FALSE,echo=TRUE}
# You can find and install packages within R
install.packages('foo') # Name must be in quotes
install.packages(c('foo','foo1','foo2'))
# Packages get updated FREQUENTLY
update.packages() # Gonna update them all
```
- Note, on Windows Vista and later R either needs to be run as an administrator to install packages, or you have to fiddle with where the packages are installed
- Packages are stored in something called the **library** which is just a collection of packages
* Sometimes folks call packages **libraries**
- Loading a package couldn't be easier `library(ggplot2)` and you're done!
# Finding Packages
- Official packages are found on CRAN (Comprehensive R A Network)
- Unofficial packages or **beta** versions of packages are found on [RForge](http://r-forge.r-project.org/) and [GitHub](http://www.github.com)
- To find out what packages are out there that do a specific function, try:
* Google "doing X in R package"
* Look at CRAN taskviews
- [CRAN taskviews](http://cran.r-project.org/web/views/) are great to find a bunch of packages related to a problem you are trying to solve
# Some Must Have Packages
`plyr` `ggplot2` `lme4` `sp` `knitr`
- These packages come with data, code, functions, and utilities that you can access
- If you want to learn about them, check them out in the RStudio Packages tab, which gives an index of their help files and handy links to their help documentation
- Some packages also have `vignettes` which walk you through use cases for the package
- `caret` has a great example
# Data Management
- Data management in R used to be managed by the `ls()` command
- Go ahead, type it.
- Now you can look at the Workspace tab in RStudio and have a complete list of the data in R's memory that is accessible to you
- All data objects have names
- All object names are unique (not strictly so, but let's not violate this)
- To reference items within an object type we need to give it an address like in `mydata$thingIwant` or `mydata@thingIwant`
- The `$` and `@` distinction depends on whether this is an S3 or an S4 class
- R will warn you if you do it wrong, so just remember when `$` doesn't work, use `@`
# The Working Directory
- The working directory, lovingly denoted by `wd` is both your friend and enemy
- R needs to know where it is in your file system to be able to access data and write output
- Check this by typing `getwd()`
- R work is best done in a selfcontained directory, like `C:/Users/My Documents/My Project/` which is then set as the working directory
- How to set the working directory? The `setwd()` command: `setwd("PATH/TO/MY PROJECT/")`
# Preliminaries
- Get and set the working directory
- Understand file system paths
- Understand relative and full paths
- Find out where things are
- To start, we'll download the Bootcamp files and open up "Tutorial 2.R"
- We'll notice that in the Bootcamp folder we have a number of subfolders for things like `data` and `handouts`
- When you do an R project, it really helps to put it into a self-contained folder so you can reference everything you need easily within the file system
# RStudio Shortcuts
- RStudio makes this so much easier!
- RStudio can set the working directory to the **Files** pane in the bottom right, or to the **Source** file in the top left
- The source file is the most likely path, but sometimes we need to access data in a separate folder and store results in another
- So RStudio shortcuts can't help there
# RStudio Projects
- A great feature of RStudio is that it allows us to create a **project** which allows us to quickly jump from folder to folder, store data, and keep open tasks and scripts
- If you have multiple R tasks at once (how lucky!) you have to use projects
- Let's walkthrough creating a Project in RStudio and understand it a little bit
- **DEMO**
# Manipulating Project Paths
- Paths are strings that tell R where to find files on your computer
- R uses both full and relative paths
- When we want to call files from different directories or write to different directories, we can use relative paths within the project
- Relative paths look like `/data` or `~data` which means, look for the `data` folder in our current working directory
- Full paths work differently, they specify the exact location on the hard drive of the machine like `C:/Path/To/My/Data` or `usr/home/jaredrocks`
# When to use a full path
- Never
- It breaks the ability to pass a project directory to a coauthor/collaborator
- But when you must...
- When you have something on a whole different drive or network store
- When you don't think you'll migrate your code to a new operating system or a different machine/network environment
- When the project is simple
# Ground Rules
- Get used to plain text input files
* R can handle other formats, but your error rate increases as does the tweaking necessary
- R has a limited set of special characters (symbols) you cannot use in your data input to be translated correctly
- These symbols are reserved and will be interpreted in strange ways if you include them in your plain text data file
- Most of them are fairly obvious operators, see [Paul Murrell's excellent summary](http://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node75.html)
# Missing Data Symbols
- Missing data has the symbols `NA` or `NaN` or `NULL` depending on the context.
- Consider:
```{r missing}
a<-c(1,2,3) # a is a vector with three elements
# Ask R for element 4
print(a[4])
```
- But what is the difference between `NA` and `NULL`?
```{r nulldata}
a<-c(a,NULL) # Append NULL onto a
print(a)
# Notice no change
a<-c(a,NA)
print(a)
```
- `NA` can hold a place, `NULL` cannot
# What the heck is Not a Number?
- `NaN` is even more special, and only holds things like imaginary numbers
- `NaN` stands for "Not a Number"
```{r nan, warning=TRUE}
b<-1
b<-sqrt(-b)
print(b)
pi/0
```
- **Inf** is a special case as well representing an infinite value
- Just for fun `sin(Inf)` = `NaN`
# Beginning Analysis
- Now let's set up our analysis project
- It is best to keep projects discrete in directories
- Create a few subdirectories
- Data
- Functions / R src
- Figures / Plots
- Cleaned Data
- We've already done this in the Bootcamp folder itself!
- See the `ProjectTemplate` package for a more detailed philosophy about organizing projects
# Organization of a project is key
- Separate data from scripts
- Separate automatic scripts from interactive scripts
- Put figures apart from both of these
- **Always keep your raw data**
# Create our Project
- Open the "bootcamp" folder on your machine
- Underneath this make a a "figure" folder to go with the "data" folder
- Create an R script in the "bootcamp" folder called "myscriptname.R"
- Put all the data files in the "data" folder
# Read in Data
- Reading in data is one of the trickiest issues for R
- This is because R is incredibly flexible and can handle data in almost any form including `.csv` `.dta` `.sas` `.spss` `.dat` and even `.xls` and `.xlsx` with some care
- So we have to carefully specify the data types to R so it can understand what form the data needs to take
- Compared to C this is great!
# CSV is Our Friend
- The easiest data type is .csv though Excel files can be read as well `r load('data/smalldata.rda')`
```{r readdata, eval=FALSE}
# Set working directory to the tutorial directory
# In RStudio can do this in "Tools" tab
setwd('~/GitHub/r_tutorial_ed')
#Load some data
df<-read.csv('data/smalldata.csv')
# Note if we don't assign data to 'df'
# R just prints contents of table
```
# Let's Check What We Got
```{r checkresults,echo=FALSE}
# Let's see what object types
# R assigned to our dataset
str(df[27:32])
```
# Always Check Your Data
- A few great commands:
```{r dim}
dim(df)
```
- `summary`
```{r summary}
summary(df[,1:5])
```
# Checking your data II
- `names`
```{r names}
names(df)
```
- `attributes` and `class`
```{r attributes}
names(attributes(df))
class(df)
```
- `str` which lists all data elements in an object and their type
# Data Warehouses, Oracle, SQL and RODBC
- Do you have data in a warehouse?
- RODBC can help
- You can query the data directly and bring it into R, saving time and hassle
- Makes your work reproducible, always start with a clean slate of data
- At DPI this can allow us to pull data directly from LDS or other databases using SQL queries
# An Example From DPI
- The basics of the RODBC package are easy to understand
```{r rodbc, eval=FALSE}
library(RODBC) # interface driver for R
channel<-odbcConnect("Mydatabase.location",uid="useR",pwd="secret")
# establish connection
# we can do multiple connections in the same R session
#
# WARNING: credentials stored in plain text unless you do some magic
#
table_list<-sqltables(channel,schema="My_DB")
#
# Get a list of tables in the connection
#
colnames(sqlFetch(channel,'My_DB.TABLE_NAME',max=1))
#
# get the column names of a table
#
datapull<-sqlQuery(channel,"SELECT DATA1, DATA2, DATA3 FROM My_DB.TABLE_NAME")
#
# execute some SQLquery, can paste any SQLquery as a string into this space
#
```
# Missing Data
- Let's add some missing data to our dataframe so we can see how missing data works
```{r munging}
random<-sample(unique(df$stuid),100)
random2<-sample(unique(df$stuid),120)
messdf<-df
messdf$readSS[messdf$stuid %in% random]<-NA
messdf$mathSS[messdf$stuid %in% random2]<-NA
```
- What is this code doing?
- How can we find out?
- Don't try this at home!
# Checking for Missing Data
- The `summary` function helps identify missing data
```{r summarym}
summary(messdf[,c('stuid','readSS','mathSS')])
nrow(messdf[!complete.cases(messdf),]) # number of rows with missing data
```
- To get rid of missing data, we can copy our data with all missing cases dropped using the `na.omit` function
```{r naomit}
cleandf<-na.omit(messdf)
nrow(cleandf)
```
# Now we have the data
- What next?
- We need to do some basic diagnostics on our data to understand the look and feel of it before we proceed
- Here are a few examples of scripts we could run to understand our data object
```{r datafirstcut}
dim(messdf)
str(messdf[,18:26])
```
# Looking at data structure
```{r datafirstcut2}
names(messdf)
```
- It looks like we have a number of `id` variables, this is useful and it is good to check if these variables have multiple rows per id or not and we do this using `length` and `unique`
```{r checkunique}
length(unique(messdf$stuid))
length(unique(messdf$schid))
length(unique(messdf$dist))
```
# Checking for Coding
- Data is coded using numeric or character representations of attributes - commonly things are coded using a 1 and 0 scheme or an A,B,C scheme
- With R we can check how our variables are coded very easily
```{r checkcoding}
unique(messdf$grade)
unique(messdf$econ)
unique(messdf$race)
unique(messdf$disab)
```
- Which are factors?
- Which are not?
# Next Steps
- In the next section we will learn to aggregate, explore, reshape, and recode data
- Questions?
# Exercises
1. Read in the CSV file from the T drive or the project folder
2. Read in the R data file from the T drive or the project folder
3. Read in the sample datafile. Find the `readSS` (reading scale score) for student 205 in grade 4.
4. Create a list of two attributes for each district in the `df` datafile.
5. Think about your own data warehouse environment. Could R interface with it? How?
# Other References
- [UCLA Academic Technology Services: Reading in Raw Data](http://www.ats.ucla.edu/stat/r/pages/raw_data.htm)
- [Quick-R: Data Import](http://www.statmethods.net/input/importingdata.html)
- [Video Tutorials](http://www.twotorials.com/)
- [Long CRAN Manual on Data Import/Export](http://cran.r-project.org/doc/manuals/R-data.html)
# Session Info
It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:
```{r session-info}
print(sessionInfo(), locale=FALSE)
```
# Attribution and License
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
style="border-style: none;" alt="Public Domain Mark" />
</a>
<br />
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
</p>