-
Notifications
You must be signed in to change notification settings - Fork 63
/
Tutorial1_Intro.Rmd
983 lines (755 loc) · 29.2 KB
/
Tutorial1_Intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
---
title : Tutorial 1- Getting Started
subtitle : DPI R Bootcamp / 2013
author : Jared Knowles
job : Research Analyst, Wisconsin DPI
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js # {highlight.js, prettify, highlight}
hitheme : monokai #
widgets : [] # {mathjax, quiz, bootstrap}
mode : standalone # {standalone, draft}
---
<style>
em {
font-style: italic
}
strong {
font-weight: bold;
}
pre {
font-size: 18px;
line-height: 20px;
}
</style>
```{r setup, include=FALSE}
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE,
comment=NA)
# upload images automatically
#opts_knit$set(upload.fun = imgur_upload)
```
## Overview
- What is R?
- What is RStudio?
- How does it work?
- What makes the language different?
- Why learn it?
<p align="center"><img src="img/dpilogo.png" height="81" width="138"></p>
---
## R
- R is an Open Source (and freely available) environment for statistical computing and graphics
- Available for Windows, Mac OS X, and Linux
- R is being actively developed with two major releases per year and dozens of releases of add on packages
- R can be extended with 'packages' that contain data, code, and documentation to add new functionality
---
## What Does it Look Like?
The R workspace in RStudio
<p align="center"><img src="img/workspacescreen.png" height="500" width="700"></p>
---
## A Bit of Histo**R**y
- R is a flavor of the **S** computer language
- S was developed by John Chambers at Bell Labs in the late 1970s
- In 1988 it was rewritten from a Fortran base to a C base
- Version 4 of S, the latest version, was finished in 1998, and won several awards
--- quote
## John Chambers on the Philosophy of S
<p><q>[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as <span class = 'red'>programming</span>. Then ... they should be able to <span class = 'red'>slide gradually</span> into programming, when the language and system aspects would become more important.</p></q>
---
## R is Born
- 1991 in New Zealand Ross Ihaka and Robert Gentleman create R
- Named for their first initials
- R is made public in 1993, and in 1995 Martin Maechler convinces the creators to make it open source with the GNU General Public License
- 1997 R Core Group is formed--the maintainers and main developers of R (about 14 members today)
- 2000 version 1.0.0 ships
- 2012 version 2.15.2 is available (2.16.0 is due out early in 2013)
---
## Why Use R
- R is a common tool among data experts at major universities
- No need to go through procurement, R can be installed in any environment on any machine and used with no licensing or agreements needed
- R source code is very readable to increase transparency of processes
- R code is easily borrowed from and shared with others
- R is incredibly flexible and can be adapted to specific local needs
- R is under incredibly active development, improving greatly, and supported wildly by both professional and academic developers
---
## Thoughts on Free
- R is free in many senses
1. R can be run and used for any purpose, commercial or non-commercial, profit or not-for-profit
2. R's source code is freely available so you can study how it works and adapt it to your needs.
3. R is free to redistribute so you can share it with your ~~enemies~~ friends
4. R is free to modify and those modifications are free to redistribute and may be adopted by the rest of the community!
---
## R Advantages Continued
- R is platform agnostic - Linux, Mac, PC, server, desktop, etc.
- R can output results in a variety of formats
- R can build routines straight out of a database for common and universal reporting
### R Can Compliment Other Tools
- R plays nicely with data from Stata, SPSS, SAS and others
- R can check work, produce output, visualize results from other programs
- R can do bleeding edge analyses that aren't available in proprietary packages yet
- R is becoming more prevalent in undergraduate statistics courses - more and more potential employees are learning it each year
---
## R's Popularity
R has recently passed Stata on Google Scholar hits and it is catching up to the two major players SPSS and SAS
<p align="center"><img src="img/googlescholar.png" height="450" width="625"></p>
---
## R Has an Active Web Presence
R is linked to from more and more sites
<p align="center"><img src="img/sitelinks.png" height="500" width="600"></p>
---
## R Extensions
These links come from the explosion of add-on packages to R
<p align="center"><img src="img/addons.png" height="500" width="600"></p>
---
## R Has an Active Community
Usage of the R listserv for help has really exploded recently
<p align="center"><img src="img/rlistserv.png" height="400" width="560"></p>
Data from Bob Muenchen [available online](http://r4stats.com/articles/popularity/)
---
## R's Drawbacks
- R is based on S, which is close to 40 years old
- R only has features that the community contributes
- Not the ideal solution to all problems
- R is a programming language and not a software package--steeper learning curve
- R can be much slower than compiled languages
---
## R Vocabulary
- **packages** are add on features to R that include data, new functions and methods, and extended capabilities. Think of them as ``apps'' on your phone. We've already installed several!
- **terminal** this is the main window of R where you enter commands
- **scripts** these are where you store commands to be run in the terminal later, like syntax files in SPSS or .do files in Stata
- **functions** commands that do something to an object in R
- **dataframe** the main element for statistical purposes, an object with rows and columns that includes numbers, factors, and other data types
- **workspace** the working memory of R where all objects are stored
- **vector** the basic unit of data in R
- **symbols** used to name and store objects or to designate operations/functions
- **attributes** determine how functions act on objects
---
## Components of an R Setup
- **R** - R works in the command line of any OS, but also comes with a basic GUI to operate on its own in Windows and Mac [download](http://cran.r-project.org/)
- **RStudio** - a much better way to work in R that allows editing of scripts, operation of R, viewing of the workspace, and R help all on one screen [download](http://rstudio.org/download/)
- We've already installed and configured these, but what might you want to use to go further after the Bootcamp is complete?
---
## Advanced R Setup
- **LaTeX** - for producing documents using R this is less necessary, but still useful. download [WIN](http://miktex.org/2.9/setup) [MAC](http://www.tug.org/mactex/2011/)
- **Dev Tools for R** - on Windows this is Rtools, on Linux and Mac it is installing the development mode of R download [WIN](http://www.stats.ox.ac.uk/pub/Rtools/R215x.html) [MAC](http://cran.r-project.org/bin/macosx/tools/)
- **Git** - for version control, sharing code, and collaboration this is essential. It integrates well with RStudio. [download](http://git-scm.com/download)
- **pandoc** - for converting output into other formats for sharing with non-user**R**s! [download](http://johnmacfarlane.net/pandoc/installing.html)
- **ImageMagick** - for creating more flexible graphics in R, including animations! [download](http://www.imagemagick.org/script/index.php) [alternate](http://www.graphicsmagick.org/)
---
## Open Source Toolchain
- This really represents a completely open source toolchain to going from a data analysis idea, to a full fledged professional report
- These tools are free, updated regularly, and available on **any** platform **today**
- This toolchain is scriptable, reproducible, and has been around for quite some time so it is stable and robust
- Lots of other programming languages use aspects of this toolchain
---
## Some Notes about Maintaining R
- Adding packages onto R means you also have to update them with the `update.packages()` command
- Package updates are at the whim of the package developer which may be a corporation, an R Core Member, an academic, or me
- Upgrading R on Windows, which is on a 6 month release cycle, is not straightforward yet, but there are lots of how-to guides online
- Generally, upgrading is recommended as there are usually a lot of performance enhancements, even with minor (2.15.X) releases
- We will walk through this a bit later, but remember that the flexibility in R means that users probably need to be self-supported in your organization
---
## Self-help
- In the spirit of open-source R is very much a self-guided tool
- Let's see, type: `?summary`
- Now type: `??regression`
- For tricky questions, funky error messages (there are many), and other issues, use Google (include "in R" to the end of your query)
- We can also use [RSeek](http://www.rseek.org/) - the search engine just for R!
- StackOverflow has become a great resource with many questions for many specific packages in R, and a rating system for answers
- A number of R Core members contribute there
---
## Self-help (2)
- Sometimes R Help can be a bit prickly and unfriendly...
- The most important part of getting help is being able to ask your question with a reproducible example (i.e. some short simulated data and code that doesn't do what you want)
- Like this:
```{r echo=TRUE,error=TRUE}
foo<-c(1,"b",5,7,0)
bar<-c(1,2,3,4,5)
foo+bar
```
- For R Help etiquette (for the tough problems) see [the great advice here](https://github.com/hadley/devtools/wiki/Reproducibility)
---
## Let's Look at RStudio
- RStudio has made R more accessible and easy to use than ever!
- Open up **Tutorial1.R**
- 4 panels, various tabs
- Help, plots, file structure
- Workspace, history, version control
- Working files
- The R Console
---
## The Data Frame
- What happens when you type:
```{r dataframeintro,echo=TRUE,results='hide'}
data(mtcars)
mtcars
```
```{r dataframeintro2,echo=FALSE,results='markup'}
mtcars[1:8,]
```
- This is how we'll look at R commands and their results the rest of the way
---
## A Data Frame
- A data frame is a lot like a traditional spreadsheet and is the central feature of data analysis in R
- You'll see we have names, stored with numbers, which other programming languages are not so keen on doing
- This is really handy, and data frames are the primary method we use to store data, manipulate data, etc.
- The difference between a data frame and spreadsheet in Excel is that in almost all cases, everything we do to the data frame will either be by entire row, or by entire column
- This sounds limiting, but in fact, it is incredibly powerful!
--- bg:black
## R As A Calculator
```{r computing,echo=TRUE,results='markup'}
2+2 ## add numbers
2*pi #multiply by a constant
7+runif(1,min=0,max=1) #add a random variable
4^4 # powers
sqrt(4^4) # functions
```
---
## Arithmetic Operators
- In addition to the obvious `+` `-` `=` `/` `*` and exponential `^`, there is also integer division `%/%` and remainder in integer division (known as modulo arithmetic) `%%`
```{r arithmetic}
2+2
2/2
2*2
2^2
2==2
23 %/% 2
23 %% 2
```
---
## Other Key Symbols
- `<-` is the assignment operator, it declares something is something else
```{r}
foo<-3
foo
```
- `:` is the sequence operator
```{r}
1:10
# it increments by one
a<-100:120
a
```
- **This is handy**
---
## Comments in R
- **#** denotes a comment in R
- Anything after the **#** is not evaluated and ignored in R
- This is handy for making things reproducible
```{r poundsigns,eval=FALSE,echo=TRUE,tidy=FALSE}
# Something I want to keep from R
# Like my secret from the R engine
# Maybe intended for a human and not the computer
# Like: Look at this cool plot!
myplot(readSS,mathSS,data=df)
```
---
## R Advanced Math
- R also supports advanced mathematical features and expressions
- R can take integrals and derivatives and express complex functions
- Easiest of all, R can generate distributions of data very easily
- e.g. `rnorm(100)` or `rbinom(100)`
- This comes in handy when writing examples and building analyses because it is trivial to generate a synthetic piece of data to use as an example
- Go ahead, try typing `hist(rnorm(10000))` into RStudio
---
## Using the Workspace
- To do more we need to learn how to manipulate the 'workspace'.
- This includes all the vectors, datasets, and functions stored in memory.
- All R objects are stored in the memory of the computer, limiting the available space for calculation to the size of the RAM on your machine.
- R makes organizing the workspace easy.
---
## Using the Workspace (2)
```{r}
x<-5 #store a variable with <-
x #print the variable
z<-3
ls() #list all variables
ls.str() #list and describe variables
rm(x) # delete a variable
ls()
```
---
## R as a Language
- R is more than statistical software, it is a computer language
- Like any language it has rules (some poorly enforced), and conventions
- You will learn more as you go, but we'll go over a few to start
1. Case sensitivity matters
```{r}
a<-3
A<-4
print(c(a,A))
```
* <font color="red">**a** ≠ **A**</font>
2. What happens if I type **print(a,A)**?
---
## `c` is our friend
- So what does **c** do?
```{r}
A<-c(3,4)
print(A)
```
- `c` stands for concatenate and allows vectors to have multiple elements
- If you ever need two elements in a vector, you need to wrap it up in `c`, which is one of the most used functions you will ever use
- `c` is important to put any vector together, but remember that objects within a vector must all be of the same type
---
## Language
- In language there are a number of ways to say the same thing
* <font color="green">The dog chased the cat.</font>
* <font color ="blue">The cat was chased by the dog.</font>
* <font color ="red">By the dog, the cat was chased.</font>
- Some ways are more elegant than others, all convey the same message.
```{r language}
a<-runif(100) # Generate 100 random numbers
b<-runif(100) # 100 more
c<-NULL # Setup for loop (declare variables)
for(i in 1:100){ # Loop just like in Java or C
c[i]<-a[i]*b[i]
}
d<-a*b
identical(c,d) # Test equality
```
- Which is nicer? **c** or **d**?
---
## More Language ~~Bugs~~ Features
- R is maddeningly inconsistent in it's naming conventions
* Some functions are `camelCase`; others `are.dot.separated`; others `use_underscores`
* Function results are stored in a variety of ways across function implementations
* R has multiple graphics packages that different functions use for default plot construction (`base`, `grid`, `lattice`, and `ggplot2`)
* R has multiple packages and functions to do the same analysis as well, though some standardization has started to occur
* Be flexible and be aware of R's flexibility
```{r eval=TRUE,echo=FALSE}
load("data/smalldata.rda")
```
---
## Objects
- Everything in R is an object - even functions
- Objects can be manipulated many ways
- A common example is applying the `summary` function to a variety of object types and seeing how it adapts
```{r}
summary(df[,28:31]) #summary look at df object
summary(df$readSS) #summary of a single column
```
-The `$` says to look for object **readSS** in object **df**
---
## Graphics too
```{r graphics1,message=FALSE,dev='svg',fig.cap='Student Test Scores', fig.width=8.2, fig.height=4.3,tidy=FALSE,warning=FALSE}
library(ggplot2) # Load graphics Package
library(eeptools)
qplot(readSS,mathSS,data=df,geom='point',alpha=I(0.3))+theme_dpi()+
opts(title='Test Score Relationship')+
geom_smooth()
```
---
## Handling Data in R
- R handles data differently than many other statistical packages
- In R, all elements are objects
```{r}
length(unique(df$school))
length(unique(df$stuid))
uniqstu<-length(unique(df$stuid))
uniqstu
```
- Results of function calls can be stored
---
## Special Operators
- The comparison operators `<`, `>`, `<=`, `>=`, `==`, and `!=` are used to compare values across vectors
```{r vectorcomp}
big<-c(9,12,15,25)
small<-c(9,3,4,2)
# Give us a nice vector of logical values
big>small
big=small
# Oops--don't do this, reassigns big to small
print(big)
print(small)
```
- Comparison operators can be tricky, so to keep it straight never use `=` or `==` to assign anything, always use `<-`
---
## Special Operators II
- The best way to evaluate these objects is to use brackets `[]` to avoid confusion
```{r brackets}
big<-c(9,12,15,25)
big[big==small]
# Returns values where the logical vector is true
big[big>small]
big[big<small] # Returns an empty set
```
---
## Special operators (III)
- The `%in%` operator determines whether each value in the left operand can be matched with one of the values in the right operand.
```{r specialoperand}
big<-c(9,12,15,25)
small<-c(9,12,15,25,9,1,3)
big[small %in% big]
```
- 9, 12, 15, and 25 all appear in `big`, but `small` also has objects that do not appear in `big` and so an NA is returned
- What if we reverse this?
```{r operand2}
big[big %in% small]
```
- No `NA`
---
## Special operators (IV)
- The logical operators `|` (or) and `&` (and) can be used to combine two logical values and produce another logical value as the result. The operator `!` (not) negates a logical value. These operators allow complex conditions to be constructed.
```{r vectorlogic}
foo<-c('a',NA,4,9,8.7)
!is.na(foo) # Returns TRUE for non-NA
class(foo)
a<-foo[!is.na(foo)]
a
class(a)
```
---
## Special operators (V)
- The operators `||` and `&&` are similar, but they combine two logical vectors. The comparison is performed element by element, so the result is also a logical vector.
```{r vectorlogic2}
zap<-c(1,4,8,2,9,11)
zap[zap>2 | zap<8]
zap[zap>2 & zap<8]
```
---
## Regular Expressions
- R also supports a full suite of regular expressions
- This could be material for a full tutorial in a more advanced bootcamp
- If you know and use regex, then rest assured you can keep using it in R
---
## R Data Modes
- R allows users to implement a number of different types of data
- The three basic data types are numeric data, character data, and logical data
- Vectors must be of one consistent type of data, so if you make a vector with multiple types, it generally defaults to being a character vector
---
## Data Modes in R (numeric)
- **numeric** vectors contain, as you would guess, numbers!
```{r}
is.numeric(A)
class(A)
print(A)
```
---
## Data Modes (Character)
- **character** is known as strings in other software, any characters that have no numeric meaning
```{r}
b<-c('one','two','three')
print(b)
is.numeric(b)
```
---
## Data Modes (Logical)
- **logical** is TRUE or FALSE values, useful for logical testing and programming
- We've already seen these returned when we have asked R a question before
```{r}
c<-c(TRUE,TRUE,TRUE,FALSE,FALSE,TRUE)
is.numeric(c)
is.character(c)
is.logical(c) ## Results in a logical value
```
---
## Easier way
- Just ask R using the `class` function
```{r class}
class(A)
class(b)
class(c)
```
---
## A Note on Vectors
- Vectors are collections of consistent data types
- **numeric** can either be double or integer depending on the *bytes* size
- **logical**
- **character**
- **complex**
- **raw**
- All vectors must be consistent among types, but some data objects like data frames can combine multiple vectors of different types
---
## Factor
- A factor is a very special and sometimes frustrating data type in R
```{r fac}
myfac<-factor(c("basic","proficient","advanced","minimal"))
class(myfac)
myfac # What order are the factors in?
```
- What if we don't like the order these are in? Factor order is important for all kinds of things like plot type, regression output, and more
---
## Ordering the Factor
- Ordered factors simply have an additional attribute explaining the order of the levels of a factor
- This is a useful shortcut when we want to preserve some of the meaning provided by the order
- Think cardinal data
```{r orderedfac}
myfac_o<-ordered(myfac,levels=c("minimal","basic","proficient","advanced"))
myfac_o
summary(myfac_o)
```
---
## Reclassifying Factors
- Turning factors into other data types can be tricky. All factor levels have an underlying numeric structure.
```{r fac2}
class(myfac_o)
unclass(myfac_o)
defac<-unclass(myfac_o)
defac
```
- What is wrong with this? Well--why would `minimal` be `2` and `basic` be `3`?
- Be careful! The best way to unpack a factor is to convert it to a character first.
---
## Defactor
```{r fac2.1,eval=FALSE,echo=TRUE}
# From the eeptools package
defac<-function(x){
x<-as.character(x)
x
}
```
```{r fac3}
defac(myfac_o)
defac<-defac(myfac_o)
defac
```
---
## Convert to Numeric?
- What if we do want it to be numeric?
- The best way to do this is to recode the variable manually--we'll discuss this later
- You can try to convert it to numeric though, but do at your own risk:
```{r numericfac}
myfac_o
as.numeric(myfac_o)
```
- If we did not properly specify the order above, this would be wrong!
```{r numericfacwrong}
myfac
as.numeric(myfac)
```
---
## Dates
- R has built-in ways to handle dates
- See `lubridate` package for more advanced functionality including mathematical operations on dates
```{r dates}
mydate<-as.Date("7/20/2012",format="%m/%d/%Y")
# Input is a character string and a parser
class(mydate) # this is date
weekdays(mydate) # what day of the week is it?
mydate+30 # Operate on dates
```
---
## More Dates
```{r moredates1}
# We can parse other formats of dates
mydate2<-as.Date("8-5-1988",format="%d-%m-%Y")
mydate2
mydate-mydate2
# Can add and subtract two date objects
```
---
## A few notes on dates
- R converts all dates to numeric values, like Excel and other languages
- The origin date in R is January 1, 1970
```{r moredates}
as.numeric(mydate) # days since 1-1-1970
as.Date(56,origin="2013-4-29") # we can set our own origin
```
---
## Other Classes
- R classes can be specified for any special purpose
- Like linear models
```{r linmod}
b<-rnorm(5000)
c<-runif(5000)
a<-b+c
mymod<-lm(a~b)
class(mymod)
```
---
## Why care so much about classes?
- Classes determine what you can and can't do with objects
- Different classes have different computational times associated with them, so the choice can affect the optimization of the code
- Classes allow you to keep projects/data organized
- **Because R makes you care**
---
## Data Structures in R
- R has a number of basic data classes as well as arbitrary specialized object types for various purposes
- **vectors** are the basic data class in R and can be thought of as a single column of data (even a column of length 1)
- **matrices and arrays** are rows and columns of all the same mode data
- **dataframes** are rows and columns where the columns can represent different data types
- **lists** are arbitrary combinations of disparate object types in R
---
## Vectors
- Everything is a vector in R, even single numbers
- Single objects are "atomic" vectors
```{r vectors}
print(1)
# The 1 in braces means this element is a vector of length 1
print("This tutorial is awesome")
# This is a vector of length 1 consisting of a single "string of characters"
```
---
## Vectors 2
```{r vectors2}
print(LETTERS)
# This vector has 26 character elements
print(LETTERS[6])
# The sixth element of this vector has length 1
length(LETTERS[6])
# The length of that element is a number with length 1
```
---
## Matrices
- Matrices are combinations of vectors of the same length and data type
- We can have numeric matrices, character matrices, or logical matrices
- Can't mix types
```{r matrix}
mymat<-matrix(1:36,nrow=6,ncol=6)
rownames(mymat)<-LETTERS[1:6]
colnames(mymat)<-LETTERS[7:12]
class(mymat)
```
---
## Matrices II
```{r matrix2}
rownames(mymat)
colnames(mymat)
mymat
```
---
## More Matrices
- We can add to matrices
```{r matrix3}
dim(mymat) # We have 6 rows and 6 columns
myvec<-c(5,3,5,6,1,2)
length(myvec) # What happens when you do dim(myvec)?
newmat<-cbind(mymat,myvec)
newmat
```
- Dataframes work similarly
---
## Matrix Functions
- We can do some basic math to matrices as well, like correlations
```{r matrix4}
foo.mat<-matrix(c(rnorm(100),runif(100),runif(100),rpois(100,2)),ncol=4)
head(foo.mat)
cor(foo.mat)
```
- The result is a matrix itself, but we can force it to be something else
---
## Converting Matrices
- Let's make a matrix be a dataframe
```{r matrixdance}
mycorr<-cor(foo.mat)
class(mycorr)
mycorr2<-as.data.frame(mycorr)
class(mycorr2)
mycorr2
```
---
## Arrays
- Arrays are a set of matrices of the same `dim` and `class`
- Arrays allow dimensions to be named
```{r array}
myarray<-array(1:42,dim=c(7,3,2),dimnames=list(c("tiny","small",
"medium","medium-ish","large","big","huge"),
c("slow","moderate","fast"),c("boring","fun")))
class(myarray)
dim(myarray)
```
---
## Arrays II
```{r array2}
dimnames(myarray)
myarray
```
---
## Lists
- Lists are arbitrary collections of objects.
- The objects do not have to be of the same type or same element or same dimensions
```{r lists}
mylist<-list(vec=myvec,mat=mymat,arr=myarray,date=mydate)
class(mylist)
length(mylist)
names(mylist)
```
---
## Print a List
```{r listprint}
str(mylist)
```
---
## Lists (II)
- R has two object classification schemes S3 and S4
- For S3 use `$` or `[[]]` to extract elements
- For S4 use `@` to extract elements
```{r lists2}
mylist$vec
mylist[[2]][1,3]
```
- Where are we getting the object in the second row from?
---
## So what?
- Matrices, lists, and arrays are useful for storing analyses results, generating reports, and doing analysis on many objects types
- We'll see examples of list and array manipulation later
- A useful tip is to use the `attributes` function to learn about the object
```{r attr}
attributes(mylist)
attributes(myarray)[1:2][2]
```
- They also provide simplified ways to get used to operating on dataframes by reducing complexity
---
## Dataframes
- Dataframes are combinations of vectors of the same length, but can be of different types
```{r dataframetypes,echo=TRUE,results='markup'}
str(df[ ,25:32])
```
- Data frames must have consistent dimensions
- Dataframes are what we use most commonly as a "dataset" for analysis
- Dataframes are what sets R apart from other programming languages like C, C++, Python, and Perl.
- The dataframe structure is much more complex and much easier to use than any datastructure in these languages - though Python is catching up!
---
## Converting Between Types
- R has built in functions to allow you to force objects to move between types
- These follow the general form `as.whatIwant` as in `as.factor` or `as.table` or `as.data.frame`
- You will use these commands a lot to convert output from various functions into a form you can input into a different function
- A good example is converting correlation matrices into dataframes so we can plot them
---
## Summing it Up
- Vectors are used to store bits of data
- Matrices are combinations of vectors of the same length and type
- Matrices are most commonly used in statistical models (in the background), and for computation
- Arrays are stacks of matrices and are used in building multiple models or for storing complex data structures
- Lists are groups of R objects commonly used to combine function output in useful ways (like store model results and model data together)
---
## Exercises
1. Create a matrix of 5x6 dimensions. Add a vector to it (as either a row or column). Identify element 2,3.
2. Convert the matrix to a data frame.
3. Look at the difference between data frames and matrices.
---
## Other References
- [An R Vocabulary for Starting Out](https://github.com/hadley/devtools/wiki/vocabulary)
- [UCLA Academic Technology Services: R Starter Kit](http://www.ats.ucla.edu/stat/r/sk/)
- [Quick R: Getting Started](http://www.statmethods.net/)
- [R Features List](http://www.revolutionanalytics.com/what-is-open-source-r/r-language-features/)
- [Video Tutorials](http://www.twotorials.com/)
- [Google's R Style Guide](http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html)
- [Hadley Wickham's R Style Guide](https://github.com/hadley/devtools/wiki/Style)
- [JSM 2012 R Introduction by Jay Emerson](http://www.stat.yale.edu/~jay/JSM2012/PDFs/intro.pdf)
- [Notes on R for Programmers by John Cook](http://www.johndcook.com/R_language_for_programmers.html)
---
## Books
- There are a number of great R books available for learning the beginnings of R and for learning specific statistical techniques
- "Discovering Statistics Using R" by Andy Field
- "Applied Econometrics with R" by Achim Zeileis and Christian Kleiber
- "The Art of R Programming" by Norman Matloff
- [The R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf)
- The R Book by Michael J. Crawley (new version imminent)
- The R Cookbook
---
## Session Info
It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:
```{r session-info}
print(sessionInfo(), locale=FALSE)
```
---
## Attribution and License
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
</p>
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
style="border-style: none;" alt="Public Domain Mark" />
</a>
<br />