freely available online: R for Data Science
R is a special-purpose programming language for data science and statistical computing
the authorities say not to think of R as a programming language!
think of it instead as a tool optimized for creating scripts to manipulate, plot and analyze data
a lot of innovation and development takes place in packages
go browse some 12,000 packages on CRAN
install packages (only once)
load packages (for every session)
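a minimal sketch of the install-once / load-every-session pattern; ggplot2 is only an example package name chosen here (an assumption), any CRAN package works the same way:

```r
# run once per machine (needs an internet connection), then leave commented out:
# install.packages("ggplot2")

# run at the start of every R session that uses the package;
# requireNamespace() only checks availability, so this sketch does not fail
# when the package has not been installed yet
pkg.available = requireNamespace("ggplot2", quietly = TRUE)
if (pkg.available) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```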
integrated development environment for R
for all base R stuff, check the R manual
## [1] 42
## [1] 2 3 4
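the commands behind the two outputs above are not preserved; one hypothetical pair of base R expressions that gives the same results (an assumption, not necessarily the original code):

```r
# scalar arithmetic
answer = 6 * 7
answer

# arithmetic is vectorized: 1 is added to every element
shifted = c(1, 2, 3) + 1
shifted
```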
qplot {ggplot2} R Documentation
Quick plot
Description
qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.
Usage
qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
  main = NULL, xlab = deparse(substitute(x)),
  ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)
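a minimal sketch of a qplot() call following the usage above, assuming ggplot2 is installed; the built-in mtcars data and the axis labels are chosen here only for illustration. recent ggplot2 releases mark qplot() as deprecated, so a deprecation warning may appear:

```r
library(ggplot2)

# scatter plot of fuel efficiency against weight; geom = "point" is also
# what geom = "auto" picks when both x and y are supplied
p = qplot(x = wt, y = mpg, data = mtcars, geom = "point",
          xlab = "weight (1000 lbs)", ylab = "miles per gallon")
print(p)
```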
## [1] "double"
c()
## [1] 10 20 30
## [1] 1 1
## [1] 20
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## [1] 1 3 5
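the code for most of the vector and matrix outputs above is not shown; a sketch of base R calls that would produce the "double", 10 20 30, 20, matrix and 1 3 5 outputs (the variable names x and m are assumptions):

```r
typeof(3)                  # numbers are stored as "double" by default

x = c(10, 20, 30)          # c() combines values into a vector
x
x[2]                       # indexing starts at 1, so this is 20

m = matrix(1:6, nrow = 2)  # a matrix is filled column by column
m
m[1, ]                     # first row: 1 3 5
```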
## [1] "character"
## [1] "huhu" "hello" "huhu" "ciao"
## [1] huhu hello huhu ciao
## Levels: ciao hello huhu
## [1] huhu hello huhu ciao
## Levels: huhu < ciao < hello
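a sketch of how the character vector and the two factors above could be created; the variable names are assumptions, the values and level orders are taken from the outputs:

```r
chr.vector = c("huhu", "hello", "huhu", "ciao")
typeof(chr.vector)
chr.vector

# unordered factor: levels default to alphabetical order
f = factor(chr.vector)
f

# ordered factor with an explicit level order
f.ordered = factor(chr.vector,
                   levels = c("huhu", "ciao", "hello"),
                   ordered = TRUE)
f.ordered
```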
exp.data = data.frame(trial = 1:5,
                      condition = factor(c("C1", "C2", "C1", "C3", "C2"),
                                         ordered = TRUE),
                      response = c(121, 133, 119, 102, 156))
exp.data
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156
## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3
##   trial condition response
## 3     3        C1      119
## # A tibble: 5 x 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156
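the calls behind the last four outputs are not shown; a sketch of one way to obtain them, assuming the tibble package is installed (exp.tibble is a name introduced here, and row 3 could equally have been selected by a condition):

```r
library(tibble)

exp.data = data.frame(trial = 1:5,
                      condition = factor(c("C1", "C2", "C1", "C3", "C2"),
                                         ordered = TRUE),
                      response = c(121, 133, 119, 102, 156))

exp.data$condition                 # one column as a vector (an ordered factor)
exp.data[3, ]                      # one row, all columns

exp.tibble = as_tibble(exp.data)   # data frame -> tibble
exp.tibble
as.data.frame(exp.tibble)          # tibble -> plain data frame
```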
mydist is associated with four functions:
dmydist(x, ...) gives the probability (mass/density) \(f(x)\) for x
pmydist(x, ...) gives the cumulative distribution function \(F(x)\) for x
qmydist(p, ...) gives the value \(x\) for which p = pmydist(x, ...)
rmydist(n, ...) returns n samples from the distribution
##     IQ     RT
## 113.75  75.75
##     IQ     RT
## 113.75  75.75
## IQ RT
## 25 40
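the d/p/q/r naming scheme for distributions, sketched for the normal distribution (mydist = norm); the parameter values and the seed are arbitrary choices for illustration:

```r
dnorm(0, mean = 0, sd = 1)        # density f(0) of the standard normal
p = pnorm(1.96)                   # cumulative probability F(1.96)
qnorm(p)                          # inverse of pnorm, gives back 1.96

set.seed(123)                     # make the random draws reproducible
samples = rnorm(5, mean = 100, sd = 15)
samples
```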
data from experimental (psych) studies is usually rectangular data
examples of (usually) non-rectangular data:
the tidyverse is particularly efficient for dealing with tidy rectangular data
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
study Chapters 5 and 12 from R for Data Science
this is untidy if we want to analyze/plot grade as a function of exam type
grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))
grades
## # A tibble: 3 x 3
##   name    midterm final
##   <chr>     <dbl> <dbl>
## 1 Michael     3.7   4
## 2 Noa         1     1.3
## 3 MadEye      1.3   1
to tidy up, we need to gather columns which are not separate variables into a new column
## # A tibble: 6 x 3
##   name    exam    grade
##   <chr>   <chr>   <dbl>
## 1 Michael midterm   3.7
## 2 Noa     midterm   1
## 3 MadEye  midterm   1.3
## 4 Michael final     4
## 5 Noa     final     1.3
## 6 MadEye  final     1
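the gathering call itself is not shown; a sketch that reproduces the tidy table with tidyr's gather(), assuming tibble and tidyr are installed (the result names are introduced here). pivot_longer() is the newer tidyr interface for the same reshaping and is included for comparison:

```r
library(tibble)
library(tidyr)

grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))

# gather(): the columns midterm and final are collected into key/value pairs
tidy.grades = gather(grades, key = "exam", value = "grade", midterm, final)
tidy.grades

# pivot_longer(): same content, but rows are ordered person by person
tidy.grades.2 = pivot_longer(grades, cols = c(midterm, final),
                             names_to = "exam", values_to = "grade")
tidy.grades.2
```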
this is untidy if we want to analyze grade as a function of participation
results = tibble(name = c('Michael', 'Noa', 'MadEye',
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'), each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))
results
## # A tibble: 6 x 3
##   name    what          howmuch
##   <chr>   <chr>           <dbl>
## 1 Michael grade             3.7
## 2 Noa     grade             1
## 3 MadEye  grade             1
## 4 Michael participation    55
## 5 Noa     participation   100
## 6 MadEye  participation   100
to tidy up, we need to spread cells from a row out over several columns
## # A tibble: 3 x 3
##   name    grade participation
##   <chr>   <dbl>         <dbl>
## 1 MadEye    1             100
## 2 Michael   3.7            55
## 3 Noa       1             100
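the spreading call itself is not shown; a sketch that reproduces the tidy table with tidyr's spread(), assuming tibble and tidyr are installed (the result names are introduced here). pivot_wider() is the newer tidyr interface for the same reshaping:

```r
library(tibble)
library(tidyr)

results = tibble(name = c('Michael', 'Noa', 'MadEye',
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'), each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))

# spread(): each value of `what` becomes its own column, filled from `howmuch`
tidy.results = spread(results, key = what, value = howmuch)
tidy.results

# pivot_wider(): same content, rows kept in order of first appearance
tidy.results.2 = pivot_wider(results, names_from = what, values_from = howmuch)
tidy.results.2
```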
for background see Wickham (2010)
ggplot call, from the cheat sheet
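a minimal sketch of the general shape of a ggplot call (data, aesthetic mapping, then layers added with +), assuming ggplot2 is installed; the mtcars data and the labels are chosen here only for illustration:

```r
library(ggplot2)

# ggplot() sets up the data and the aesthetic mappings;
# layers such as geom_point() are then added with +
p = ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "weight (1000 lbs)", y = "miles per gallon")
print(p)
```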
prepare, analyze & plot data right inside your document
export to a variety of different formats
headers & sections
emphasis, highlighting etc.
extension of markdown to dynamically integrate R output
multiple output formats:
cheat sheet and a quick tour
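a sketch of what a minimal R Markdown source looks like and how it could be rendered from R; the file contents and names are made up for illustration, and rendering assumes the rmarkdown package and pandoc are available, which is why that step is guarded:

```r
# a minimal R Markdown source: YAML header, a section header, emphasis,
# and one R chunk whose output is inserted into the rendered document
rmd.lines = c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "## A section",
  "",
  "Some *emphasized* text, followed by a chunk:",
  "",
  "```{r}",
  "mean(c(1, 2, 3))",
  "```"
)
rmd.file = tempfile(fileext = ".Rmd")
writeLines(rmd.lines, rmd.file)

# render to HTML only if the required tools are present
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd.file, quiet = TRUE)
}
```

changing the `output:` field in the header (for example to `pdf_document`) is how a different output format is selected.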
inline equations with $\theta$
equation blocks with
$$ \begin{align*} E &= mc^2 \\
& = \text{a really smart formula}
\end{align*} $$
caveat
LaTeX-style formulas will be rendered differently depending on the output method: