freely available online: R for Data Science
special-purpose programming language for data science and statistical computing
authority says to tell you to not think of R as a programming language!
think of it as a tool optimized for creating scripts to manipulate, plot and analyze data
a lot of innovation and development takes place in packages
go browse some 12,000 packages on CRAN
install packages (only once)
load packages (for every session)
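A minimal sketch of the install-once / load-every-session pattern. The helper name `use_package` is hypothetical, and the demo call uses `stats` only because it ships with R, so nothing gets downloaded:

```r
# hypothetical helper: install a package only if it is missing, then load it
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # needed only once per machine
  }
  library(pkg, character.only = TRUE)   # needed in every new session
  invisible(TRUE)
}

use_package("stats")
```

In an interactive session the plain calls `install.packages("ggplot2")` (once) and `library(ggplot2)` (every session) do the same job.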
integrated development environment for R
for all base R stuff, check the R manual
## [1] 42
## [1] 2 3 4
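The two result lines above have the shape R prints for a single number and for a numeric vector. A small sketch of expressions that would print exactly these values; the expressions themselves are assumptions, since the original input is not shown:

```r
# a single number is just a vector of length one
answer <- 6 * 7
print(answer)     # [1] 42

# arithmetic is vectorized: 1 is added to every element
shifted <- c(1, 2, 3) + 1
print(shifted)    # [1] 2 3 4
```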
qplot {ggplot2} R Documentation
Quick plot
Description
qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.
Usage
qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
main = NULL, xlab = deparse(substitute(x)),
ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)
## [1] "double"
c()
## [1] 10 20 30
## [1] 1 1
## [1] 20
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## [1] 1 3 5
## [1] "character"
## [1] "huhu" "hello" "huhu" "ciao"
## [1] huhu hello huhu ciao
## Levels: ciao hello huhu
## [1] huhu hello huhu ciao
## Levels: huhu < ciao < hello
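The character and factor output above can be reproduced with base R. This is a sketch; the variable name `greetings` is an assumption, only the printed values come from the slides:

```r
greetings <- c("huhu", "hello", "huhu", "ciao")
typeof(greetings)                  # "character"

# unordered factor: levels are sorted alphabetically by default
greet_factor <- factor(greetings)
levels(greet_factor)               # "ciao" "hello" "huhu"

# ordered factor: the order of the levels is given explicitly
greet_ordered <- factor(greetings, ordered = TRUE,
                        levels = c("huhu", "ciao", "hello"))
greet_ordered                      # Levels: huhu < ciao < hello
```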
exp.data = data.frame(trial = 1:5,
                      condition = factor(c("C1", "C2", "C1",
                                           "C3", "C2"),
                                         ordered = TRUE),
                      response = c(121, 133, 119, 102, 156))
exp.data
## trial condition response
## 1 1 C1 121
## 2 2 C2 133
## 3 3 C1 119
## 4 4 C3 102
## 5 5 C2 156
## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3
## trial condition response
## 3 3 C1 119
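The column and the single row shown above can be pulled out of the data frame with `$` and `[row, column]` indexing. A sketch that rebuilds `exp.data` so it runs on its own; the last line is an extra illustration, not taken from the slides:

```r
exp.data <- data.frame(trial = 1:5,
                       condition = factor(c("C1", "C2", "C1", "C3", "C2"),
                                          ordered = TRUE),
                       response = c(121, 133, 119, 102, 156))

exp.data$condition        # one column, here an ordered factor
third_row <- exp.data[3, ]                    # third row, all columns

# rows can also be picked by a logical condition
slow_trials <- exp.data[exp.data$response > 130, "trial"]
```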
## # A tibble: 5 x 3
## trial condition response
## <int> <ord> <dbl>
## 1 1 C1 121
## 2 2 C2 133
## 3 3 C1 119
## 4 4 C3 102
## 5 5 C2 156
## trial condition response
## 1 1 C1 121
## 2 2 C2 133
## 3 3 C1 119
## 4 4 C3 102
## 5 5 C2 156
a distribution mydist is associated with four functions:
dmydist(x, ...) gives the probability (mass/density) \(f(x)\) for x
pmydist(x, ...) gives the cumulative distribution function \(F(x)\) for x
qmydist(p, ...) gives the value \(x\) for which p = pmydist(x, ...)
rmydist(n, ...) returns n samples from the distribution
## IQ RT
## 113.75 75.75
## IQ RT
## 113.75 75.75
## IQ RT
## 25 40
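The d/p/q/r naming scheme for distributions described earlier can be tried out with the normal distribution, where `mydist` is `norm`. A sketch; the parameter values are arbitrary choices for illustration:

```r
# d: density f(x) of the standard normal at x = 0
density_at_zero <- dnorm(0, mean = 0, sd = 1)

# p: cumulative probability F(x) = P(X <= x)
prob_below <- pnorm(1.96)

# q: quantile function, the inverse of p
quantile_back <- qnorm(prob_below)   # recovers 1.96 up to numerical error

# r: n random samples (seed fixed only to make the run reproducible)
set.seed(2018)
samples <- rnorm(5, mean = 100, sd = 15)
```

The same pattern holds for other built-in distributions, e.g. `dbinom`, `pbinom`, `qbinom`, `rbinom`.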
data from experimental (psych) studies is usually rectangular data
examples of (usually) non-rectangular data:
the tidyverse is particularly efficient for dealing with tidy rectangular data
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
study Chapters 5 and 12 from R for Data Science
this is untidy if we want to analyze/plot grade as a function of exam type
grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))
grades
## # A tibble: 3 x 3
## name midterm final
## <chr> <dbl> <dbl>
## 1 Michael 3.7 4
## 2 Noa 1 1.3
## 3 MadEye 1.3 1
to tidy up, we need to gather columns which are not separate variables into a new column
## # A tibble: 6 x 3
## name exam grade
## <chr> <chr> <dbl>
## 1 Michael midterm 3.7
## 2 Noa midterm 1
## 3 MadEye midterm 1.3
## 4 Michael final 4
## 5 Noa final 1.3
## 6 MadEye final 1
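The long table above is what `tidyr::gather()` produces from `grades`. A sketch, assuming the tidyr and tibble packages are installed; the exact call used on the slides is not shown, so this is a reconstruction:

```r
library(tidyr)

grades <- tibble::tibble(name    = c("Michael", "Noa", "MadEye"),
                         midterm = c(3.7, 1.0, 1.3),
                         final   = c(4.0, 1.3, 1.0))

# collect the columns midterm and final into key/value pairs:
# the old column names go to `exam`, the cell values to `grade`
grades_long <- gather(grades, key = "exam", value = "grade", midterm, final)
```

Newer tidyr versions also offer `pivot_longer()` for the same kind of reshaping.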
this is untidy if we want to analyze grade as a function of participation
results = tibble(name = c('Michael', 'Noa', 'MadEye',
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'),
                            each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))
results
## # A tibble: 6 x 3
## name what howmuch
## <chr> <chr> <dbl>
## 1 Michael grade 3.7
## 2 Noa grade 1
## 3 MadEye grade 1
## 4 Michael participation 55
## 5 Noa participation 100
## 6 MadEye participation 100
to tidy up, we need to spread cells from a row out over several columns
## # A tibble: 3 x 3
## name grade participation
## <chr> <dbl> <dbl>
## 1 MadEye 1 100
## 2 Michael 3.7 55
## 3 Noa 1 100
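The wide table above is what `tidyr::spread()` produces from `results`. A sketch, assuming the tidyr and tibble packages are installed; the exact call used on the slides is not shown, so this is a reconstruction:

```r
library(tidyr)

results <- tibble::tibble(name    = rep(c("Michael", "Noa", "MadEye"), times = 2),
                          what    = rep(c("grade", "participation"), each = 3),
                          howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))

# the values of `what` become new column names,
# the values of `howmuch` fill the new columns
results_wide <- spread(results, key = what, value = howmuch)
```

Newer tidyr versions also offer `pivot_wider()` for the same kind of reshaping.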
“Some of the circles are black.”
## Parsed with column specification:
## cols(
## id = col_integer(),
## language = col_character(),
## rt = col_integer(),
## type = col_character(),
## response = col_integer(),
## nr_black = col_integer(),
## variant = col_character(),
## comments = col_character()
## )
## # A tibble: 5,112 x 8
## id language rt type response nr_black variant comments
## <int> <chr> <int> <chr> <int> <int> <chr> <chr>
## 1 1 English 3930 filler 1 5 C No
## 2 1 English 3108 most 0 5 C No
## 3 1 English 2599 filler 1 8 C No
## 4 1 English 4405 many 1 7 C No
## 5 1 English 2574 some 1 6 C No
## 6 1 English 1917 filler 1 3 C No
## 7 1 English 2471 filler 0 3 C No
## 8 1 English 2495 many 0 6 C No
## 9 1 English 2093 some 1 9 C No
## 10 1 English 1767 filler 0 2 C No
## # ... with 5,102 more rows
## [1] "bonuses always help with a toddler in the home ;)"
## [2] "Cheers."
## [3] "cool!"
## [4] "Easy HIT thanks!"
## [5] "Everything worked fine, thanks"
## [6] "fun"
## [7] "Fun and interactive. Thank you!"
## [8] "fun fun fun"
## [9] "Fun study"
## [10] "Fun study, thanks"
## [11] "fun survey"
## [12] "Fun, thanks!"
## [13] "good hit"
## [14] "Good luck with your research!"
## [15] "great hit"
## [16] "Great hit, good luck with your research."
## [17] "Great HIT!"
## [18] "Great survey. Thank you!"
## [19] "Hi"
## [20] "I accidentally clicked \"false\" on one of the \"some are black\" statements. It was the one where around half were black"
##
## American English          Egnlish          Enblish         Englashi
##                8               13                8                8
##          english          English          ENGLISH          englsih
##             2410             2457              103               26
##          englush         Enlglish           FRENCH         Japanese
##               13                8               13                8
##          Russian          Spanish            Tamil            white
##                8                8                8               13
d = dplyr::filter(d, ! language %in% c("FRENCH", "Japanese", "Russian", "Spanish", "Tamil", "white"))
table(d$language)
##
## American English          Egnlish          Enblish         Englashi
##                8               13                8                8
##          english          English          ENGLISH          englsih
##             2410             2457              103               26
##          englush         Enlglish
##               13                8
## # A tibble: 1,449 x 5
## id rt response nr_black variant
## <int> <int> <int> <int> <chr>
## 1 1 2574 1 6 C
## 2 1 2093 1 9 C
## 3 1 2543 1 3 C
## 4 2 1857 4 5 B
## 5 2 11454 4 10 B
## 6 2 2053 4 3 B
## 7 3 1479 1 10 D
## 8 3 1640 0 0 D
## 9 3 1199 1 7 D
## 10 4 4828 6 6 B
## # ... with 1,439 more rows
## # A tibble: 1,449 x 5
## id rt response condition variant
## <int> <int> <int> <int> <chr>
## 1 1 2574 1 6 C
## 2 1 2093 1 9 C
## 3 1 2543 1 3 C
## 4 2 1857 4 5 B
## 5 2 11454 4 10 B
## 6 2 2053 4 3 B
## 7 3 1479 1 10 D
## 8 3 1640 0 0 D
## 9 3 1199 1 7 D
## 10 4 4828 6 6 B
## # ... with 1,439 more rows
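The two tables above look like the result of keeping only the `some` trials, dropping unused columns, and renaming `nr_black` to `condition`. A sketch of these three dplyr steps on a hypothetical toy tibble; the toy values and the names `toy` and `some_trials` are assumptions, and the original calls are not shown:

```r
library(dplyr)

# hypothetical toy data with a subset of the columns of the parsed file
toy <- tibble::tibble(id       = c(1L, 1L, 2L),
                      rt       = c(3108L, 2574L, 1857L),
                      type     = c("most", "some", "some"),
                      response = c(0L, 1L, 4L),
                      nr_black = c(5L, 6L, 5L),
                      variant  = c("C", "C", "B"))

some_trials <- toy %>%
  filter(type == "some") %>%                        # keep matching rows
  select(id, rt, response, nr_black, variant) %>%   # keep these columns
  rename(condition = nr_black)                      # new name = old name
```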
d = d %>%
  dplyr::mutate(dependent.measure = ifelse(variant %in% c("A", "B"), "ordinal", "binary"),
                alternatives = factor(ifelse(variant %in% c("A", "C"), "present", "absent"))) %>%
  dplyr::select(-variant)
d
## # A tibble: 1,449 x 6
## id rt response condition dependent.measure alternatives
## <int> <int> <int> <int> <chr> <fct>
## 1 1 2574 1 6 binary present
## 2 1 2093 1 9 binary present
## 3 1 2543 1 3 binary present
## 4 2 1857 4 5 ordinal absent
## 5 2 11454 4 10 ordinal absent
## 6 2 2053 4 3 ordinal absent
## 7 3 1479 1 10 binary absent
## 8 3 1640 0 0 binary absent
## 9 3 1199 1 7 binary absent
## 10 4 4828 6 6 ordinal absent
## # ... with 1,439 more rows
d = d %>%
  mutate(response = purrr::map2_dbl(dependent.measure, response,
                                    function(x, y) { ifelse(x == "ordinal", (y - 1) / 6, y) }))
d
## # A tibble: 1,449 x 6
## id rt response condition dependent.measure alternatives
## <int> <int> <dbl> <int> <chr> <fct>
## 1 1 2574 1 6 binary present
## 2 1 2093 1 9 binary present
## 3 1 2543 1 3 binary present
## 4 2 1857 0.5 5 ordinal absent
## 5 2 11454 0.5 10 ordinal absent
## 6 2 2053 0.5 3 ordinal absent
## 7 3 1479 1 10 binary absent
## 8 3 1640 0 0 binary absent
## 9 3 1199 1 7 binary absent
## 10 4 4828 0.833 6 ordinal absent
## # ... with 1,439 more rows
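The rescaling step above maps ordinal ratings onto the interval from 0 to 1 via (y - 1) / 6 and leaves binary responses untouched. Since `ifelse()` is already vectorized, the same arithmetic can be sketched in base R without `purrr`. The helper name `rescale_response` is hypothetical, and the 1 to 7 rating scale is an assumption read off the divisor 6:

```r
# hypothetical helper: rescale ordinal ratings (assumed 1..7) to [0, 1],
# keep binary 0/1 responses as they are
rescale_response <- function(measure, response) {
  ifelse(measure == "ordinal", (response - 1) / 6, response)
}

rescaled <- rescale_response(measure  = c("binary", "ordinal", "ordinal"),
                             response = c(1, 4, 6))
rescaled   # 1, 0.5 and 5/6, i.e. the 1, 0.5 and 0.833 seen in the table
```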
## # A tibble: 2 x 2
## dependent.measure mean.response
## <chr> <dbl>
## 1 binary 0.785
## 2 ordinal 0.600
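A two-row summary like the one above comes from grouping by a single variable and summarizing within each group. A sketch on hypothetical toy data; the values and the names `toy` and `overall` are assumptions, while the column names follow the slides:

```r
library(dplyr)

# hypothetical toy data
toy <- tibble::tibble(dependent.measure = c("binary", "binary", "ordinal", "ordinal"),
                      response          = c(1, 0, 0.5, 1))

overall <- toy %>%
  group_by(dependent.measure) %>%               # one group per measure
  summarize(mean.response = mean(response))     # one row per group
```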
resp.summary = d %>%
  dplyr::group_by(dependent.measure, alternatives, condition) %>%
  dplyr::summarize(mean.response = mean(response))
resp.summary
## # A tibble: 44 x 4
## # Groups: dependent.measure, alternatives [?]
## dependent.measure alternatives condition mean.response
## <chr> <fct> <int> <dbl>
## 1 binary absent 0 0.0909
## 2 binary absent 1 0.478
## 3 binary absent 2 0.778
## 4 binary absent 3 0.958
## 5 binary absent 4 0.964
## 6 binary absent 5 1
## 7 binary absent 6 0.938
## 8 binary absent 7 0.98
## 9 binary absent 8 0.929
## 10 binary absent 9 0.967
## # ... with 34 more rows
ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point()

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point() + geom_line() + facet_grid( . ~ dependent.measure)

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point() + geom_line() + facet_grid( . ~ dependent.measure) +
  xlab("number of black balls") + ylab("mean response") +
  scale_x_continuous(breaks = 0:10) + scale_color_manual(values = c("darkgrey", "firebrick"))
prepare, analyze & plot data right inside your document
export to a variety of different formats
headers & sections
emphasis, highlighting etc.
extension of markdown to dynamically integrate R output
multiple output formats:
cheat sheet and a quick tour
inline equations with $\theta$
equation blocks with
$$ \begin{align*} E &= mc^2 \\
& = \text{a really smart formula}
\end{align*} $$
caveat
LaTeX-style formulas will be rendered differently depending on the output method:
do it all in one file BDACM_HW1-LastnameFirstname.Rmd
use a header that generates HTML files, like this:
---
title: "My flawless first homework set"
date: 2018-11-30
output: html_document
---
follow the instructions given in the first homework assignment
send the *.Rmd and the *.HTML as a *.zip
avoid using extra material not included in the *.Rmd