Using R

Michael Franke

topics for today

basics of R
tidyverse
tidy data
data wrangling
plotting
Rmarkdown

R for data science

R4DS

R4DS cover

freely available online: R for Data Science

data science?

data scientist

what R is (not)

special purpose programming language for ~~data science~~ statistical computing
- statistics, data mining, data visualization
the authorities advise you not to think of R as a programming language!
think of it as a tool optimized for creating scripts to manipulate, plot and analyze data

diagram from 'R for Data Science'

past & present

a trusted old friend from 1993
still thriving
- see TIOBE ranking (based on search query results)

TIOBE index

extensibility & community support

a lot of innovation and development takes place in packages

go browse some 12,000 packages on CRAN

install packages (only once)

install.packages('tidyverse')

load packages (for every session)

library(tidyverse)

base R & package functions

base R functionality is always available

x = seq(from = 1, to = 10, length.out = 1000)
plot(x,x^2)

packages bring extra functions

library(ggplot2)
ggplot2::qplot(x,x^2)

tidyverse

overview of tidyverse

tidyverse website

RStudio

integrated development environment for R

RStudio screenshot

cheat sheet

basics of R

overview

basic properties of R
data types
- numbers, vectors & matrices
- characters & factors
- lists, data.frames & tibbles
probability distributions
functional programming elements
functions

for all base R stuff, check the R manual

general remarks about R

free (GNU General Public License)
interpreted language

6 * 7

## [1] 42

vector/matrix based

x = c(1,2,3)
x + 1

## [1] 2 3 4

supports object-oriented, procedural & functional styles
convenient interfaces to other languages
assignment in both directions possible

x <- 3
3 -> y
x == y

## [1] TRUE

help

help('qplot')

qplot {ggplot2} R Documentation
Quick plot

Description

qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.  
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.

Usage

qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
  main = NULL, xlab = deparse(substitute(x)),
  ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)

numbers, vectors & matrices

standard number precision is double

typeof(2)

## [1] "double"

vectors are declared using c()

x = c(10,20,30)
x

## [1] 10 20 30

everything is a vector (possibly length 1)

c(length(200), length("huhu"))

## [1] 1 1

indexing starts at 1

x[2]

## [1] 20
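
beyond single positions, vectors can be indexed by ranges, negative positions and logical tests. a minimal sketch, reusing the vector x from above; the names first.two, without.first and large.only are made up for illustration:

```r
x = c(10, 20, 30)

first.two     = x[1:2]     # positions 1 and 2
without.first = x[-1]      # a negative index drops that position
large.only    = x[x > 15]  # a logical index keeps elements where the test is TRUE

first.two      # 10 20
without.first  # 20 30
large.only     # 20 30
```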

numbers, vectors & matrices (2)

column-major mode

m = matrix(c(1,2,3,4,5,6), nrow = 2)
m

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m[1,]

## [1] 1 3 5

vectors are column vectors

m %*% x ## matrix product

##      [,1]
## [1,]  220
## [2,]  280
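
column-major filling is only the default. a small sketch contrasting it with row-wise filling; m.col and m.row are illustrative names, not part of the slides:

```r
m.col = matrix(1:6, nrow = 2)                # filled column by column (default)
m.row = matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row

m.col[1, ]     # first row: 1 3 5
m.row[1, ]     # first row: 1 2 3
dim(m.col)     # 2 rows, 3 columns
dim(t(m.col))  # the transpose has 3 rows, 2 columns
m.col[2, 3]    # single cell in row 2, column 3: 6
```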

character vectors and factors

strings are called characters

typeof("huhu")

## [1] "character"

vector of characters

chr.vector = c("huhu", "hello", "huhu", "ciao")
chr.vector

## [1] "huhu"  "hello" "huhu"  "ciao"

factors track levels

factor(chr.vector)

## [1] huhu  hello huhu  ciao 
## Levels: ciao hello huhu

ordered factors arrange their levels

factor(chr.vector, ordered = TRUE,
       levels = c("huhu", "ciao", "hello"))

## [1] huhu  hello huhu  ciao 
## Levels: huhu < ciao < hello
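
what "tracking levels" means can be seen by looking inside a factor. a minimal sketch using the same character vector; f is an illustrative name:

```r
chr.vector = c("huhu", "hello", "huhu", "ciao")
f = factor(chr.vector)

levels(f)      # distinct values, sorted: "ciao" "hello" "huhu"
as.integer(f)  # internal codes pointing into the levels: 3 2 3 1
table(f)       # number of occurrences per level
```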

lists & data frames

lists are key-value pairs

my.list = list(dudu = 1,
               chacha = c("huhu", "ciao"))
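
a short sketch of how the values behind the keys can be retrieved, using the same list as above:

```r
my.list = list(dudu = 1,
               chacha = c("huhu", "ciao"))

names(my.list)       # the keys: "dudu" "chacha"
my.list$dudu         # access by name with $
my.list[["chacha"]]  # double brackets return the element itself
my.list["chacha"]    # single brackets return a list containing the element
```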

data frames as lists of same-length vectors

exp.data = data.frame(trial = 1:5,
              condition = factor(c("C1", "C2", "C1", 
                                   "C3", "C2"),
                                 ordered = TRUE),
              response = c(121, 133, 119, 102, 156))
exp.data

##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

access columns

exp.data$condition

## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3

access rows

exp.data[3,]

##   trial condition response
## 3     3        C1      119
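
rows and columns can also be combined, and rows can be picked by a logical test. a minimal sketch on a simplified copy of the data frame (condition kept as plain characters here); c1.rows and mean.c1 are illustrative names:

```r
exp.data = data.frame(trial = 1:5,
                      condition = c("C1", "C2", "C1", "C3", "C2"),
                      response = c(121, 133, 119, 102, 156))

exp.data[3, "response"]                       # single cell: 119
c1.rows = exp.data[exp.data$condition == "C1", ]  # all rows from condition C1
mean.c1 = mean(c1.rows$response)              # (121 + 119) / 2 = 120
```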

tibbles

tibbles are data frames in the tidyverse

as_tibble(exp.data)

## # A tibble: 5 x 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156

compare to previous data frame

exp.data

##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

some differences

my.tibble    = tibble(x = 1:10, y = x^2)      ## dynamic construction possible
my.dataframe = data.frame(x = 1:10, y = x^2)  ## ERROR :/

probability distributions in R

R has many built-in probability distributions
- normal distribution
- beta distribution
- …
additional distributions supplied by packages
- multi-variate normal
- Dirichlet
- …
each distribution mydist is associated with four functions:
1. dmydist(x, ...) gives the probability (mass/density) $f(x)$ for x
2. pmydist(x, ...) gives the cumulative distribution function $F(x)$ for x
3. qmydist(p, ...) gives the value $x$ for which p = pmydist(x, ...)
4. rmydist(n, ...) returns n samples from the distribution

example

x = seq(-5, 5, length.out = 1000)
y = dnorm(x, mean = 1, sd = 0.5)
plot(x,y)
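
the other three function types can be sketched for the same normal distribution; p, q and samples are illustrative names, and the seed is an arbitrary choice:

```r
p = pnorm(1, mean = 1, sd = 0.5)  # cumulative probability at the mean: 0.5
q = qnorm(p, mean = 1, sd = 0.5)  # quantile function inverts pnorm: back to 1

set.seed(123)                            # arbitrary seed, for reproducibility only
samples = rnorm(5, mean = 1, sd = 0.5)   # five random draws

dnorm(0)  # density of the standard normal at 0, equal to 1 / sqrt(2 * pi)
```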

maps & pipes (tidyverse)

mapping

data = tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) )
map_dbl(data, mean)

##     IQ     RT 
## 113.75  75.75

piping

tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) ) %>% 
  map_dbl(mean)

##     IQ     RT 
## 113.75  75.75
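
for comparison, a rough base R counterpart, not the tidyverse approach used in these slides: sapply() plays the role of map_dbl(), and the native pipe |> (an assumption: it needs R 4.1 or later) plays the role of %>%. the names nested and piped are illustrative:

```r
data = data.frame(IQ = c(100, 110, 120, 125),
                  RT = c(67, 58, 98, 80))

nested = sapply(data, mean)    # apply mean to every column
piped  = data |> sapply(mean)  # same call, written as a pipe (R >= 4.1)

nested  # IQ 113.75, RT 75.75
```

the pipe simply passes its left-hand side as the first argument of the function call on its right.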

functions

named custom functions

crazy.operation = function(x,y) {
  x+y
}
crazy.operation(2,3)

## [1] 5

anonymous functions

tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) ) %>% 
  map_dbl(function(i) {max(i)-min(i)})

## IQ RT 
## 25 40
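
functions can also declare default values for their arguments. a small sketch; scaled.range and its scale argument are hypothetical, chosen only to extend the range computation above:

```r
scaled.range = function(v, scale = 1) {
  # the value of the last expression is returned
  (max(v) - min(v)) * scale
}

scaled.range(c(100, 110, 120, 125))          # default scale: 25
scaled.range(c(67, 58, 98, 80), scale = 2)   # explicit scale: 80
```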

tidy (rectangular) data

types of data

data from experimental (psych) studies is usually rectangular data

examples of (usually) non-rectangular data:

image data
sound data
video data
corpora
…

the tidyverse is particularly efficient for dealing with tidy rectangular data

rectangular data

library(nycflights13)
nycflights13::flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

study Chapters 5 and 12 from R for Data Science

tidy data

each variable is a column
each observation is a row
each value is a cell

tidy data

untidy data 1

this is untidy if we want to analyze/plot grade as a function of exam type

grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))
grades

## # A tibble: 3 x 3
##   name    midterm final
##   <chr>     <dbl> <dbl>
## 1 Michael     3.7   4  
## 2 Noa         1     1.3
## 3 MadEye      1.3   1

to tidy up, we need to gather columns which are not separate variables into a new column

grades %>% gather('midterm', 'final', 
                  key = 'exam', value = 'grade')

## # A tibble: 6 x 3
##   name    exam    grade
##   <chr>   <chr>   <dbl>
## 1 Michael midterm   3.7
## 2 Noa     midterm   1  
## 3 MadEye  midterm   1.3
## 4 Michael final     4  
## 5 Noa     final     1.3
## 6 MadEye  final     1
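
newer versions of tidyr offer pivot_longer() for the same reshaping. a sketch, assuming tidyr 1.0.0 or later is installed; grades.long is an illustrative name:

```r
library(tibble)
library(tidyr)  # assumption: tidyr >= 1.0.0, which provides pivot_longer()

grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))

# collect the two exam columns into a key column 'exam' and a value column 'grade'
grades.long = pivot_longer(grades, cols = c(midterm, final),
                           names_to = 'exam', values_to = 'grade')
grades.long
```

the result holds the same six observations, but ordered person by person rather than exam by exam.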

untidy data 2

this is untidy if we want to analyze grade as a function of participation

results = tibble(name = c('Michael', 'Noa', 'MadEye', 
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'), 
                            each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))
results

## # A tibble: 6 x 3
##   name    what          howmuch
##   <chr>   <chr>           <dbl>
## 1 Michael grade             3.7
## 2 Noa     grade             1  
## 3 MadEye  grade             1  
## 4 Michael participation    55  
## 5 Noa     participation   100  
## 6 MadEye  participation   100

to tidy up, we need to spread cells from a row out over several columns

results %>% spread(key = 'what', value = 'howmuch')

## # A tibble: 3 x 3
##   name    grade participation
##   <chr>   <dbl>         <dbl>
## 1 MadEye    1             100
## 2 Michael   3.7            55
## 3 Noa       1             100
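
the counterpart in newer versions of tidyr is pivot_wider(). a sketch, again assuming tidyr 1.0.0 or later; results.wide is an illustrative name:

```r
library(tibble)
library(tidyr)  # assumption: tidyr >= 1.0.0, which provides pivot_wider()

results = tibble(name = c('Michael', 'Noa', 'MadEye',
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'), each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))

# one new column per distinct entry of 'what', filled with values from 'howmuch'
results.wide = pivot_wider(results, names_from = what, values_from = howmuch)
results.wide
```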

case study

truth-value judgement task

binary

test your intuitions

“Some of the circles are black.”

0balls 1balls 2balls 3balls

4balls 1balls 2balls 3balls

4balls 4balls 4balls

rating scale task

design

replication/extension of previous work
- van Tiel & Geurts (2014), van Tiel (2014), Degen & Tanenhaus (2015)
4 experimental variants:
- binary truth-value judgements vs. 7-point rating scale
- include filler sentences with $\textit{many}$ and $\textit{most}$ or not
participants recruited via Amazon’s Mechanical Turk
- each subject rated 3 sentences with some
- pseudo-randomized order; fully randomized visual displays

dummy

expTable

data wrangling

read data

d = readr::read_csv('data/00_typicality_some.csv') # from package 'readr'

## Parsed with column specification:
## cols(
##   id = col_integer(),
##   language = col_character(),
##   rt = col_integer(),
##   type = col_character(),
##   response = col_integer(),
##   nr_black = col_integer(),
##   variant = col_character(),
##   comments = col_character()
## )

inspect data

## # A tibble: 5,112 x 8
##       id language    rt type   response nr_black variant comments
##    <int> <chr>    <int> <chr>     <int>    <int> <chr>   <chr>   
##  1     1 English   3930 filler        1        5 C       No      
##  2     1 English   3108 most          0        5 C       No      
##  3     1 English   2599 filler        1        8 C       No      
##  4     1 English   4405 many          1        7 C       No      
##  5     1 English   2574 some          1        6 C       No      
##  6     1 English   1917 filler        1        3 C       No      
##  7     1 English   2471 filler        0        3 C       No      
##  8     1 English   2495 many          0        6 C       No      
##  9     1 English   2093 some          1        9 C       No      
## 10     1 English   1767 filler        0        2 C       No      
## # ... with 5,102 more rows

any comments?

levels(factor(d$comments))[1:20]

##  [1] "bonuses always help with a toddler in the home ;)"                                                                         
##  [2] "Cheers."                                                                                                                   
##  [3] "cool!"                                                                                                                     
##  [4] "Easy HIT thanks!"                                                                                                          
##  [5] "Everything worked fine, thanks"                                                                                            
##  [6] "fun"                                                                                                                       
##  [7] "Fun and interactive. Thank you!"                                                                                           
##  [8] "fun fun fun"                                                                                                               
##  [9] "Fun study"                                                                                                                 
## [10] "Fun study, thanks"                                                                                                         
## [11] "fun survey"                                                                                                                
## [12] "Fun, thanks!"                                                                                                              
## [13] "good hit"                                                                                                                  
## [14] "Good luck with your research!"                                                                                             
## [15] "great hit"                                                                                                                 
## [16] "Great hit, good luck with your research."                                                                                  
## [17] "Great HIT!"                                                                                                                
## [18] "Great survey. Thank you!"                                                                                                  
## [19] "Hi"                                                                                                                        
## [20] "I accidentally clicked \"false\" on one of the \"some are black\" statements.  It was the one where around half were black"

self-reported native languages

table(d$language)

## 
## American English          Egnlish          Enblish         Englashi 
##                8               13                8                8 
##          english          English          ENGLISH          englsih 
##             2410             2457              103               26 
##          englush         Enlglish           FRENCH         Japanese 
##               13                8               13                8 
##          Russian          Spanish            Tamil            white 
##                8                8                8               13

filter non-native speakers of Enblush

d = dplyr::filter(d, ! language %in% c("FRENCH", "Japanese", "Russian", "Spanish", "Tamil", "white"))
table(d$language)

## 
## American English          Egnlish          Enblish         Englashi 
##                8               13                8                8 
##          english          English          ENGLISH          englsih 
##             2410             2457              103               26 
##          englush         Enlglish 
##               13                8
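
the negated membership test used in the filter can be inspected on a toy vector. a minimal sketch; language, excluded and keep are made-up stand-ins, not the real data:

```r
language = c("English", "FRENCH", "english", "Tamil", "Egnlish")
excluded = c("FRENCH", "Tamil")

# %in% binds more tightly than !, so this reads as !(language %in% excluded)
keep = !language %in% excluded

keep            # TRUE FALSE TRUE FALSE TRUE
language[keep]  # "English" "english" "Egnlish"
```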

select relevant columns & rows

d = d %>% dplyr::filter(type == "some") %>% 
          dplyr::select(-language, -comments, -type)
d

## # A tibble: 1,449 x 5
##       id    rt response nr_black variant
##    <int> <int>    <int>    <int> <chr>  
##  1     1  2574        1        6 C      
##  2     1  2093        1        9 C      
##  3     1  2543        1        3 C      
##  4     2  1857        4        5 B      
##  5     2 11454        4       10 B      
##  6     2  2053        4        3 B      
##  7     3  1479        1       10 D      
##  8     3  1640        0        0 D      
##  9     3  1199        1        7 D      
## 10     4  4828        6        6 B      
## # ... with 1,439 more rows

more intelligible column names

d = d %>% dplyr::rename(condition = nr_black)
d

## # A tibble: 1,449 x 5
##       id    rt response condition variant
##    <int> <int>    <int>     <int> <chr>  
##  1     1  2574        1         6 C      
##  2     1  2093        1         9 C      
##  3     1  2543        1         3 C      
##  4     2  1857        4         5 B      
##  5     2 11454        4        10 B      
##  6     2  2053        4         3 B      
##  7     3  1479        1        10 D      
##  8     3  1640        0         0 D      
##  9     3  1199        1         7 D      
## 10     4  4828        6         6 B      
## # ... with 1,439 more rows

adding columns

d = d %>% dplyr::mutate(dependent.measure = ifelse(variant %in% c("A", "B"), "ordinal", "binary"),
                        alternatives = factor(ifelse(variant %in% c("A", "C"), "present", "absent"))) %>% 
          dplyr::select(- variant)
d

## # A tibble: 1,449 x 6
##       id    rt response condition dependent.measure alternatives
##    <int> <int>    <int>     <int> <chr>             <fct>       
##  1     1  2574        1         6 binary            present     
##  2     1  2093        1         9 binary            present     
##  3     1  2543        1         3 binary            present     
##  4     2  1857        4         5 ordinal           absent      
##  5     2 11454        4        10 ordinal           absent      
##  6     2  2053        4         3 ordinal           absent      
##  7     3  1479        1        10 binary            absent      
##  8     3  1640        0         0 binary            absent      
##  9     3  1199        1         7 binary            absent      
## 10     4  4828        6         6 ordinal           absent      
## # ... with 1,439 more rows

rescale responses

d = d %>% mutate(response = purrr::map2_dbl(dependent.measure, response, 
                                            function(x,y) { ifelse(x == "ordinal", (y-1)/6, y) } ))
d

## # A tibble: 1,449 x 6
##       id    rt response condition dependent.measure alternatives
##    <int> <int>    <dbl>     <int> <chr>             <fct>       
##  1     1  2574    1             6 binary            present     
##  2     1  2093    1             9 binary            present     
##  3     1  2543    1             3 binary            present     
##  4     2  1857    0.5           5 ordinal           absent      
##  5     2 11454    0.5          10 ordinal           absent      
##  6     2  2053    0.5           3 ordinal           absent      
##  7     3  1479    1            10 binary            absent      
##  8     3  1640    0             0 binary            absent      
##  9     3  1199    1             7 binary            absent      
## 10     4  4828    0.833         6 ordinal           absent      
## # ... with 1,439 more rows
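
the rescaling maps the 7-point ratings 1, ..., 7 linearly onto [0, 1] and leaves binary responses untouched. since ifelse() is already vectorized, the same mapping can be sketched without map2_dbl(); the toy vectors below are made up for illustration:

```r
dependent.measure = c("binary", "ordinal", "ordinal", "binary", "ordinal")
response          = c(1, 4, 7, 0, 1)

# (y - 1) / 6 sends rating 1 to 0, rating 4 to 0.5 and rating 7 to 1
rescaled = ifelse(dependent.measure == "ordinal", (response - 1) / 6, response)

rescaled  # 1.0 0.5 1.0 0.0 0.0
```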

get mean responses for dependent measures

d %>% dplyr::group_by(dependent.measure) %>% 
      dplyr::summarize(mean.response = mean(response))

## # A tibble: 2 x 2
##   dependent.measure mean.response
##   <chr>                     <dbl>
## 1 binary                    0.785
## 2 ordinal                   0.600

get mean responses

resp.summary = d %>% dplyr::group_by(dependent.measure, alternatives, condition) %>% 
                     dplyr::summarize(mean.response = mean(response))
resp.summary

## # A tibble: 44 x 4
## # Groups:   dependent.measure, alternatives [?]
##    dependent.measure alternatives condition mean.response
##    <chr>             <fct>            <int>         <dbl>
##  1 binary            absent               0        0.0909
##  2 binary            absent               1        0.478 
##  3 binary            absent               2        0.778 
##  4 binary            absent               3        0.958 
##  5 binary            absent               4        0.964 
##  6 binary            absent               5        1     
##  7 binary            absent               6        0.938 
##  8 binary            absent               7        0.98  
##  9 binary            absent               8        0.929 
## 10 binary            absent               9        0.967 
## # ... with 34 more rows
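
grouping and summarizing can be checked by hand on a toy data set. a minimal sketch with made-up numbers; it also shows that several summaries, here mean RT and mean response, can be computed in one call:

```r
library(dplyr)

toy = data.frame(dependent.measure = c("binary", "binary", "ordinal", "ordinal"),
                 rt = c(2000, 3000, 1500, 2500),
                 response = c(1, 0, 0.5, 1))

# one output row per group, one column per summary
toy.summary = toy %>%
  group_by(dependent.measure) %>%
  summarize(mean.rt = mean(rt), mean.response = mean(response))

toy.summary  # binary: 2500 and 0.5; ordinal: 2000 and 0.75
```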

data visualization

a naked plot

ggplot()

plotting mean responses

ggplot(data = resp.summary, aes(x = condition, y = mean.response)) +
  geom_point()

plotting mean responses per treatment

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point()

plotting mean responses per treatment & dependent measure

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point() + geom_line() + facet_grid( . ~ dependent.measure)

some cosmetics

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point() + geom_line() + facet_grid( . ~ dependent.measure) + 
  xlab("number of black balls") + ylab("mean response") +
  scale_x_continuous(breaks = 0:10) + scale_color_manual(values = c("darkgrey", "firebrick"))

Rmarkdown

why Rmarkdown

prepare, analyze & plot data right inside your document
hand over all of your work in one single, easily executable chunk
- support reproducible and open research
export to a variety of different formats

Rmarkdown formats

flow of information

Rmarkdown info flow

Rmarkdown formats

markdown

headers & sections

# header 1
## header 2
### header 3

emphasis, highlighting etc.

*italics* or _italics_
**bold** or __bold__
~~strikeout~~

links

[link](https://www.google.com)

inline code & code blocks

`function(x) return(x - 1)`

cheat sheet

Rmarkdown

extension of markdown to dynamically integrate R output

multiple output formats:

HTML pages, HTML slides (here), …
PDF, LaTeX, Word, …

cheat sheet and a quick tour

supports LaTeX

inline equations with $\theta$

equation blocks with

$$ \begin{align*} E &= mc^2 \\
& = \text{a really smart formula}
\end{align*} $$

caveat

LaTeX-style formulas will be rendered differently depending on the output method:

PDF-LaTeX gives you genuine LaTeX with (almost) all abilities
HTML output uses MathJax to emulate LaTeX-like behavior
- only LaTeX-packages & functionality emulated in JS will be available

Rmarkdown in your homework

do it all in one file BDACM_HW1-LastnameFirstname.Rmd

use a header that generates HTML files, like this:

---
title: "My flawless first homework set"
date: 2018-11-30
output: html_document
---

follow the instructions given in the first homework assignment

send the *.Rmd and the *.HTML as a *.zip

avoid using extra material not included in the *.Rmd

fini

homework

dive into R for Data Science

complete the first homework assignment by Tuesday, November 6, 09:59 CET
- follow the instructions given in the assignment sheet!
