road map for today

 

  • get in touch with R for data wrangling & plotting
    • think about data & its format
    • manipulate data into appropriate format: data wrangling
    • extract data summaries
    • choose aspects to visualize data

 

  • look at data set from case study on quantifier some

R for data science

data science?


R


 

special-purpose programming language for data science & statistical computing

  • statistics
  • data mining
  • data visualization

 

authority says to tell you: do not think of R as a programming language!

R

 

is a trusted old friend

 

a lot of innovation and development takes place in packages

 

  • install packages (only once)
install.packages('tidyverse')

 

  • load packages (for every session)
library(tidyverse)
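
a minimal sketch of the install-once / load-every-session pattern, assuming you want a script that installs a package only when it is missing. the helper name `ensure_package` is hypothetical; `requireNamespace()` and `install.packages()` are base R. the example uses the package 'stats' only because it ships with every R installation.

```r
# hypothetical helper: install a package only if it is not yet available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs internet access & a configured CRAN mirror
  }
  # report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("stats")     # 'stats' ships with R, so nothing gets installed here
library(stats)              # loading still has to happen in every session
```

replacing "stats" by 'tidyverse' would give the two steps from this slide in one go.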

base R & package functions

 

base R functionality is always available

rnorm(n = 5, mean = 10) # 5 samples from a normal with mean 10 & std. dev. 1 (default)
## [1] 10.611982 10.138658 10.025993  9.268167  9.181465

 

packages bring extra functions

library(mvtnorm)
mvtnorm::rmvnorm(n = 5, mean = rep(10,5)) # 5 samples from a multivariate normal
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]  8.848755  9.459086  9.212209 11.025298 10.782896
## [2,] 11.004074  8.563577  7.770695 11.474569  9.209217
## [3,] 11.629314 10.167166 10.143578  9.107262  9.785214
## [4,] 11.871819 10.113441 10.280624  8.931763 12.570017
## [5,]  9.172942  8.995465  8.879740 10.580737  9.776681
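
the `package::function` notation used above also works for functions from base packages. a small sketch, assuming you want reproducible random draws; the seed value 123 is arbitrary.

```r
set.seed(123)                          # fix the state of the random number generator
x1 <- stats::rnorm(n = 5, mean = 10)   # explicit namespace: rnorm lives in 'stats'

set.seed(123)                          # same seed ...
x2 <- rnorm(n = 5, mean = 10)          # ... same draws, with or without 'stats::'

identical(x1, x2)                      # TRUE
```

writing the package name explicitly costs a few keystrokes, but shows readers where a function comes from.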

help

 

help('rmvnorm')
Mvnorm {mvtnorm}    R Documentation
Multivariate Normal Density and Random Deviates

Description

These functions provide the density function and a random number generator
for the multivariate normal distribution with mean equal to mean and 
covariance matrix sigma.

Usage

dmvnorm(x, mean = rep(0, p), sigma = diag(p), log = FALSE)
rmvnorm(n, mean = rep(0, nrow(sigma)), sigma = diag(length(mean)),
    method=c("eigen", "svd", "chol"), pre0.9_9994 = FALSE)
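
besides the full help page, a function's arguments and defaults can be inspected directly. a short sketch using base R's `rnorm`; the variable names are only illustrative.

```r
# ?rnorm is shorthand for help('rnorm'); both open the documentation page

args(stats::rnorm)                            # prints the signature with default values

arg_names  <- names(formals(stats::rnorm))    # argument names as a character vector
default_sd <- formals(stats::rnorm)$sd        # default value of the argument 'sd'

arg_names
default_sd
```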

RStudio

integrated development environment for R

 

RStudio screenshot

data

types of data

 

this course will focus (entirely?) on rectangular data

 

not covered:

  • image data
  • sound data
  • video data
  • corpora

rectangular data

library(nycflights13)
nycflights13::flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
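
in rectangular data, each row is an observation and each column is a variable of a single type. a tiny hand-made sketch, assuming the package 'tibble' (part of the tidyverse) is installed; the values are made up for illustration.

```r
library(tibble)

toy <- tibble(
  id       = c(1L, 1L, 2L),               # integer column
  type     = c("some", "many", "some"),   # character column
  response = c(1, 0, 1)                   # double column
)

dim(toy)              # number of rows & number of columns
sapply(toy, class)    # one type per column
```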

study Chapters 5 and 12 from R for Data Science

case study

truth-value judgement task

binary

test your intuitions

"Some of the circles are black."

example displays (images): 0balls 1balls 2balls 3balls 4balls

rating scale task

ordinal

design

  • replication/extension of previous work
    • van Tiel & Geurts (2014), van Tiel (2014), Degen & Tanenhaus (2015)
  • 4 experimental variants:
    • binary truth-value judgements vs. 7-point rating scale
    • include filler sentences with "many" and "most" or not
  • participants recruited via Amazon's Mechanical Turk
    • each subject rated 3 sentences with some
    • pseudo-randomized order; fully randomized visual displays

dummy

expTable

data wrangling

read data

d = readr::read_csv('../data/00_typicality_some.csv') # from package 'readr'
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   language = col_character(),
##   rt = col_integer(),
##   type = col_character(),
##   response = col_integer(),
##   nr_black = col_integer(),
##   variant = col_character(),
##   comments = col_character()
## )
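
the message above lists the column types that `read_csv` guessed. a sketch of stating the types explicitly instead, assuming a tiny made-up CSV written to a temporary file so that the example does not depend on the course data.

```r
library(readr)

# write a tiny made-up CSV file, so that the sketch is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,language,rt",
             "1,English,3930",
             "2,english,1857"), tmp)

# declare the expected type of each column instead of relying on guessing
toy <- read_csv(tmp, col_types = cols(
  id       = col_integer(),
  language = col_character(),
  rt       = col_integer()
))
toy
```

with explicit types, a file that does not match the expectation produces parsing problems rather than silently different column types.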

inspect data

d
## # A tibble: 5,112 × 8
##       id language    rt   type response nr_black variant comments
##    <int>    <chr> <int>  <chr>    <int>    <int>   <chr>    <chr>
## 1      1  English  3930 filler        1        5       C       No
## 2      1  English  3108   most        0        5       C       No
## 3      1  English  2599 filler        1        8       C       No
## 4      1  English  4405   many        1        7       C       No
## 5      1  English  2574   some        1        6       C       No
## 6      1  English  1917 filler        1        3       C       No
## 7      1  English  2471 filler        0        3       C       No
## 8      1  English  2495   many        0        6       C       No
## 9      1  English  2093   some        1        9       C       No
## 10     1  English  1767 filler        0        2       C       No
## # ... with 5,102 more rows

any comments?

levels(factor(d$comments))[1:20]
##  [1] "bonuses always help with a toddler in the home ;)"                                                                         
##  [2] "Cheers."                                                                                                                   
##  [3] "cool!"                                                                                                                     
##  [4] "Easy HIT thanks!"                                                                                                          
##  [5] "Everything worked fine, thanks"                                                                                            
##  [6] "fun"                                                                                                                       
##  [7] "Fun and interactive. Thank you!"                                                                                           
##  [8] "fun fun fun"                                                                                                               
##  [9] "Fun study"                                                                                                                 
## [10] "Fun study, thanks"                                                                                                         
## [11] "fun survey"                                                                                                                
## [12] "Fun, thanks!"                                                                                                              
## [13] "good hit"                                                                                                                  
## [14] "Good luck with your research!"                                                                                             
## [15] "great hit"                                                                                                                 
## [16] "Great hit, good luck with your research."                                                                                  
## [17] "Great HIT!"                                                                                                                
## [18] "Great survey. Thank you!"                                                                                                  
## [19] "Hi"                                                                                                                        
## [20] "I accidentally clicked \"false\" on one of the \"some are black\" statements.  It was the one where around half were black"

self-reported native languages

table(d$language)
## 
## American English          Egnlish          Enblish         Englashi 
##                8               13                8                8 
##          english          English          ENGLISH          englsih 
##             2410             2457              103               26 
##          englush         Enlglish           FRENCH         Japanese 
##               13                8               13                8 
##          Russian          Spanish            Tamil            white 
##                8                8                8               13

filter non-native speakers of Enblush

d = dplyr::filter(d, ! language %in% c("FRENCH", "Japanese", "Russian", "Spanish", "Tamil", "white"))
table(d$language)
## 
## American English          Egnlish          Enblish         Englashi 
##                8               13                8                8 
##          english          English          ENGLISH          englsih 
##             2410             2457              103               26 
##          englush         Enlglish 
##               13                8
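
the filter above removes a list of known non-English entries. an alternative sketch, assuming you would rather keep only entries that match a list of accepted spellings after normalization; the toy data and the vector `accepted` are hypothetical.

```r
library(dplyr)

toy <- tibble::tibble(language = c("English", "ENGLISH ", "englsih", "FRENCH", "white"))

accepted <- c("english", "american english", "englsih")    # hypothetical list

toy_native <- toy %>%
  mutate(language_clean = tolower(trimws(language))) %>%   # ignore case & stray spaces
  filter(language_clean %in% accepted)                     # keep accepted spellings only

toy_native$language_clean
```

keeping a list of accepted entries means that unforeseen answers are excluded by default; removing a list of rejected entries means they are included by default.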

select relevant columns & rows

d = d %>% dplyr::filter(type == "some") %>% 
          dplyr::select(-language, -comments, -type)
d
## # A tibble: 1,449 × 5
##       id    rt response nr_black variant
##    <int> <int>    <int>    <int>   <chr>
## 1      1  2574        1        6       C
## 2      1  2093        1        9       C
## 3      1  2543        1        3       C
## 4      2  1857        4        5       B
## 5      2 11454        4       10       B
## 6      2  2053        4        3       B
## 7      3  1479        1       10       D
## 8      3  1640        0        0       D
## 9      3  1199        1        7       D
## 10     4  4828        6        6       B
## # ... with 1,439 more rows
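
the pipe `%>%` passes the result on its left as the first argument to the function on its right. a minimal sketch with a made-up numeric vector, assuming the package 'magrittr' (loaded with the tidyverse) is installed.

```r
library(magrittr)

x <- c(4, 9, 16)

nested <- round(mean(sqrt(x)), 1)                # nested calls: read inside-out
piped  <- x %>% sqrt() %>% mean() %>% round(1)   # piped calls: read left-to-right

identical(nested, piped)                         # TRUE
```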

more intelligible column names

d = d %>% dplyr::rename(condition = nr_black)
d
## # A tibble: 1,449 × 5
##       id    rt response condition variant
##    <int> <int>    <int>     <int>   <chr>
## 1      1  2574        1         6       C
## 2      1  2093        1         9       C
## 3      1  2543        1         3       C
## 4      2  1857        4         5       B
## 5      2 11454        4        10       B
## 6      2  2053        4         3       B
## 7      3  1479        1        10       D
## 8      3  1640        0         0       D
## 9      3  1199        1         7       D
## 10     4  4828        6         6       B
## # ... with 1,439 more rows

adding columns

d = d %>% dplyr::mutate(dependent.measure = ifelse(variant %in% c("A", "B"), "ordinal", "binary"),
                        alternatives = factor(ifelse(variant %in% c("A", "C"), "present", "absent"))) %>% 
          dplyr::select(- variant)
d
## # A tibble: 1,449 × 6
##       id    rt response condition dependent.measure alternatives
##    <int> <int>    <int>     <int>             <chr>       <fctr>
## 1      1  2574        1         6            binary      present
## 2      1  2093        1         9            binary      present
## 3      1  2543        1         3            binary      present
## 4      2  1857        4         5           ordinal       absent
## 5      2 11454        4        10           ordinal       absent
## 6      2  2053        4         3           ordinal       absent
## 7      3  1479        1        10            binary       absent
## 8      3  1640        0         0            binary       absent
## 9      3  1199        1         7            binary       absent
## 10     4  4828        6         6           ordinal       absent
## # ... with 1,439 more rows
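
the two `ifelse` calls above recode the four variants into a 2 × 2 design. a sketch that spells out this mapping on a toy table with one row per variant, using the same conditions as in the code above.

```r
library(dplyr)

design <- tibble::tibble(variant = c("A", "B", "C", "D")) %>%
  mutate(dependent.measure = ifelse(variant %in% c("A", "B"), "ordinal", "binary"),
         alternatives      = ifelse(variant %in% c("A", "C"), "present", "absent"))

design   # one row per variant, one column per design factor
```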

rescale responses

d = d %>% mutate(response = purrr::map2_dbl(dependent.measure, response, 
                                            function(x,y) { ifelse(x == "ordinal", (y-1)/6, y) } ))
d
## # A tibble: 1,449 × 6
##       id    rt  response condition dependent.measure alternatives
##    <int> <int>     <dbl>     <int>             <chr>       <fctr>
## 1      1  2574 1.0000000         6            binary      present
## 2      1  2093 1.0000000         9            binary      present
## 3      1  2543 1.0000000         3            binary      present
## 4      2  1857 0.5000000         5           ordinal       absent
## 5      2 11454 0.5000000        10           ordinal       absent
## 6      2  2053 0.5000000         3           ordinal       absent
## 7      3  1479 1.0000000        10            binary       absent
## 8      3  1640 0.0000000         0            binary       absent
## 9      3  1199 1.0000000         7            binary       absent
## 10     4  4828 0.8333333         6           ordinal       absent
## # ... with 1,439 more rows
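
the rescaling maps ratings 1, ..., 7 linearly onto the interval from 0 to 1 and leaves binary responses untouched. since `ifelse` is vectorized, the same computation can presumably be written without `purrr::map2_dbl`; a sketch on made-up data:

```r
library(dplyr)

toy <- tibble::tibble(
  dependent.measure = c("ordinal", "ordinal", "ordinal", "binary", "binary"),
  response          = c(1, 4, 7, 0, 1)
)

# (y - 1) / 6 sends 1 to 0, 4 to 0.5 and 7 to 1; binary responses stay as they are
toy <- toy %>%
  mutate(response = ifelse(dependent.measure == "ordinal", (response - 1) / 6, response))

toy$response
```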

get mean responses per dependent measure

d %>% dplyr::group_by(dependent.measure) %>% 
      dplyr::summarize(mean.response = mean(response))
## # A tibble: 2 × 2
##   dependent.measure mean.response
##               <chr>         <dbl>
## 1            binary     0.7854077
## 2           ordinal     0.6002222

get mean responses

resp.summary = d %>% dplyr::group_by(dependent.measure, alternatives, condition) %>% 
                     dplyr::summarize(mean.response = mean(response))
resp.summary
## Source: local data frame [44 x 4]
## Groups: dependent.measure, alternatives [?]
## 
##    dependent.measure alternatives condition mean.response
##                <chr>       <fctr>     <int>         <dbl>
## 1             binary       absent         0    0.09090909
## 2             binary       absent         1    0.47826087
## 3             binary       absent         2    0.77777778
## 4             binary       absent         3    0.95833333
## 5             binary       absent         4    0.96428571
## 6             binary       absent         5    1.00000000
## 7             binary       absent         6    0.93750000
## 8             binary       absent         7    0.98000000
## 9             binary       absent         8    0.92857143
## 10            binary       absent         9    0.96666667
## # ... with 34 more rows
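
a mean over a cell with few observations is less reliable than a mean over a cell with many. a sketch, on made-up data, of adding the number of observations per group with `dplyr::n()`:

```r
library(dplyr)

toy <- tibble::tibble(
  condition = c(0, 0, 5, 5, 5),
  response  = c(0, 1, 1, 1, 0.5)
)

toy_summary <- toy %>%
  group_by(condition) %>%
  summarize(mean.response = mean(response),   # average response per group
            n = n())                          # number of observations per group
toy_summary
```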

data visualization

a naked plot

ggplot()

plotting mean responses

ggplot(data = resp.summary, aes(x = condition, y = mean.response)) +
  geom_point()

plotting mean responses per treatment

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point()

plotting mean responses per treatment & dependent measure

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point() + geom_line() + facet_grid( . ~ dependent.measure)

some cosmetics

ggplot(data = resp.summary, aes(x = condition, y = mean.response, color = alternatives)) +
  geom_point() + geom_line() + facet_grid( . ~ dependent.measure) + 
  xlab("number of black balls") + ylab("mean response") +
  scale_x_continuous(breaks = 0:10) + scale_color_manual(values = c("darkgrey", "firebrick"))
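
a plot is an ordinary R object that grows layer by layer with `+`. a sketch on made-up data, assuming you want to store the plot in a variable (here the hypothetical name `p`) and possibly save it later; the file name in the commented-out `ggsave` call is only an example.

```r
library(ggplot2)

toy <- data.frame(condition = 0:3, mean.response = c(0.1, 0.5, 0.8, 0.9))

p <- ggplot(data = toy, aes(x = condition, y = mean.response)) +
  geom_point() +      # layer 1: one point per row
  geom_line()         # layer 2: a line connecting the points

length(p$layers)      # number of layers added so far

# ggsave("toy_plot.pdf", plot = p, width = 5, height = 3)   # uncomment to write a file
```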

towards Bayesian modeling

methodological puzzles

  • do binary and ordinal tasks measure the same thing?
    • one is about truth, the other about "goodness"
    • what does it even mean to measure something with a task?
  • is what either task measures influenced by presence/absence of alternatives?
    • what is the effect of additional fillers on judgements?
  • how would we answer these questions with standard statistical techniques?
    • is there a place for pragmatic theory in a statistical model?

custom-made data-generating model

modelGraph

results: PPCs for trained model

PPCs

fini

to prevent boredom

obligatory

 

optional

  • read more in R for Data Science
    • Chapters 5 and 12 on data representation and wrangling
  • explore reading times in today's data set