Using R

Michael Franke

topics for today

 

  • basics of R
  • tidyverse
  • tidy data
  • data wrangling
  • plotting
  • Rmarkdown

R for data science

R4DS

R4DS cover

freely available online: R for Data Science

data science?

data scientist

read more

what R is (not)

  • special-purpose programming language for statistical computing

    • statistics, data mining, data visualization
  • authorities advise against thinking of R as a programming language!

  • think of it as a tool optimized for creating scripts to manipulate, plot and analyze data

diagram from 'R for Data Science'

past & present

  • a trusted old friend from 1993
  • still thriving
    • see TIOBE ranking (based on search query results)

TIOBE index

extensibility & community support

a lot of innovation and development takes place in packages

go browse some 12,000 packages on CRAN

 

install packages (only once)

load packages (for every session)
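
a minimal sketch of the two steps, using the tidyverse as the example package; the installation check around `library()` is an illustrative addition, not something the slides prescribe:

```r
# install once per machine (needs an internet connection):
# install.packages("tidyverse")

# load at the start of every session; the check avoids an error
# if the package has not been installed yet
pkg <- "tidyverse"
is_installed <- requireNamespace(pkg, quietly = TRUE)
if (is_installed) {
  library(pkg, character.only = TRUE)
} else {
  message("package '", pkg, "' is missing; install it first")
}
```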

base R & package functions

base R functionality is always available

packages bring extra functions

tidyverse

 

overview of tidyverse

tidyverse website

RStudio

integrated development environment for R

RStudio screenshot

cheat sheet

basics of R

overview

  • basic properties of R
  • data types
    • numbers, vectors & matrices
    • characters & factors
    • lists, data.frames & tibbles
  • probability distributions
  • functional programming elements
  • functions

for all base R stuff, check the R manual

general remarks about R

  • free (GNU General Public License)
  • interpreted language
## [1] 42
  • vector/matrix based
## [1] 2 3 4
  • supports object-oriented, procedural & functional styles

  • convenient interfaces to other languages

  • assignment in both directions possible

## [1] TRUE
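
the original code chunks are not shown here; a small sketch that would produce the three outputs above (variable names are assumptions):

```r
# interpreted: expressions are evaluated as soon as they are entered
6 * 7

# vector based: arithmetic applies to every element at once
x <- c(1, 2, 3)
x + 1

# assignment works in both directions
y <- 3
3 -> z
y == z
```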

help

 

qplot {ggplot2} R Documentation
Quick plot

Description

qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.  
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.

Usage

qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
  main = NULL, xlab = deparse(substitute(x)),
  ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)
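
documentation pages like this one can be opened from the console; a short sketch using base R functions, so that it runs without ggplot2 loaded:

```r
# open the documentation page of a function (both forms are equivalent)
?mean
help("mean")

# inspect a function's arguments without opening the full page
sd_args <- args(sd)
sd_args
```

with ggplot2 loaded, `?qplot` would bring up the page excerpted above.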

numbers, vectors & matrices

  • standard number precision is double
## [1] "double"
  • vectors are declared using c()
## [1] 10 20 30
  • everything is a vector (possibly length 1)
## [1] 1 1
  • indexing starts at 1
## [1] 20
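
a sketch of calls that illustrate the four points above (the original chunks are not shown, so the exact expressions are assumptions):

```r
typeof(3)             # numbers are stored as doubles by default
x <- c(10, 20, 30)    # c() combines values into a vector
x
length(42)            # a single number is a vector of length 1
x[2]                  # the first element is x[1], so this is 20
```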

numbers, vectors & matrices (2)

  • column-major mode
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## [1] 1 3 5
  • vectors are column vectors
##      [,1]
## [1,]  220
## [2,]  280
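
one way to reproduce the outputs above; the matrix and vector are inferred from the printed results, the variable name is an assumption:

```r
m <- matrix(1:6, nrow = 2)        # values fill the matrix column by column
m
m[1, ]                            # first row: 1 3 5

# in a matrix product, a plain vector is treated as a column vector
product <- m %*% c(10, 20, 30)    # 2 x 1 result: 220 and 280
product
```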

character vectors and factors

  • strings are called characters
## [1] "character"
  • vector of characters
## [1] "huhu"  "hello" "huhu"  "ciao"
  • factors track levels
## [1] huhu  hello huhu  ciao 
## Levels: ciao hello huhu
  • ordered factors arrange their levels
## [1] huhu  hello huhu  ciao 
## Levels: huhu < ciao < hello
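
a sketch of how such character vectors and factors can be built (variable names are assumptions):

```r
typeof("huhu")                                  # strings have type "character"
greetings <- c("huhu", "hello", "huhu", "ciao")

# a factor stores the distinct values as levels, sorted alphabetically
f <- factor(greetings)
levels(f)

# an ordered factor takes its level order from the levels argument
o <- factor(greetings, ordered = TRUE, levels = c("huhu", "ciao", "hello"))
o
```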

lists & data frames

  • lists are key-value pairs
  • data frames as lists of same-length vectors
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156
  • access columns
## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3
  • access rows
##   trial condition response
## 3     3        C1      119
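
a sketch that rebuilds the data frame printed above; the list entries and all variable names are hypothetical:

```r
# a list maps names (keys) to values of arbitrary type
participant <- list(name = "P1", age = 25)   # hypothetical entries
participant$age

# a data frame is a list of equally long vectors, one per column
exp_data <- data.frame(
  trial     = 1:5,
  condition = factor(c("C1", "C2", "C1", "C3", "C2"), ordered = TRUE),
  response  = c(121, 133, 119, 102, 156)
)
exp_data$condition   # access a column by name
exp_data[3, ]        # access a row by index
```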

tibbles

  • tibbles are data frames in the tidyverse
## # A tibble: 5 x 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156
  • compare to previous data frame
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

   

   

  • some differences
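
the slide does not list the differences; two commonly cited ones, sketched with the tibble package (the choice of examples is an assumption):

```r
library(tibble)

df <- data.frame(trial = 1:5, response = c(121, 133, 119, 102, 156))
tb <- as_tibble(df)

# single-column subsetting: a data frame drops to a vector, a tibble stays a tibble
class(df[, "response"])
class(tb[, "response"])

# partial matching of column names: data frames allow it, tibbles do not
df$resp                    # silently returns the response column
suppressWarnings(tb$resp)  # NULL (a warning is raised without suppressWarnings)
```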

probability distributions in R

  • R has many built-in probability distributions
    • normal distribution
    • beta distribution
  • additional distributions supplied by packages
    • multi-variate normal
    • Dirichlet
  • each distribution mydist is associated with four functions:
    1. dmydist(x, ...) gives the probability (mass/density) \(f(x)\) for x
    2. pmydist(x, ...) gives the cumulative distribution function \(F(x)\) for x
    3. qmydist(p, ...) gives the value \(x\) for which p = pmydist(x, ...)
    4. rmydist(n, ...) returns n samples from the distribution

example
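
the slide's own example is not shown; a minimal sketch of the four functions for the normal distribution (seed and parameter values are arbitrary):

```r
dnorm(0)              # density f(0) of the standard normal
pnorm(0)              # cumulative probability F(0)
qnorm(0.5)            # quantile function, the inverse of pnorm
qnorm(pnorm(1.3))     # round trip returns (approximately) 1.3

set.seed(2018)        # arbitrary seed, for reproducibility only
samples <- rnorm(1000, mean = 100, sd = 15)   # 1000 draws from N(100, 15)
mean(samples)
```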

maps & pipes (tidyverse)

  • mapping
##     IQ     RT 
## 113.75  75.75
  • piping
##     IQ     RT 
## 113.75  75.75
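
a sketch with purrr and magrittr; the data values are hypothetical, chosen only so that the column means match the output above:

```r
library(purrr)      # map_dbl()
library(magrittr)   # the pipe %>%

# hypothetical scores (not the original data)
scores <- data.frame(IQ = c(100, 110, 120, 125),
                     RT = c(60, 70, 73, 100))

# mapping: apply mean() to every column, collect the results in a numeric vector
mapped <- map_dbl(scores, mean)

# piping: the left-hand side becomes the first argument of the next call
piped <- scores %>% map_dbl(mean)
piped
```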

functions

  • named custom functions
## [1] 5
  • anonymous functions
## IQ RT 
## 25 40
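
a sketch of both kinds of function; the function name, the data values and the choice of the range as summary are assumptions that happen to reproduce the outputs above:

```r
# named custom function
add <- function(x, y) {
  x + y
}
add(2, 3)

# hypothetical scores (not the original data)
scores <- data.frame(IQ = c(100, 110, 120, 125),
                     RT = c(60, 70, 73, 100))

# anonymous function: defined on the spot, here the range of each column
ranges <- sapply(scores, function(x) max(x) - min(x))
ranges
```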

tidy (rectangular) data

types of data

 

data from experimental (psych) studies is usually rectangular data

 

examples of (usually) non-rectangular data:

  • image data
  • sound data
  • video data
  • corpora

 

the tidyverse is particularly efficient for dealing with tidy rectangular data

rectangular data

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

study Chapters 5 and 12 from R for Data Science

tidy data

  1. each variable is a column
  2. each observation is a row
  3. each value is a cell

 

tidy data

untidy data 1

this is untidy if we want to analyze/plot grade as a function of exam type

## # A tibble: 3 x 3
##   name    midterm final
##   <chr>     <dbl> <dbl>
## 1 Michael     3.7   4  
## 2 Noa         1     1.3
## 3 MadEye      1.3   1

to tidy up, we need to gather columns which are not separate variables into a new column

## # A tibble: 6 x 3
##   name    exam    grade
##   <chr>   <chr>   <dbl>
## 1 Michael midterm   3.7
## 2 Noa     midterm   1  
## 3 MadEye  midterm   1.3
## 4 Michael final     4  
## 5 Noa     final     1.3
## 6 MadEye  final     1
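
a sketch of this step with `gather()` from tidyr (variable names are assumptions; the values are those of the table above):

```r
library(tibble)
library(tidyr)

exam_results <- tribble(
  ~name,     ~midterm, ~final,
  "Michael", 3.7,      4,
  "Noa",     1,        1.3,
  "MadEye",  1.3,      1
)

# collect the midterm and final columns into key-value pairs:
# the old column names go to `exam`, the cell values to `grade`
tidy_results <- gather(exam_results, key = "exam", value = "grade", midterm, final)
tidy_results
```

newer tidyr versions offer `pivot_longer()` for the same purpose.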

untidy data 2

this is untidy if we want to analyze grade as a function of participation

## # A tibble: 6 x 3
##   name    what          howmuch
##   <chr>   <chr>           <dbl>
## 1 Michael grade             3.7
## 2 Noa     grade             1  
## 3 MadEye  grade             1  
## 4 Michael participation    55  
## 5 Noa     participation   100  
## 6 MadEye  participation   100

to tidy up, we need to spread cells from a row out over several columns

## # A tibble: 3 x 3
##   name    grade participation
##   <chr>   <dbl>         <dbl>
## 1 MadEye    1             100
## 2 Michael   3.7            55
## 3 Noa       1             100
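
a sketch of this step with `spread()` from tidyr (variable names are assumptions; the values are those of the table above):

```r
library(tibble)
library(tidyr)

records <- tribble(
  ~name,     ~what,           ~howmuch,
  "Michael", "grade",         3.7,
  "Noa",     "grade",         1,
  "MadEye",  "grade",         1,
  "Michael", "participation", 55,
  "Noa",     "participation", 100,
  "MadEye",  "participation", 100
)

# each distinct value of `what` becomes a column, filled with `howmuch`
wide_records <- spread(records, key = what, value = howmuch)
wide_records
```

newer tidyr versions offer `pivot_wider()` for the same purpose.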

case study

truth-value judgement task

binary

test your intuitions

“Some of the circles are black.”

0balls 1balls 2balls 3balls 4balls

rating scale task

ordinal

design

  • replication/extension of previous work
    • van Tiel & Geurts (2014), van Tiel (2014), Degen & Tanenhaus (2015)
  • 4 experimental variants:
    • binary truth-value judgements vs. 7-point rating scale
    • include filler sentences with \(\textit{many}\) and \(\textit{most}\) or not
  • participants recruited via Amazon’s Mechanical Turk
    • each subject rated 3 sentences with some
    • pseudo-randomized order; fully randomized visual displays

dummy

expTable

data wrangling

read data

## Parsed with column specification:
## cols(
##   id = col_integer(),
##   language = col_character(),
##   rt = col_integer(),
##   type = col_character(),
##   response = col_integer(),
##   nr_black = col_integer(),
##   variant = col_character(),
##   comments = col_character()
## )
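
the name and location of the data file are not given here; this sketch writes two of the rows shown in the data to a temporary CSV file as a stand-in and reads it with `read_csv()` from readr:

```r
library(readr)

# stand-in for the real data file (hypothetical path)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,language,rt,type,response,nr_black,variant,comments",
  "1,English,3930,filler,1,5,C,No",
  "1,English,3108,most,0,5,C,No"
), csv_path)

# read_csv() guesses a type for every column and returns a tibble
d <- read_csv(csv_path)
d
```

recent readr versions guess `double` rather than `integer` for whole numbers, so the reported column specification may differ from the one above.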

inspect data

## # A tibble: 5,112 x 8
##       id language    rt type   response nr_black variant comments
##    <int> <chr>    <int> <chr>     <int>    <int> <chr>   <chr>   
##  1     1 English   3930 filler        1        5 C       No      
##  2     1 English   3108 most          0        5 C       No      
##  3     1 English   2599 filler        1        8 C       No      
##  4     1 English   4405 many          1        7 C       No      
##  5     1 English   2574 some          1        6 C       No      
##  6     1 English   1917 filler        1        3 C       No      
##  7     1 English   2471 filler        0        3 C       No      
##  8     1 English   2495 many          0        6 C       No      
##  9     1 English   2093 some          1        9 C       No      
## 10     1 English   1767 filler        0        2 C       No      
## # ... with 5,102 more rows

any comments?

##  [1] "bonuses always help with a toddler in the home ;)"                                                                         
##  [2] "Cheers."                                                                                                                   
##  [3] "cool!"                                                                                                                     
##  [4] "Easy HIT thanks!"                                                                                                          
##  [5] "Everything worked fine, thanks"                                                                                            
##  [6] "fun"                                                                                                                       
##  [7] "Fun and interactive. Thank you!"                                                                                           
##  [8] "fun fun fun"                                                                                                               
##  [9] "Fun study"                                                                                                                 
## [10] "Fun study, thanks"                                                                                                         
## [11] "fun survey"                                                                                                                
## [12] "Fun, thanks!"                                                                                                              
## [13] "good hit"                                                                                                                  
## [14] "Good luck with your research!"                                                                                             
## [15] "great hit"                                                                                                                 
## [16] "Great hit, good luck with your research."                                                                                  
## [17] "Great HIT!"                                                                                                                
## [18] "Great survey. Thank you!"                                                                                                  
## [19] "Hi"                                                                                                                        
## [20] "I accidentally clicked \"false\" on one of the \"some are black\" statements.  It was the one where around half were black"

self-reported native languages

## 
## American English          Egnlish          Enblish         Englashi 
##                8               13                8                8 
##          english          English          ENGLISH          englsih 
##             2410             2457              103               26 
##          englush         Enlglish           FRENCH         Japanese 
##               13                8               13                8 
##          Russian          Spanish            Tamil            white 
##                8                8                8               13

filter non-native speakers of Enblush

## 
## American English          Egnlish          Enblish         Englashi 
##                8               13                8                8 
##          english          English          ENGLISH          englsih 
##             2410             2457              103               26 
##          englush         Enlglish 
##               13                8
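
a possible version of this step with dplyr; the toy rows are hypothetical, the excluded entries are those that disappear between the two tables above:

```r
library(dplyr)

# toy stand-in for the full data set
d <- tibble(id = 1:4,
            language = c("English", "FRENCH", "englsih", "Tamil"))

# self-reported languages to exclude
non_english <- c("FRENCH", "Japanese", "Russian", "Spanish", "Tamil", "white")

# keep only rows whose language is not on the exclusion list
d_native <- filter(d, !language %in% non_english)
table(d_native$language)
```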

select relevant columns & rows

## # A tibble: 1,449 x 5
##       id    rt response nr_black variant
##    <int> <int>    <int>    <int> <chr>  
##  1     1  2574        1        6 C      
##  2     1  2093        1        9 C      
##  3     1  2543        1        3 C      
##  4     2  1857        4        5 B      
##  5     2 11454        4       10 B      
##  6     2  2053        4        3 B      
##  7     3  1479        1       10 D      
##  8     3  1640        0        0 D      
##  9     3  1199        1        7 D      
## 10     4  4828        6        6 B      
## # ... with 1,439 more rows

more intelligible column names

## # A tibble: 1,449 x 5
##       id    rt response condition variant
##    <int> <int>    <int>     <int> <chr>  
##  1     1  2574        1         6 C      
##  2     1  2093        1         9 C      
##  3     1  2543        1         3 C      
##  4     2  1857        4         5 B      
##  5     2 11454        4        10 B      
##  6     2  2053        4         3 B      
##  7     3  1479        1        10 D      
##  8     3  1640        0         0 D      
##  9     3  1199        1         7 D      
## 10     4  4828        6         6 B      
## # ... with 1,439 more rows
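
the last two steps, sketched with dplyr on toy rows modeled on the data shown earlier (that the original code filters on `type == "some"` is an assumption based on the design):

```r
library(dplyr)

# toy stand-in for the filtered data set
d <- tibble(
  id       = c(1L, 1L, 2L),
  language = c("English", "English", "English"),
  rt       = c(3930L, 2574L, 1857L),
  type     = c("filler", "some", "some"),
  response = c(1L, 1L, 4L),
  nr_black = c(5L, 6L, 5L),
  variant  = c("C", "C", "B"),
  comments = c("No", "No", "No")
)

d_some <- d %>%
  filter(type == "some") %>%                       # rows: critical sentences only
  select(id, rt, response, nr_black, variant) %>%  # columns: drop the rest
  rename(condition = nr_black)                     # syntax: new_name = old_name
d_some
```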

adding columns

## # A tibble: 1,449 x 6
##       id    rt response condition dependent.measure alternatives
##    <int> <int>    <int>     <int> <chr>             <fct>       
##  1     1  2574        1         6 binary            present     
##  2     1  2093        1         9 binary            present     
##  3     1  2543        1         3 binary            present     
##  4     2  1857        4         5 ordinal           absent      
##  5     2 11454        4        10 ordinal           absent      
##  6     2  2053        4         3 ordinal           absent      
##  7     3  1479        1        10 binary            absent      
##  8     3  1640        0         0 binary            absent      
##  9     3  1199        1         7 binary            absent      
## 10     4  4828        6         6 ordinal           absent      
## # ... with 1,439 more rows

rescale responses

## # A tibble: 1,449 x 6
##       id    rt response condition dependent.measure alternatives
##    <int> <int>    <dbl>     <int> <chr>             <fct>       
##  1     1  2574    1             6 binary            present     
##  2     1  2093    1             9 binary            present     
##  3     1  2543    1             3 binary            present     
##  4     2  1857    0.5           5 ordinal           absent      
##  5     2 11454    0.5          10 ordinal           absent      
##  6     2  2053    0.5           3 ordinal           absent      
##  7     3  1479    1            10 binary            absent      
##  8     3  1640    0             0 binary            absent      
##  9     3  1199    1             7 binary            absent      
## 10     4  4828    0.833         6 ordinal           absent      
## # ... with 1,439 more rows
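
a sketch of the last two steps with `mutate()`; the coding of variants B, C and D and the rescaling (response - 1) / 6 are read off the tables above, the coding of variant A is an assumption:

```r
library(dplyr)

# toy rows modeled on the output above
d <- tibble(id       = c(1L, 2L, 3L, 4L),
            response = c(1L, 4L, 0L, 6L),
            variant  = c("C", "B", "D", "B"))

d_coded <- d %>%
  mutate(
    # assumed coding: A and B use the rating scale, A and C include fillers
    dependent.measure = ifelse(variant %in% c("A", "B"), "ordinal", "binary"),
    alternatives = factor(ifelse(variant %in% c("A", "C"), "present", "absent"))
  ) %>%
  select(-variant) %>%
  # map the 7-point scale (1 to 7) onto [0, 1]; binary responses stay 0/1
  mutate(response = ifelse(dependent.measure == "ordinal",
                           (response - 1) / 6,
                           response))
d_coded
```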

get mean responses for dependent measures

## # A tibble: 2 x 2
##   dependent.measure mean.response
##   <chr>                     <dbl>
## 1 binary                    0.785
## 2 ordinal                   0.600

get mean responses

## # A tibble: 44 x 4
## # Groups:   dependent.measure, alternatives [?]
##    dependent.measure alternatives condition mean.response
##    <chr>             <fct>            <int>         <dbl>
##  1 binary            absent               0        0.0909
##  2 binary            absent               1        0.478 
##  3 binary            absent               2        0.778 
##  4 binary            absent               3        0.958 
##  5 binary            absent               4        0.964 
##  6 binary            absent               5        1     
##  7 binary            absent               6        0.938 
##  8 binary            absent               7        0.98  
##  9 binary            absent               8        0.929 
## 10 binary            absent               9        0.967 
## # ... with 34 more rows
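
both summaries follow the same group-then-summarise pattern; a sketch with dplyr on made-up responses:

```r
library(dplyr)

# made-up rescaled responses (not the real data)
d <- tibble(
  dependent.measure = c("binary", "binary", "binary", "ordinal", "ordinal"),
  alternatives      = c("absent", "absent", "absent", "absent", "absent"),
  condition         = c(0L, 0L, 1L, 5L, 5L),
  response          = c(0, 1, 1, 0.5, 1)
)

# one mean per dependent measure
by_measure <- d %>%
  group_by(dependent.measure) %>%
  summarise(mean.response = mean(response))

# one mean per design cell
by_cell <- d %>%
  group_by(dependent.measure, alternatives, condition) %>%
  summarise(mean.response = mean(response))
by_cell
```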

data visualization

a naked plot

plotting mean responses

plotting mean responses per treatment

plotting mean responses per treatment & dependent measure

some cosmetics
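
the plots themselves are not reproduced here; a sketch of how such a plot could be built up layer by layer with ggplot2, using made-up means and assumed axis labels:

```r
library(ggplot2)

# made-up mean responses per design cell (not the real results)
means <- data.frame(
  condition         = rep(c(0, 5, 10), times = 4),
  alternatives      = rep(rep(c("absent", "present"), each = 3), times = 2),
  dependent.measure = rep(c("binary", "ordinal"), each = 6),
  mean.response     = c(0.1, 1.0, 0.6, 0.1, 0.9, 0.5,
                        0.2, 0.7, 0.5, 0.2, 0.6, 0.4)
)

# a naked plot: data and axes, but no geometric layer yet
p_naked <- ggplot(means, aes(x = condition, y = mean.response))

p <- p_naked +
  geom_point(aes(color = alternatives)) +   # mean responses per treatment
  geom_line(aes(color = alternatives)) +
  facet_wrap(~ dependent.measure) +         # one panel per dependent measure
  labs(x = "number of black circles", y = "mean response") +   # some cosmetics
  theme_classic()
```

typing `p` at the console draws the plot.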

Rmarkdown

why Rmarkdown

 

  • prepare, analyze & plot data right inside your document

  • hand over all of your work in one single, easily executable chunk
    • support reproducible and open research
  • export to a variety of different formats

Rmarkdown formats

flow of information

 

Rmarkdown info flow

 

Rmarkdown formats

markdown

headers & sections

emphasis, highlighting etc.

links

inline code & code blocks

cheat sheet

Rmarkdown

extension of markdown to dynamically integrate R output

multiple output formats:

  • HTML pages, HTML slides (here), …
  • PDF, LaTeX, Word, …

cheat sheet and a quick tour

supports LaTeX

inline equations with $\theta$

equation blocks with

$$ \begin{align*} E &= mc^2 \\
& = \text{a really smart formula}
\end{align*} $$

 

caveat

LaTeX-style formulas will be rendered differently depending on the output method:

  • PDF output via LaTeX gives you genuine LaTeX with (almost) all of its capabilities
  • HTML output uses MathJax to emulate LaTeX-like behavior
    • only LaTeX-packages & functionality emulated in JS will be available

Rmarkdown in your homework

 

do it all in one file BDACM_HW1-LastnameFirstname.Rmd

use a header that generates HTML output, like this:

---
title: "My flawless first homework set"
date: 2018-11-30
output: html_document
---

follow the instructions given in the first homework assignment

send the *.Rmd and the *.HTML as a *.zip

avoid using extra material not included in the *.Rmd

fini

homework

 

 

  • complete the first homework assignment by Tuesday, November 6, 09:59 CET
    • follow the instructions given in the assignment sheet!