Using R

Michael Franke

topics for today

basics of R
tidyverse
tidy data
data wrangling
plotting
Rmarkdown

R for data science

R4DS

R4DS cover

freely available online: R for Data Science

data science?

data scientist

what R is (not)

special-purpose programming language for ~~data science~~ statistical computing
- statistics, data mining, data visualization
authority tells us not to think of R as a programming language!
think of it as a tool optimized for writing scripts that manipulate, plot and analyze data

diagram from 'R for Data Science'

past & present

a trusted old friend from 1993
still thriving
- see TIOBE ranking (based on search query results)

TIOBE index

extensibility & community support

a lot of innovation and development takes place in packages

go browse some 12,000 packages on CRAN

install packages (only once)

install.packages('tidyverse')

load packages (for every session)

library(tidyverse)

base R & package functions

base R functionality is always available

x = seq(from = 1, to = 10, length.out = 1000)
plot(x,x^2)

packages bring extra functions

library(ggplot2)
ggplot2::qplot(x,x^2)

tidyverse

overview of tidyverse

tidyverse website

RStudio

integrated development environment for R

RStudio screenshot

cheat sheet

basics of R

overview

basic properties of R
data types
- numbers, vectors & matrices
- characters & factors
- lists, data.frames & tibbles
probability distributions
functional programming elements
functions

for all base R stuff, check the R manual

general remarks about R

free (GNU General Public License)
interpreted language

6 * 7

## [1] 42

vector/matrix based

x = c(1,2,3)
x + 1

## [1] 2 3 4

supports object-oriented, procedural & functional styles
convenient interfaces to other languages
assignment in both directions possible

x <- 3
3 -> y
x == y

## [1] TRUE

help

help('qplot')

qplot {ggplot2} R Documentation
Quick plot

Description

qplot is a shortcut designed to be familiar if you're used to base plot(). It's a convenient
wrapper for creating a number of different types of plots using a consistent calling scheme.  
It's great for allowing you to produce plots quickly, but I highly recommend learning ggplot()
as it makes it easier to create complex graphics.

Usage

qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "",
  main = NULL, xlab = deparse(substitute(x)),
  ylab = deparse(substitute(y)), asp = NA, stat = NULL, position = NULL)

numbers, vectors & matrices

standard number precision is double

typeof(2)

## [1] "double"

vectors are declared using c()

x = c(10,20,30)
x

## [1] 10 20 30

everything is a vector (possibly length 1)

c(length(200), length("huhu"))

## [1] 1 1

indexing starts at 1

x[2]

## [1] 20
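
A small sketch of other common ways to index a vector. It re-creates the vector `x` from above so that it runs on its own; the particular selection of indexing forms is illustrative, not exhaustive.

```r
x = c(10, 20, 30)

x[c(1, 3)]   # select several positions at once: 10 30
x[-1]        # a negative index drops that position: 20 30
x[x > 15]    # a logical index keeps elements where the test is TRUE: 20 30
```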

numbers, vectors & matrices (2)

column-major mode

m = matrix(c(1,2,3,4,5,6), nrow = 2)
m

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m[1,]

## [1] 1 3 5
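
To make the column-major default concrete, the following sketch fills the same values once by column and once by row. The names `m_col` and `m_row` are made up for this illustration.

```r
# default: values fill the matrix column by column (column-major)
m_col = matrix(1:6, nrow = 2)
# byrow = TRUE fills the matrix row by row instead
m_row = matrix(1:6, nrow = 2, byrow = TRUE)

m_col[1, ]   # first row of the column-filled matrix: 1 3 5
m_row[1, ]   # first row of the row-filled matrix:    1 2 3
m_col[, 2]   # second column: 3 4
```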

vectors are treated as column vectors in matrix products

m %*% x ## matrix product

##      [,1]
## [1,]  220
## [2,]  280

character vectors and factors

strings are called characters

typeof("huhu")

## [1] "character"

vector of characters

chr.vector = c("huhu", "hello", "huhu", "ciao")
chr.vector

## [1] "huhu"  "hello" "huhu"  "ciao"

factors track levels

factor(chr.vector)

## [1] huhu  hello huhu  ciao 
## Levels: ciao hello huhu

ordered factors arrange their levels

factor(chr.vector, ordered = TRUE,
       levels = c("huhu", "ciao", "hello"))

## [1] huhu  hello huhu  ciao 
## Levels: huhu < ciao < hello
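
A brief sketch of what "tracking levels" means in practice: a factor stores integer codes that point into its vector of levels. The name `f.ord` is hypothetical and only used here.

```r
chr.vector = c("huhu", "hello", "huhu", "ciao")
f = factor(chr.vector)

levels(f)       # levels are sorted by default: "ciao" "hello" "huhu"
as.integer(f)   # each element is stored as an index into the levels: 3 2 3 1

# ordered factors additionally support comparisons between levels
f.ord = factor(chr.vector, ordered = TRUE,
               levels = c("huhu", "ciao", "hello"))
f.ord < "hello" # TRUE FALSE TRUE TRUE
```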

lists & data frames

lists are key-value pairs

my.list = list(dudu = 1,
               chacha = c("huhu", "ciao"))
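
Elements of a list can be retrieved by name or by position. A minimal sketch, repeating the definition of `my.list` so that it runs on its own:

```r
my.list = list(dudu = 1,
               chacha = c("huhu", "ciao"))

my.list$dudu            # access by name: 1
my.list[["chacha"]]     # double brackets also return the element: "huhu" "ciao"
my.list[["chacha"]][2]  # index into the retrieved vector: "ciao"
my.list[1]              # single brackets return a list of length 1, not the element
names(my.list)          # the keys: "dudu" "chacha"
```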

data frames as lists of same-length vectors

exp.data = data.frame(trial = 1:5,
                      condition = factor(c("C1", "C2", "C1", "C3", "C2"),
                                         ordered = TRUE),
                      response = c(121, 133, 119, 102, 156))
exp.data

##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

access columns

exp.data$condition

## [1] C1 C2 C1 C3 C2
## Levels: C1 < C2 < C3

access rows

exp.data[3,]

##   trial condition response
## 3     3        C1      119
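
Rows are often selected by a condition rather than by position. The following base R sketch re-creates `exp.data`; the threshold of 120 is an arbitrary choice for illustration.

```r
exp.data = data.frame(trial = 1:5,
                      condition = factor(c("C1", "C2", "C1", "C3", "C2"),
                                         ordered = TRUE),
                      response = c(121, 133, 119, 102, 156))

# all rows whose response exceeds 120 (trials 1, 2 and 5)
exp.data[exp.data$response > 120, ]

# responses in condition C1 only: 121 119
exp.data[exp.data$condition == "C1", "response"]

# a single cell, addressed by row number and column name: 133
exp.data[2, "response"]
```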

tibbles

tibbles are data frames in the tidyverse

as_tibble(exp.data)

## # A tibble: 5 x 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156

compare to previous data frame

exp.data

##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

some differences

my.tibble    = tibble(x = 1:10, y = x^2)      ## dynamic construction possible
my.dataframe = data.frame(x = 1:10, y = x^2)  ## fails (or silently uses a global x): y cannot refer to the new column x
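
A hedged sketch of two of these differences: how to get the same result in base R with a second step, and how single-column subsetting behaves. It assumes the `tibble` package is installed.

```r
library(tibble)

my.tibble = tibble(x = 1:10, y = x^2)   # y is computed from the new column x

my.dataframe   = data.frame(x = 1:10)   # base R: create x first ...
my.dataframe$y = my.dataframe$x^2       # ... then add the dependent column

# single-column subsetting with [ , ]
class(my.dataframe[, "x"])   # a data frame drops to a plain vector: "integer"
is_tibble(my.tibble[, "x"])  # a tibble stays a tibble: TRUE
```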

probability distributions in R

R has many built-in probability distributions
- normal distribution
- beta distribution
- …
additional distributions supplied by packages
- multi-variate normal
- Dirichlet
- …
each distribution mydist is associated with four functions:
1. dmydist(x, ...) gives the probability (mass/density) $f(x)$ for x
2. pmydist(x, ...) gives the cumulative distribution function $F(x)$ for x
3. qmydist(p, ...) gives the value $x$ for which p = pmydist(x, ...)
4. rmydist(n, ...) returns n samples from the distribution
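
For the normal distribution, the four functions are `dnorm`, `pnorm`, `qnorm` and `rnorm`. A minimal sketch; the seed and the sample size are arbitrary choices.

```r
dnorm(0, mean = 0, sd = 1)     # density f(0) = 1 / sqrt(2 * pi), roughly 0.3989
pnorm(0, mean = 0, sd = 1)     # cumulative probability F(0) = 0.5
qnorm(0.5, mean = 0, sd = 1)   # the quantile function inverts pnorm: 0
qnorm(pnorm(1.5))              # round trip returns 1.5

set.seed(123)                  # arbitrary seed, for reproducibility only
samples = rnorm(1000, mean = 1, sd = 0.5)
mean(samples)                  # close to the true mean of 1
```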

example

x = seq(-5, 5, length.out = 1000)
y = dnorm(x, mean = 1, sd = 0.5)
plot(x,y)

maps & pipes (tidyverse)

mapping

data = tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) )
map_dbl(data, mean)

##     IQ     RT 
## 113.75  75.75

piping

tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) ) %>% 
  map_dbl(mean)

##     IQ     RT 
## 113.75  75.75
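
The pipe passes its left-hand side as the first argument of the function on its right, so `x %>% f(y)` is the same as `f(x, y)`. A small sketch with made-up values; it loads `magrittr`, which provides `%>%` (it is also available after `library(tidyverse)`).

```r
library(magrittr)

v = c(1, 4, 9)

nested = sum(sqrt(v))            # read inside-out
piped  = v %>% sqrt() %>% sum()  # read left to right
nested == piped                  # TRUE, both are 6

# further arguments go into the call on the right-hand side
c(1.234, 5.678) %>% round(digits = 1)   # same as round(c(1.234, 5.678), digits = 1)
```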

functions

named custom functions

crazy.operation = function(x,y) {
  x+y
}
crazy.operation(2,3)

## [1] 5

anonymous functions

tibble(IQ = c(100,110,120,125), 
              RT = c(67,58,98,80) ) %>% 
  map_dbl(function(i) {max(i)-min(i)})

## IQ RT 
## 25 40
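
Anonymous functions can be written more compactly. The sketch below shows the same range computation in three spellings; it assumes `purrr` and `tibble` are installed, and the `\(i)` shorthand needs R 4.1 or later.

```r
library(purrr)
library(tibble)

data = tibble(IQ = c(100, 110, 120, 125),
              RT = c(67, 58, 98, 80))

r1 = map_dbl(data, function(i) {max(i) - min(i)})  # full anonymous function
r2 = map_dbl(data, ~ max(.x) - min(.x))            # purrr formula shorthand
r3 = map_dbl(data, \(i) max(i) - min(i))           # base R shorthand (R >= 4.1)

r1   # IQ 25, RT 40
```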

tidy (rectangular) data

types of data

data from experimental (psych) studies is usually rectangular data

examples of (usually) non-rectangular data:

image data
sound data
video data
corpora
…

the tidyverse is particularly efficient for dealing with tidy rectangular data

rectangular data

library(nycflights13)
nycflights13::flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

study Chapters 5 and 12 from R for Data Science

tidy data

each variable is a column
each observation is a row
each value is a cell

tidy data

untidy data 1

this is untidy if we want to analyze/plot grade as a function of exam type

grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))
grades

## # A tibble: 3 x 3
##   name    midterm final
##   <chr>     <dbl> <dbl>
## 1 Michael     3.7   4  
## 2 Noa         1     1.3
## 3 MadEye      1.3   1

to tidy up, we need to gather columns which are not separate variables into a new column

grades %>% gather('midterm', 'final', 
                  key = 'exam', value = 'grade')

## # A tibble: 6 x 3
##   name    exam    grade
##   <chr>   <chr>   <dbl>
## 1 Michael midterm   3.7
## 2 Noa     midterm   1  
## 3 MadEye  midterm   1.3
## 4 Michael final     4  
## 5 Noa     final     1.3
## 6 MadEye  final     1
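
Newer versions of `tidyr` (1.0.0 and later) offer `pivot_longer()` for the same reshaping, and `gather()` is kept for existing code. A sketch of the equivalent call, assuming such a version is installed; the name `grades.long` is made up here, and the rows come out grouped by student rather than by exam.

```r
library(tidyr)
library(tibble)

grades = tibble(name = c('Michael', 'Noa', 'MadEye'),
                midterm = c(3.7, 1.0, 1.3),
                final = c(4.0, 1.3, 1.0))

# collect the two exam columns into key column 'exam' and value column 'grade'
grades.long = pivot_longer(grades,
                           cols = c(midterm, final),
                           names_to = 'exam',
                           values_to = 'grade')
grades.long
```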

untidy data 2

this is untidy if we want to analyze grade as a function of participation

results = tibble(name = c('Michael', 'Noa', 'MadEye', 
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'), 
                            each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))
results

## # A tibble: 6 x 3
##   name    what          howmuch
##   <chr>   <chr>           <dbl>
## 1 Michael grade             3.7
## 2 Noa     grade             1  
## 3 MadEye  grade             1  
## 4 Michael participation    55  
## 5 Noa     participation   100  
## 6 MadEye  participation   100

to tidy up, we need to spread the values of one column out over several columns

results %>% spread(key = 'what', value = 'howmuch')

## # A tibble: 3 x 3
##   name    grade participation
##   <chr>   <dbl>         <dbl>
## 1 MadEye    1             100
## 2 Michael   3.7            55
## 3 Noa       1             100
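
The counterpart of `spread()` in newer versions of `tidyr` (1.0.0 and later) is `pivot_wider()`. A sketch of the equivalent call, assuming such a version is installed; the name `results.wide` is made up here, and the rows keep their order of first appearance instead of being sorted by name.

```r
library(tidyr)
library(tibble)

results = tibble(name = c('Michael', 'Noa', 'MadEye',
                          'Michael', 'Noa', 'MadEye'),
                 what = rep(c('grade', 'participation'),
                            each = 3),
                 howmuch = c(3.7, 1.0, 1.0, 55, 100, 100))

# turn each distinct value of 'what' into its own column, filled from 'howmuch'
results.wide = pivot_wider(results,
                           names_from = what,
                           values_from = howmuch)
results.wide
```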

ggplot

Layered grammar of graphics

structured description language for plots (relevant for data science)
smart system of defaults
multiple layers:
- data + transformation + geom. object + aesthetics

basic components:
- data
- coordinate system
- statistical transformation
  - means, standard errors, bins, …
- scales
  - continuous, discrete, …
- geometric object
  - how to visualize the data (points, bars, lines, …)
- aesthetic mapping
  - point shape, size, color, …
- facets

for background see Wickham (2010)

example

fully explicit

ggplot() +
  layer(
    data = diamonds,
    mapping = aes(x = carat, y = price),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()

with syntactic sugar and defaults

diamonds %>% ggplot(aes(carat, price)) + geom_point()
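
Further layers, aesthetics and facets are added with `+`. A sketch that extends the scatter plot above; the choice of `cut` for color, the linear smoother and faceting by `clarity` are arbitrary illustrations, not recommendations.

```r
library(ggplot2)

p = ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 0.3) +    # layer 1: points, color mapped to cut
  geom_smooth(method = "lm", formula = y ~ x) +  # layer 2: fitted line (statistical transformation)
  facet_wrap(~ clarity)                          # one panel per clarity level

p   # printing the object draws the plot
```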

general structure of a `ggplot` call

screenshot_cheat_sheet

from the cheat sheet

Rmarkdown

why Rmarkdown

prepare, analyze & plot data right inside your document
hand over all of your work in one single, easily executable chunk
- support reproducible and open research
export to a variety of different formats

Rmarkdown formats

flow of information

Rmarkdown info flow

Rmarkdown formats

markdown

headers & sections

# header 1
## header 2
### header 3

emphasis, highlighting etc.

*italics* or _italics_
**bold** or __bold__
~~strikeout~~

links

[link](https://www.google.com)

inline code & code blocks

`function(x) return(x - 1)`

cheat sheet

Rmarkdown

extension of markdown to dynamically integrate R output

multiple output formats:

HTML pages, HTML slides (here), …
PDF, LaTeX, Word, …

cheat sheet and a quick tour

supports LaTeX

inline equations with $\theta$

equation blocks with

$$ \begin{aligned} E &= mc^2 \\
& = \text{a really smart formula}
\end{aligned} $$

caveat

LaTeX-style formulas will be rendered differently depending on the output method:

PDF-LaTeX gives you genuine LaTeX with (almost) all abilities
HTML output uses MathJax to emulate LaTeX-like behavior
- only LaTeX-packages & functionality emulated in JS will be available
