Learning a new programming language entails first learning something about the kinds of objects (elements, first-class citizens) you will have to deal with.
Let’s therefore briefly go through the data types that are most important for our later purposes.
We will see how to deal with numeric information, Booleans, strings and so forth.
In general, we can assess the type of an object stored in variable x with the function typeof(x).
Let’s just try this for a bunch of things, just to give you an overview of some of R’s data types (not all of which are important to know about right from the start):
typeof(3) # returns type "double"
typeof(TRUE) # returns type "logical"
typeof(cars) # returns "list" (includes data.frames, tibbles, objects, ...)
typeof("huhu") # returns "character" (= string)
typeof(mean) # returns "closure" (= function)
typeof(c) # returns "builtin" (= deep system internal stuff)
typeof(round) # returns type "special" (= well, special stuff?)
If you really wonder, you can sometimes learn more about an object if you display its internal structure:
# `lm` is actually a function ("linear model")
# the function `str` compactly displays the structure of an object
# the result is then printed to screen
str(lm)
## function (formula, data, subset, weights, na.action, method = "qr", model = TRUE,
## x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
## offset, ...)
It is sometimes possible to cast objects of one type into another type XXX using functions as.XXX in base R or as_XXX in the tidyverse.
# casting Boolean value `TRUE` into number format
as.numeric(TRUE) # returns 1
## [1] 1
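A few more casts in base R, as a small sketch. The input values and variable names are made up for illustration. Notice that a cast which cannot succeed yields NA (together with a warning, suppressed here) rather than an error.

```r
# illustrative values only; variable names are hypothetical
num_as_string <- as.character(42)  # number to string: "42"
string_as_int <- as.integer("7")   # string to integer: 7
zero_as_bool  <- as.logical(0)     # number to Boolean: FALSE
# a cast that cannot succeed yields NA (and a warning)
failed_cast <- suppressWarnings(as.numeric("huhu"))
is.na(failed_cast)                 # TRUE
```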
Casting can also happen implicitly. The expressions TRUE and FALSE are built-in constants for the Boolean values “true” and “false”. But when we use them in mathematical expressions, R silently converts them to the numbers 1 and 0, so that we can do math with them, like so:
TRUE + TRUE + FALSE + TRUE + TRUE
## [1] 4
R is essentially an array-based language. Arrays are arbitrary but finite-dimensional matrices. We will discuss what is usually referred to as vectors (= one-dimensional arrays), matrices (= two-dimensional arrays), and arrays (= more-than-two-dimensional) in this section with a focus on numeric information. But it is important to keep in mind that arrays can contain objects of other types than numeric information (as long as all objects in the array are of the same type).
The standard number format in R is double.
typeof(3)
## [1] "double"
We can also represent numbers as integers or as complex numbers.
typeof(as.integer(3)) # returns 'integer'
## [1] "integer"
typeof(as.complex(3)) # returns 'complex'
## [1] "complex"
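As a side note not covered in the text above, base R also lets us write integers directly by adding the suffix L to a number. A minimal sketch (the variable names are made up for illustration):

```r
# the suffix L marks a number literal as an integer
type_int_literal <- typeof(3L)       # "integer"
type_int_sum     <- typeof(3L + 1L)  # "integer": integer arithmetic stays integer
type_mixed_sum   <- typeof(3L + 1)   # "double": mixing with a double yields a double
type_colon       <- typeof(1:3)      # "integer": the colon operator produces integers
```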
As a generally useful heuristic, expect all numerical information to be treated as a vector (or higher-order: matrix, array, …; see below), and expect any (basic, mathematical) operation in R to (most likely) apply to the whole vector, matrix, array, or collection.[7] This makes it possible to ask for the length of a variable to which we assign a single number, for instance:
x <- 7
length(x)
## [1] 1
We can even index such a variable:
x <- 7
x[1] # what is the entry in position 1 of the vector x?
## [1] 7
Or assign a new value to a hitherto unused index:
x[3] <- 6 # assign the value 6 to the 3rd entry of vector x
x # notice that the 2nd entry is undefined, or "NA", not available
## [1] 7 NA 6
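Missing entries like this one propagate through most computations. The following sketch, using only base R functions, shows how to detect NA entries and how to ignore them when summing:

```r
x <- 7
x[3] <- 6                 # x is now 7 NA 6
is.na(x)                  # FALSE TRUE FALSE
sum(x)                    # NA: the missing value propagates
sum(x, na.rm = TRUE)      # 13: missing values are removed first
```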
Vectors in general can be declared with the built-in function c(). To memorize this, think of concatenation or combination.
x <- c(4, 7, 1, 1) # this is now a 4-place vector
x
## [1] 4 7 1 1
There are also helpful functions to generate sequences of numbers:
1:10 # returns 1, 2, 3, ..., 10
seq(from = 1, to = 10, by = 1) # returns 1, 2, 3, ..., 10
seq(from = 1, to = 10, by = 0.5) # returns 1, 1.5, 2, ..., 9.5, 10
seq(from = 0, to = 1 , length.out = 11) # returns 0, 0.1, ..., 0.9, 1
Indexing in R starts with 1, not 0!
x <- c(4, 7, 1, 1) # this is now a 4-place vector
x[2]
## [1] 7
And now we see what is meant above when we said that (almost) every mathematical operation can be expected to apply to a vector:
x <- c(4, 7, 1, 1) # 4-placed vector as before
x + 1
## [1] 5 8 2 2
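The same holds for operations between two vectors of equal length, which are applied element by element. A small sketch (the second vector y is made up for illustration):

```r
x <- c(4, 7, 1, 1)
y <- c(1, 2, 3, 4)   # hypothetical second vector
x + y                # 5 9 4 5
x * y                # 4 14 3 4 (element-wise, not a dot product)
x^2                  # 16 49 1 1
```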
Exercise 2.5
Create a vector that contains all even numbers from 0 to 20 and assign it to a variable. Now transform the variable such that it contains only odd numbers up to 20 using a mathematical operation. Notice that the numbers above 20 should not be included! [Hint: use indexing.]
a <- seq(from = 0, to = 20, by = 2)
a <- a + 1
a <- a[1:10]
a
## [1] 1 3 5 7 9 11 13 15 17 19
Matrices are declared with the function matrix. This function takes, for instance, a vector as an argument.
x <- c(4, 7, 1, 1) # 4-placed vector as before
(m <- matrix(x)) # cast x into matrix format
## [,1]
## [1,] 4
## [2,] 7
## [3,] 1
## [4,] 1
Notice that the result is a matrix with a single column. This is important. R uses so-called column-major mode.[8] This means that it will fill columns first. For example, a matrix with three columns based on a six-placed vector 1, 2, …, 6 will be built by filling the first column from top to bottom, then the second column top to bottom, and so on.[9]
m <- matrix(1:6, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
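If we want the rows to be filled first instead, base R's matrix function accepts the argument byrow. A minimal sketch (the variable name m_by_row is made up for illustration):

```r
# fill the matrix row by row instead of column by column
m_by_row <- matrix(1:6, ncol = 3, byrow = TRUE)
m_by_row[1, ]   # first row: 1 2 3
m_by_row[2, ]   # second row: 4 5 6
```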
In line with a column-major mode, vectors are treated as column vectors in matrix operations:
x <- c(1, 0, 1) # 3-place vector
m %*% x # matrix product of previous matrix 'm' with x
## [,1]
## [1,] 6
## [2,] 8
As usual, and independently of a column- or row-major mode, matrix indexing starts with the row index:
m[1,] # produces first row of matrix 'm'
## [1] 1 3 5
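Columns and single cells can be retrieved in the same way. A short sketch, using the matrix m from above:

```r
m <- matrix(1:6, ncol = 3)
m[, 2]    # second column: 3 4
m[2, 3]   # entry in row 2, column 3: 6
dim(m)    # number of rows and columns: 2 3
```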
Exercise 2.6
Create a sequence of 9 numbers, equally spaced, starting from 0 and ending with 1. Assign this sequence to a vector called x. Now, create a matrix, stored in variable X, with three columns and three rows that contain the numbers of this vector in the usual column-major fashion.
x <- seq(from = 0, to = 1, length.out = 9)
X <- matrix(x, ncol = 3)
X
## [,1] [,2] [,3]
## [1,] 0.000 0.375 0.750
## [2,] 0.125 0.500 0.875
## [3,] 0.250 0.625 1.000
We have not yet covered this, but give it a try and guess what might be a convenient and very short statement to compute the sum of all numbers in matrix X.
sum(X)
## [1] 4.5
Arrays are simply higher-dimensional matrices. We will not make (prominent) use of arrays in this book.
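Just for illustration, here is a minimal sketch of a three-dimensional array created with the base R function array (the variable name arr is made up). The entries are filled in the same column-major fashion as for matrices, i.e., the first index moves fastest.

```r
# a 2 x 3 x 4 array holding the numbers 1, ..., 24
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)       # 2 3 4
arr[1, 1, 2]   # 7: first entry of the second 2 x 3 "slice"
arr[2, 3, 4]   # 24: the very last entry
```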
The positions in a vector can be given names. This is extremely useful for good “literate coding” and therefore highly recommended. The names of vector x’s positions are retrieved and set by the names function:[10]
students <- c("Jax", "Jamie", "Jason") # names of students
grades <- c(1.3, 2.7, 2.0) # a vector of grades
names(grades) # retrieve names: with no names so far
## NULL
names(grades) <- students # assign names
names(grades) # retrieve names again: names assigned
## [1] "Jax" "Jamie" "Jason"
grades # output shows names
## Jax Jamie Jason
## 1.3 2.7 2.0
But we can also do this in one swoop, like so:
c(Jax = 1.3, Jamie = 2.7, Jason = 2.0)
## Jax Jamie Jason
## 1.3 2.7 2.0
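One practical benefit of names is that we can use them for indexing instead of numeric positions. A small sketch with the grades from above:

```r
grades <- c(Jax = 1.3, Jamie = 2.7, Jason = 2.0)
grades["Jamie"]              # the entry named "Jamie" (name is kept)
grades[c("Jason", "Jax")]    # several entries, in the requested order
unname(grades["Jamie"])      # just the value 2.7, without the name
```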
Names for matrices are retrieved or set with functions rownames and colnames.
# declare matrix
m <- matrix(1:6, ncol = 3)
# assign row and column names, using function
# `str_c` which is described below
rownames(m) <- str_c("row", 1:nrow(m), sep = "_")
colnames(m) <- str_c("col", 1:ncol(m), sep = "_")
m
## col_1 col_2 col_3
## row_1 1 3 5
## row_2 2 4 6
There are built-in names for Boolean values “true” and “false”, predictably named TRUE and FALSE. Equivalent shortcuts are T and F. If we attempt to do math with Boolean vectors, the outcome is what any reasonable logician would expect:
x <- c(T, F, T)
1 - x
## [1] 0 1 0
x + 3
## [1] 4 3 4
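Doing math with Booleans is handy for counting. In the following sketch (the vector answers_correct is made up for illustration), summing a Boolean vector counts its TRUE entries, and taking the mean gives their proportion:

```r
answers_correct <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical data
sum(answers_correct)    # number of TRUE entries: 3
mean(answers_correct)   # proportion of TRUE entries: 0.75
```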
Boolean vectors can be used as index sets to extract elements from other vectors.
# vector 1, 2, ..., 5
number_vector <- 1:5
# index of odd numbers set to `TRUE`
boolean_vector <- c(T, F, T, F, T)
# returns the elements from number vector, for which
# the corresponding element in the Boolean vector is true
number_vector[boolean_vector]
## [1] 1 3 5
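In practice, we rarely write out the Boolean vector by hand. More commonly it is the result of a comparison, as in this sketch using base R operators:

```r
number_vector <- 1:5
number_vector > 2                          # FALSE FALSE TRUE TRUE TRUE
number_vector[number_vector > 2]           # 3 4 5
number_vector[number_vector %% 2 == 1]     # odd numbers: 1 3 5
which(number_vector > 2)                   # positions of matches: 3 4 5
```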
There are a couple of keywords reserved in R for special kinds of objects:
NA: “not available”; represents missing values in data
NaN: “not a number”; e.g., the result of dividing zero by zero
Inf or -Inf: infinity and negative infinity; returned when a number is too big or when a nonzero number is divided by zero
NULL: the NULL object; often returned when a function is undefined for the provided input
Strings are called characters in R. We will be stubborn and call them strings for most of the time here. We can assign a string value to a variable by putting the string in double-quotes:
x <- "huhu"
typeof(x)
## [1] "character"
We can create vectors of characters in the obvious way:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
chr_vector
## [1] "huhu" "hello" "huhu" "ciao"
The package stringr from the tidyverse also provides very useful and, in comparison to base R, more uniform functions for string manipulation. The cheat sheet for the stringr package is highly recommended for a quick overview. Below are some examples.
Function str_c concatenates strings:
str_c("Hello", "Hi", "Hey", sep = "! ")
## [1] "Hello! Hi! Hey"
We can find the indices of matches in a character vector with str_which:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_which(chr_vector, "hu")
## [1] 1 3
Similarly, str_detect gives a Boolean vector indicating which elements match:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_detect(chr_vector, "hu")
## [1] TRUE FALSE TRUE FALSE
If we want to get the strings matching a pattern, we can use str_subset:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_subset(chr_vector, "hu")
## [1] "huhu" "huhu"
Replacing all matches with another string works with str_replace_all:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_replace_all(chr_vector, "h", "B")
## [1] "BuBu" "Bello" "BuBu" "ciao"
For data preparation, we often need to split strings by a particular character. For instance, a set of reaction times could be separated by the vertical bar character “|”. We can split this string representation to get individual measurements like so:
# three measures of reaction time in a single string
reaction_times <- "123|234|345"
# notice that we need to doubly (!) escape character |
# notice also that the result is a list (see below)
str_split(reaction_times, "\\|", n = 3)
## [[1]]
## [1] "123" "234" "345"
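The pieces we get are still strings. To compute with them, we need to pick the vector out of the list and cast it to numbers. The following sketch does this with base R's strsplit instead of str_split, so that it runs without loading stringr. Setting fixed = TRUE treats “|” literally, so no escaping is needed. The variable names rt_strings and rt_numbers are made up for illustration.

```r
reaction_times <- "123|234|345"
# split at the literal character "|" and take the first (only) list element
rt_strings <- strsplit(reaction_times, "|", fixed = TRUE)[[1]]
rt_numbers <- as.numeric(rt_strings)   # cast strings to numbers
rt_numbers                             # 123 234 345
mean(rt_numbers)                       # 234
```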
Factors are special vectors, which treat their elements as instances of a finite set of categories.
To create a factor, we can use the function factor.
The following code creates a factor from a character vector.
Notice that, when printing, we get information about the kinds of entries (= categories) that occurred in the original character vector:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
factor(chr_vector)
## [1] huhu hello huhu ciao
## Levels: ciao hello huhu
For plotting or other representational purposes, it can help to manually specify an ordering on the levels of a factor using the levels argument:
# the order of levels is changed manually
factor(chr_vector, levels = c("huhu", "ciao", "hello"))
## [1] huhu hello huhu ciao
## Levels: huhu ciao hello
Even though we specified an ordering among factor levels, the last code chunk nonetheless creates what R treats as an unordered factor.
There are also genuine ordered factors.
An ordered factor is created by setting the argument ordered = T, and optionally also specifying a specific ordering of factor levels, like so:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
factor(
  chr_vector, # the vector to treat as factor
  ordered = T, # make sure it's treated as ordered factor
  levels = c("huhu", "ciao", "hello") # specify order of levels by hand
)
## [1] huhu hello huhu ciao
## Levels: huhu < ciao < hello
Having both unordered and ordered factors is useful for representing data from experiments, e.g., from categorical or ordinal variables (see Chapter 3). The difference between an unordered factor with explicit ordering information and an ordered factor is subtle and not important in the beginning. (This only matters, for example, in the context of regression modeling.)
Factors are trickier to work with than mere vectors because they are rigid about the represented factor levels. Adding an item that does not belong to any of a factor’s levels leads to trouble:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
my_factor <- factor(
  chr_vector, # the vector to treat as factor
  ordered = T, # make sure it's treated as ordered factor
  levels = c("huhu", "ciao", "hello") # specify order of levels
)
my_factor[5] <- "huhu" # adding a "known category" is okay
my_factor[6] <- "moin" # adding an "unknown category" does not work
my_factor
## [1] huhu hello huhu ciao huhu <NA>
## Levels: huhu < ciao < hello
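To see why factors behave this way, it can help to look at how they are represented. The following sketch, using only base R, shows that a factor stores integer codes pointing into its vector of levels, and that the levels of an ordered factor can be compared:

```r
my_factor <- factor(
  c("huhu", "hello", "huhu", "ciao"),
  ordered = TRUE,
  levels = c("huhu", "ciao", "hello")
)
levels(my_factor)       # "huhu" "ciao" "hello"
as.integer(my_factor)   # 1 3 1 2: positions in the vector of levels
my_factor < "hello"     # TRUE FALSE TRUE TRUE (only for ordered factors)
```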
The forcats package from the tidyverse helps in dealing with factors. You should check the Cheat Sheet for more helpful functionality. Here is an example of how to expand the levels of a factor:
chr_vector <- c("huhu", "hello", "huhu", "ciao")
my_factor <- factor(
  chr_vector, # the vector to treat as factor
  ordered = T, # make sure it's treated as ordered factor
  levels = c("huhu", "ciao", "hello") # specify order of levels
)
my_factor[5] <- "huhu" # adding a "known category" is okay
my_factor <- fct_expand(my_factor, "moin") # add new category
my_factor[6] <- "moin" # adding new item now works
my_factor
## [1] huhu hello huhu ciao huhu moin
## Levels: huhu < ciao < hello < moin
It is sometimes useful (especially for plotting) to flexibly reorder the levels of an ordered factor. Here are some useful functions from the forcats package:
my_factor # original factor
## [1] huhu hello huhu ciao huhu moin
## Levels: huhu < ciao < hello < moin
fct_rev(my_factor) # reverse level order
## [1] huhu hello huhu ciao huhu moin
## Levels: moin < hello < ciao < huhu
fct_relevel( # manually supply new level order
  my_factor,
  c("hello", "ciao", "huhu")
)
## [1] huhu hello huhu ciao huhu moin
## Levels: hello < ciao < huhu < moin
Lists are key-value pairs. They are created with the built-in function list. The difference between a list and a named vector is that in the latter, all elements must be of the same type. In a list, the elements can be of arbitrary type. They can also be vectors or even lists themselves. For example:
my_list <- list(
  single_number = 42,
  chr_vector = c("huhu", "ciao"),
  nested_list = list(x = 1, y = 2, z = 3)
)
my_list
## $single_number
## [1] 42
##
## $chr_vector
## [1] "huhu" "ciao"
##
## $nested_list
## $nested_list$x
## [1] 1
##
## $nested_list$y
## [1] 2
##
## $nested_list$z
## [1] 3
To access a list element by its name (= key), we can use the $ sign followed by the unquoted name, double square brackets [[ "name" ]] with the quoted name inside, or indices in double brackets, like so:
# all of these return the same list element
my_list$chr_vector
## [1] "huhu" "ciao"
my_list[["chr_vector"]]
## [1] "huhu" "ciao"
my_list[[2]]
## [1] "huhu" "ciao"
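A common stumbling block is the difference between double and single square brackets, which the following base R sketch illustrates: double brackets return the element itself, whereas single brackets return a (shorter) list containing that element. Nested elements can be reached by chaining the access operators.

```r
my_list <- list(
  single_number = 42,
  chr_vector = c("huhu", "ciao"),
  nested_list = list(x = 1, y = 2, z = 3)
)
typeof(my_list[["chr_vector"]])   # "character": the element itself
typeof(my_list["chr_vector"])     # "list": a list with one element
my_list$nested_list$y             # 2: access into the nested list
names(my_list)                    # the keys of the list
```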
Lists are very important in R because almost all structured data that belongs together is stored as lists. Objects are special kinds of lists. Data is stored in special kinds of lists, so-called data frames or so-called tibbles.
A data frame is base R’s standard format to store data in. A data frame is a list of vectors of equal length. Data frames are created with the function data.frame:
# fake experimental data
exp_data <- data.frame(
  trial = 1:5,
  condition = factor(
    c("C1", "C2", "C1", "C3", "C2"),
    ordered = T
  ),
  response = c(121, 133, 119, 102, 156)
)
exp_data
## trial condition response
## 1 1 C1 121
## 2 2 C2 133
## 3 3 C1 119
## 4 4 C3 102
## 5 5 C2 156
Exercise 2.7
Create a vector a that contains the names of three of your best (imaginary) friends and a vector b with their (imaginary) age. Create a data frame that represents this information (one column with names and one with respective age). Notice that column names should represent the information they contain!
a <- c("M", "N", "H")
b <- c(23, 41, 13)
best_friends <- data.frame(name = a, age = b)
best_friends
## name age
## 1 M 23
## 2 N 41
## 3 H 13
We can access columns of a data frame, just like we access elements in a list. Additionally, we can also use index notation, like in a matrix:
# gives the value of the cell in row 2, column 3
exp_data[2, 3] # returns 133
## [1] 133
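To make the list-like access concrete, here is a small sketch with a reduced version of the fake data from above (the condition column is left out for brevity). Whole columns come back as ordinary vectors, so all vector operations apply to them.

```r
# reduced version of the fake experimental data
exp_data <- data.frame(
  trial = 1:5,
  response = c(121, 133, 119, 102, 156)
)
exp_data$response         # whole column as a vector
exp_data[["response"]]    # the same, with double brackets
mean(exp_data$response)   # 126.2
exp_data[2, ]             # the whole second row (a data frame with one row)
```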
Exercise 2.8
Display the column of names of your (imaginary) friends from the best_friends data frame.
best_friends["name"]
## name
## 1 M
## 2 N
## 3 H
best_friends[1]
## name
## 1 M
## 2 N
## 3 H
Now show only the names of friends who are younger than 22 (or some other age that makes sense for your friends and their ages). [Hint: you can write x <= 22 to get a Boolean vector of the same length as x with an entry TRUE at all indices where x is no bigger than 22.]
best_friends[best_friends$age <= 22, "name"]
## [1] "H"
In RStudio, you can inspect data in data frames (and tibbles, see below) with the function View.
Tibbles are the tidyverse counterpart of data frames. We can cast a data frame into a tibble, using as_tibble. Notice that the information shown for a tibble is much richer than what is provided when printing the content of a data frame.
as_tibble(exp_data)
## # A tibble: 5 × 3
## trial condition response
## <int> <ord> <dbl>
## 1 1 C1 121
## 2 2 C2 133
## 3 3 C1 119
## 4 4 C3 102
## 5 5 C2 156
We can also create a tibble directly with the function tibble. Indeed, the creation of tibbles is conveniently more flexible than the creation of data frames: the former allows dynamic look-up of previously defined elements.
my_tibble <- tibble(x = 1:10, y = x^2) # dynamic construction possible
my_dataframe <- data.frame(x = 1:10, y = x^2) # ERROR :/
Another important difference between data frames and tibbles concerns the default treatment of character (= string) vectors. When reading in data from a CSV file as a data frame (using function read.csv) in R versions before 4.0.0, each character vector is treated as a factor by default; since R 4.0.0, this only happens when the argument stringsAsFactors = TRUE is set. When using read_csv to read CSV data into a tibble, character vectors are not treated as factors.
There is also a very convenient function, called tribble, which allows you to create a tibble by explicitly writing out the information in the rows.
hw_points <- tribble(
  ~hw_nr, ~Jax, ~Jamie, ~Jason,
  "HW1", 33, 24, 17,
  "HW2", 41, 23, 8
)
hw_points
## # A tibble: 2 × 4
## hw_nr Jax Jamie Jason
## <chr> <dbl> <dbl> <dbl>
## 1 HW1 33 24 17
## 2 HW2 41 23 8
Exercise 2.9
Assign to the variable bff a tibble with the following columns (with reasonable names): at least four names of your (imaginary) best friends, their current country of residence, their age, and a Boolean column storing whether they are not older than 23. Ideally, use dynamic construction and the <= operator as in previous exercises.
bff <- tibble(
  name = c("A", "B", "C", "D"),
  residence = c("UK", "JP", "CH", "JA"),
  age = c(24, 45, 72, 12),
  young = age <= 23
)
bff
## # A tibble: 4 × 4
## name residence age young
## <chr> <chr> <dbl> <lgl>
## 1 A UK 24 FALSE
## 2 B JP 45 FALSE
## 3 C CH 72 FALSE
## 4 D JA 12 TRUE
[7] If you are familiar with Python’s scipy and numpy packages, this is R’s default mode of treating numerical information.
[8] Python (more precisely, its numpy package), on the other hand, uses the reverse row-major mode by default.
[9] It is in this sense that the “first index moves fastest” in column-major mode, which is another frequently given explanation of column-major mode.
[10] Notice that we can create strings (actually called ‘characters’ in R) with double quotes.