2.2 Data types

Learning a new programming language starts with learning what kinds of objects (elements, first-class citizens) you will have to deal with. Let’s therefore briefly go through the data types that are most important for our later purposes. We will see how to deal with numeric information, Booleans, strings and so forth. In general, we can check the type of an object stored in variable x with the function typeof(x). Let’s try this for a bunch of things to get an overview of some of R’s data types (not all of which are important to know about right from the start):

typeof(3)        # returns type "double"
typeof(TRUE)     # returns type "logical"
typeof(cars)     # returns "list" (includes data.frames, tibbles, objects, ...)
typeof("huhu")   # returns "character" (= string) 
typeof(mean)     # returns "closure" (= function)
typeof(c)        # returns "builtin" (= deep system internal stuff)
typeof(round)    # returns type "special" (= well, special stuff?)
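
The type reported by typeof describes how an object is stored internally. As a small sketch that goes slightly beyond the text above, the base R function class reports how R treats the object at a higher level; the variable name f is only an illustration:

```r
# `cars` is a data set that ships with R
typeof(cars)   # "list": how the object is stored
class(cars)    # "data.frame": how R treats it

f <- factor(c("a", "b"))   # a factor (introduced later in this section)
typeof(f)      # "integer": factors are stored as integer codes
class(f)       # "factor"
```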

If you really wonder, you can sometimes learn more about an object if you display its internal structure with the function str:

# `lm` is actually a function ("linear model")
# the function `str` compactly displays the structure of an object
# (for a function: its arguments and their default values)
str(lm)
## function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, 
##     x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, 
##     offset, ...)

It is sometimes possible to cast objects of one type into another type XXX using functions as.XXX in base R or as_XXX in the tidyverse.

# casting Boolean value `TRUE` into number format 
as.numeric(TRUE)  # returns 1
## [1] 1
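
A few more casts in the same spirit, as a minimal sketch using only base R functions of the form as.XXX. The variable name bad is hypothetical; the last cast fails, so R returns NA together with a warning, which is suppressed here to keep the output clean:

```r
as.integer("42")      # the string "42" becomes the integer 42
as.character(3.5)     # the number 3.5 becomes the string "3.5"
as.logical("TRUE")    # the string "TRUE" becomes the Boolean TRUE
as.numeric(FALSE)     # returns 0

# a string that does not look like a number cannot be cast
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)            # TRUE: the result is the missing value NA
```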

Casting can also happen implicitly. The expressions TRUE and FALSE are built-in constants for the Boolean values “true” and “false”. But when we use them in mathematical expressions, R silently converts them to the numbers 1 and 0, so that we can do math with them, like so:

TRUE + TRUE + FALSE + TRUE + TRUE
## [1] 4
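
This implicit conversion is handy for counting. The following sketch assumes a hypothetical Boolean vector correct that records whether four answers were right:

```r
# hypothetical data: was each of four answers correct?
correct <- c(TRUE, FALSE, TRUE, TRUE)

sum(correct)    # 3: number of TRUE entries
mean(correct)   # 0.75: proportion of TRUE entries
```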

2.2.1 Numeric vectors & matrices

R is essentially an array-based language. Arrays are arbitrary but finite-dimensional matrices. We will discuss what is usually referred to as vectors (= one-dimensional arrays), matrices (= two-dimensional arrays), and arrays (= more-than-two-dimensional) in this section with a focus on numeric information. But it is important to keep in mind that arrays can contain objects of other types than numeric information (as long as all objects in the array are of the same type).
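
The requirement that all entries are of the same type can be observed directly. In the following sketch (variable names are illustrative), R converts mixed input to the most general type involved instead of raising an error:

```r
mixed_num <- c(1, TRUE)        # Boolean is converted to a number
mixed_num                      # 1 1
typeof(mixed_num)              # "double"

mixed_chr <- c(1, TRUE, "a")   # everything is converted to a string
mixed_chr                      # "1" "TRUE" "a"
typeof(mixed_chr)              # "character"
```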

2.2.1.1 Numeric information

Standard number format in R is double.

typeof(3)
## [1] "double"

We can also represent numbers as integers and complex.

typeof(as.integer(3))    # returns 'integer'
## [1] "integer"
typeof(as.complex(3))    # returns 'complex'
## [1] "complex"
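
As a short sketch of how these number formats show up in practice: the suffix L marks an integer literal, the colon operator produces integers, and division of integers yields a double.

```r
typeof(3L)             # "integer": the suffix L marks an integer literal
typeof(1:3)            # "integer": sequences from `:` are integers
typeof(3L + 1L)        # "integer"
typeof(3L / 2L)        # "double": division leaves the integers
sqrt(as.complex(-1))   # 0+1i: square roots of negative numbers need complex
```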

2.2.1.2 Numeric vectors

As a generally useful heuristic, expect all numerical information to be treated as a vector (or higher-order: matrix, array, … ; see below), and expect any (basic, mathematical) operation in R to (most likely) apply to the whole vector, matrix, array, collection.7 This makes it possible to ask for the length of a variable to which we assign a single number, for instance:

x <- 7
length(x)
## [1] 1

We can even index such a variable:

x <- 7
x[1]     # what is the entry in position 1 of the vector x?
## [1] 7

Or assign a new value to a hitherto unused index:

x[3] <- 6   # assign the value 6 to the 3rd entry of vector x
x           # notice that the 2nd entry is undefined, or "NA", not available
## [1]  7 NA  6

Vectors in general can be declared with the built-in function c(). To remember this, think of concatenation or combination.

x <- c(4, 7, 1, 1)   # this is now a 4-place vector
x
## [1] 4 7 1 1

There are also helpful functions to generate sequences of numbers:

1:10                                     # returns 1, 2, 3, ..., 10
seq(from = 1, to = 10, by = 1)           # returns 1, 2, 3, ..., 10
seq(from = 1, to = 10, by = 0.5)         # returns 1, 1.5, 2, ..., 9.5, 10
seq(from = 0, to = 1 , length.out = 11)  # returns 0, 0.1, ..., 0.9, 1

Indexing in R starts with 1, not 0!

x <- c(4, 7, 1, 1)   # this is now a 4-place vector
x[2]
## [1] 7

And now we see what is meant above when we said that (almost) every mathematical operation can be expected to apply to a vector:

x <- c(4, 7, 1, 1)   # 4-placed vector as before
x + 1
## [1] 5 8 2 2
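
The same holds for operations on two vectors, which are applied element by element. A minimal sketch, where the second vector y is made up for illustration; if one vector is shorter, R reuses ("recycles") its entries:

```r
x <- c(4, 7, 1, 1)
y <- c(1, 2, 3, 4)   # hypothetical second vector of the same length

x + y          # 5 9 4 5: element-wise addition
x * y          # 4 14 3 4: element-wise multiplication
x * c(1, 10)   # 4 70 1 10: the shorter vector is recycled
sum(x)         # 13: some functions reduce a vector to a single number
```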

Exercise 2.5

Create a vector that contains all even numbers from 0 to 20 and assign it to a variable. Now transform the variable such that it contains only odd numbers up to 20, using a mathematical operation. Notice that numbers above 20 should not be included! [Hint: use indexing.]

a <- seq(from = 0, to = 20, by = 2) 
a <- a + 1
a <- a[1:10]
a
##  [1]  1  3  5  7  9 11 13 15 17 19

2.2.1.3 Numeric matrices

Matrices are declared with the function matrix. This function takes, for instance, a vector as an argument.

x <- c(4, 7, 1, 1)     # 4-placed vector as before
(m <- matrix(x))       # cast x into matrix format
##      [,1]
## [1,]    4
## [2,]    7
## [3,]    1
## [4,]    1

Notice that the result is a matrix with a single column. This is important. R uses so-called column-major mode.8 This means that it will fill columns first. For example, a matrix with three columns based on a six-placed vector 1, 2, \(\dots\), 6 will be built by filling the first column from top to bottom, then the second column top to bottom, and so on.9

m <- matrix(1:6, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
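
If the data should be filled in row by row instead, the function matrix offers the argument byrow. A small sketch (the variable name m_by_row is illustrative):

```r
# fill the matrix row by row instead of column by column
m_by_row <- matrix(1:6, ncol = 3, byrow = TRUE)
m_by_row[1, ]   # 1 2 3: the first row now holds the first three numbers
m_by_row[2, ]   # 4 5 6
```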

In line with a column-major mode, vectors are treated as column vectors in matrix operations:

x <- c(1, 0, 1)  # 3-place vector
m %*% x          # matrix product of previous matrix 'm' with 'x'
##      [,1]
## [1,]    6
## [2,]    8

As usual, and independently of a column- or row-major mode, matrix indexing starts with the row index:

m[1,]   # produces first row of matrix 'm'
## [1] 1 3 5
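
A few more indexing patterns for the same matrix, as a minimal sketch using base R only:

```r
m <- matrix(1:6, ncol = 3)

m[2, 3]   # 6: single cell in row 2, column 3
m[, 2]    # 3 4: second column (row index left empty)
dim(m)    # 2 3: number of rows and number of columns
```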

Exercise 2.6

Create a sequence of 9 numbers, equally spaced, starting from 0 and ending with 1. Assign this sequence to a vector called x. Now, create a matrix, stored in variable X, with three columns and three rows that contain the numbers of this vector in the usual column-major fashion.

x <- seq(from = 0, to = 1, length.out = 9)
X <- matrix(x, ncol = 3)
X
##       [,1]  [,2]  [,3]
## [1,] 0.000 0.375 0.750
## [2,] 0.125 0.500 0.875
## [3,] 0.250 0.625 1.000

We have not yet covered this, but give it a try and guess what might be a convenient and very short statement to compute the sum of all numbers in matrix X.

sum(X)
## [1] 4.5

2.2.1.4 Arrays

Arrays are simply higher-dimensional matrices. We will not make (prominent) use of arrays in this book.
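
For completeness, a minimal sketch of a three-dimensional array, created with the base R function array (the variable name a is illustrative). Filling and indexing work as for matrices, with one more index:

```r
# 24 numbers arranged as 2 rows x 3 columns x 4 "layers"
a <- array(1:24, dim = c(2, 3, 4))

dim(a)       # 2 3 4
a[, , 1]     # first layer: the 2 x 3 matrix filled with 1, ..., 6
a[1, 2, 3]   # 15: row 1, column 2 of the third layer
```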

2.2.1.5 Names for vectors, matrices and arrays

The positions in a vector can be given names. This is extremely useful for good “literate coding” and therefore highly recommended. The names of vector x’s positions are retrieved and set by the names function:10

students <- c("Jax", "Jamie", "Jason")  # names of students
grades <- c(1.3, 2.7, 2.0)              # a vector of grades
names(grades)                           # retrieve names: with no names so far
## NULL
names(grades) <- students               # assign names
names(grades)                           # retrieve names again: names assigned
## [1] "Jax"   "Jamie" "Jason"
grades                                  # output shows names
##   Jax Jamie Jason 
##   1.3   2.7   2.0

But we can also do this in one swoop, like so:

c(Jax = 1.3, Jamie = 2.7, Jason = 2.0)
##   Jax Jamie Jason 
##   1.3   2.7   2.0
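
Names are not only displayed when printing; they can also be used for indexing. A short sketch based on the grades vector from above:

```r
grades <- c(Jax = 1.3, Jamie = 2.7, Jason = 2.0)

grades["Jamie"]               # named entry: Jamie 2.7
grades[["Jamie"]]             # 2.7: just the value, without the name
names(grades)[grades < 2.5]   # "Jax" "Jason": who has a grade below 2.5?
```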

Names for matrices are retrieved or set with functions rownames and colnames.

# declare matrix
m <- matrix(1:6, ncol = 3)  
# assign row and column names, using function
# `str_c` which is described below
rownames(m) <- str_c("row", 1:nrow(m), sep = "_")
colnames(m) <- str_c("col", 1:ncol(m), sep = "_")
m
##       col_1 col_2 col_3
## row_1     1     3     5
## row_2     2     4     6

2.2.2 Booleans

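There are built-in names for Boolean values “true” and “false”, predictably named TRUE and FALSE. Shortcuts are T and F, but unlike TRUE and FALSE these are ordinary variables that can be overwritten, so the long forms are safer in code you want to rely on. If we attempt to do math with Boolean vectors, the outcome is what any reasonable logician would expect: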

x <- c(T, F, T)
1 - x
## [1] 0 1 0
x + 3
## [1] 4 3 4

Boolean vectors can be used as index sets to extract elements from other vectors.

# vector 1, 2, ..., 5
number_vector  <- 1:5           
# index of odd numbers set to `TRUE`
boolean_vector <- c(T, F, T, F, T)  
# returns the elements from number vector, for which
# the corresponding element in the Boolean vector is true
number_vector[boolean_vector] 
## [1] 1 3 5
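
In practice, the Boolean index vector is rarely typed by hand. More commonly it is computed by a comparison, as in this minimal sketch:

```r
number_vector <- 1:5

is_odd <- number_vector %% 2 == 1   # TRUE FALSE TRUE FALSE TRUE
number_vector[is_odd]               # 1 3 5

number_vector[number_vector > 3]    # 4 5: entries bigger than 3
which(number_vector > 3)            # 4 5: positions of those entries
sum(number_vector > 3)              # 2: how many entries are bigger than 3
```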

2.2.3 Special values

There are a couple of keywords reserved in R for special kinds of objects:

  • NA: “not available”; represents missing values in data
  • NaN: “not a number”; e.g., the result of dividing zero by zero
  • Inf or -Inf: infinity and negative infinity; returned when a number is too big or when a non-zero number is divided by zero
  • NULL: the NULL object; often returned when a function is undefined for the provided input
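
The following sketch shows where these special values come from and how to test for them, using base R only (the variable name measurements is hypothetical):

```r
0 / 0      # NaN
1 / 0      # Inf
-1 / 0     # -Inf

# hypothetical measurements with one missing value
measurements <- c(1, NA, 3)
is.na(measurements)                # FALSE TRUE FALSE
mean(measurements)                 # NA: missing values are contagious
mean(measurements, na.rm = TRUE)   # 2: ignore missing values explicitly

is.null(NULL)   # TRUE
length(NULL)    # 0: the NULL object has no elements
```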

2.2.4 Characters (= strings)

Strings are called characters in R. We will be stubborn and call them strings most of the time here. We can assign a string value to a variable by putting the string in double quotes:

x <- "huhu"
typeof(x)
## [1] "character"

We can create vectors of characters in the obvious way:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
chr_vector
## [1] "huhu"  "hello" "huhu"  "ciao"

The package stringr from the tidyverse also provides very useful and, in comparison to base R, more uniform functions for string manipulation. The cheat sheet for the stringr package is highly recommended for a quick overview. Below are some examples.

Function str_c concatenates strings:

str_c("Hello", "Hi", "Hey", sep = "! ")
## [1] "Hello! Hi! Hey"

We can find the indices of matches in a character vector with str_which:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_which(chr_vector, "hu")
## [1] 1 3

Similarly, str_detect returns a Boolean vector that indicates which elements contain a match:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_detect(chr_vector, "hu")
## [1]  TRUE FALSE  TRUE FALSE

If we want to get the strings matching a pattern, we can use str_subset:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_subset(chr_vector, "hu")
## [1] "huhu" "huhu"

Replacing all matches with another string works with str_replace_all:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
str_replace_all(chr_vector, "h", "B")
## [1] "BuBu"  "Bello" "BuBu"  "ciao"

For data preparation, we often need to split strings at a particular character. For instance, a set of reaction times could be stored in a single string, separated by the vertical bar “|”. We can split this string representation to get individual measurements like so:

# three measures of reaction time in a single string
reaction_times <- "123|234|345"
# notice that we need to doubly (!) escape character |
# notice also that the result is a list (see below)
str_split(reaction_times, "\\|", n = 3)
## [[1]]
## [1] "123" "234" "345"
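
The pieces obtained by splitting are still strings. To compute with them, we extract the vector from the list and cast it to numbers. The following sketch deliberately uses the base R function strsplit instead of str_split, so that it runs without loading any package; with fixed = TRUE the separator is taken literally and needs no escaping:

```r
reaction_times <- "123|234|345"

# base R counterpart of `str_split`; returns a list with one vector
pieces <- strsplit(reaction_times, "|", fixed = TRUE)[[1]]
pieces                     # "123" "234" "345"

rt_numeric <- as.numeric(pieces)
rt_numeric                 # 123 234 345
mean(rt_numeric)           # 234
```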

2.2.5 Factors

Factors are special vectors, which treat their elements as instances of a finite set of categories. To create a factor, we can use the function factor. The following code creates a factor from a character vector. Notice that, when printing, we get information about the kinds of entries (= categories) that occurred in the original character vector:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
factor(chr_vector)
## [1] huhu  hello huhu  ciao 
## Levels: ciao hello huhu

For plotting or other representational purposes, it can help to manually specify an ordering on the levels of a factor using the levels argument:

# the order of levels is changed manually
factor(chr_vector, levels = c("huhu", "ciao", "hello"))
## [1] huhu  hello huhu  ciao 
## Levels: huhu ciao hello

Even though we specified an ordering among factor levels, the last code chunk nonetheless creates what R treats as an unordered factor. There are also genuine ordered factors. An ordered factor is created by setting the argument ordered = T, and optionally also specifying a specific ordering of factor levels, like so:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
factor(
  chr_vector,    # the vector to treat as factor
  ordered = T,   # make sure it's treated as ordered factor
  levels = c("huhu", "ciao", "hello")  # specify order of levels by hand
)
## [1] huhu  hello huhu  ciao 
## Levels: huhu < ciao < hello

Having both unordered and ordered factors is useful for representing data from experiments, e.g., from categorical or ordinal variables (see Chapter 3). The difference between an unordered factor with explicit ordering information and an ordered factor is subtle and not important in the beginning. (This only matters, for example, in the context of regression modeling.)
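
To make the difference a bit more tangible, here is a minimal sketch using base R only (the variable names f_unordered and f_ordered are illustrative). Both factors store the same levels in the same order, but only the ordered factor supports comparisons like "smaller than":

```r
chr_vector  <- c("huhu", "hello", "huhu", "ciao")
lvls        <- c("huhu", "ciao", "hello")
f_unordered <- factor(chr_vector, levels = lvls)
f_ordered   <- factor(chr_vector, levels = lvls, ordered = TRUE)

levels(f_unordered)       # "huhu" "ciao" "hello"
as.integer(f_unordered)   # 1 3 1 2: position of each entry's level
is.ordered(f_unordered)   # FALSE
is.ordered(f_ordered)     # TRUE
f_ordered < "hello"       # TRUE FALSE TRUE TRUE
```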

Factors are trickier to work with than mere vectors because they are rigid about the represented factor levels. Adding an item that does not belong to any of a factor’s levels leads to trouble:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
my_factor <- factor(
  chr_vector,    # the vector to treat as factor
  ordered = T,   # make sure it's treated as ordered factor
  levels = c("huhu", "ciao", "hello")  # specify order of levels
)
my_factor[5] <- "huhu"  # adding a "known category" is okay
my_factor[6] <- "moin"  # adding an "unknown category" yields NA (with a warning)
my_factor
## [1] huhu  hello huhu  ciao  huhu  <NA> 
## Levels: huhu < ciao < hello

The forcats package from the tidyverse helps in dealing with factors. You should check the Cheat Sheet for more helpful functionality. Here is an example of how to expand the levels of a factor:

chr_vector <- c("huhu", "hello", "huhu", "ciao")
my_factor <- factor(
  chr_vector,    # the vector to treat as factor
  ordered = T,   # make sure it's treated as ordered factor
  levels = c("huhu", "ciao", "hello")  # specify order of levels
)
my_factor[5] <- "huhu"  # adding a "known category" is okay
my_factor <- fct_expand(my_factor, "moin") # add new category
my_factor[6] <- "moin"  # adding new item now works
my_factor
## [1] huhu  hello huhu  ciao  huhu  moin 
## Levels: huhu < ciao < hello < moin
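
For comparison, a sketch of how levels can be expanded in base R, without the forcats package: we append the new level to the vector of existing levels. This is an alternative shown for illustration, not the approach used in the rest of this section.

```r
my_factor <- factor(
  c("huhu", "hello", "huhu", "ciao"),
  ordered = TRUE,
  levels = c("huhu", "ciao", "hello")
)
# append the new level "moin" to the existing levels
levels(my_factor) <- c(levels(my_factor), "moin")
my_factor[5] <- "moin"   # adding the new item now works
levels(my_factor)        # "huhu" "ciao" "hello" "moin"
```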

It is sometimes useful (especially for plotting) to flexibly reorder the levels of an ordered factor. Here are some useful functions from the forcats package:

my_factor               # original factor
## [1] huhu  hello huhu  ciao  huhu  moin 
## Levels: huhu < ciao < hello < moin
fct_rev(my_factor)      # reverse level order 
## [1] huhu  hello huhu  ciao  huhu  moin 
## Levels: moin < hello < ciao < huhu
fct_relevel(            # manually supply new level order 
  my_factor,
  c("hello", "ciao", "huhu")
)      
## [1] huhu  hello huhu  ciao  huhu  moin 
## Levels: hello < ciao < huhu < moin

2.2.6 Lists, data frames & tibbles

Lists are ordered collections of elements that can be given names, so that they can be used like key-value pairs. They are created with the built-in function list. The difference between a list and a named vector is that in the latter, all elements must be of the same type. In a list, the elements can be of arbitrary type. They can also be vectors or even lists themselves. For example:

my_list <- list(
  single_number = 42,
  chr_vector    = c("huhu", "ciao"),
  nested_list   = list(x = 1, y = 2, z = 3) 
)
my_list
## $single_number
## [1] 42
## 
## $chr_vector
## [1] "huhu" "ciao"
## 
## $nested_list
## $nested_list$x
## [1] 1
## 
## $nested_list$y
## [1] 2
## 
## $nested_list$z
## [1] 3

To access a list element by its name (= key), we can use the $ sign followed by the unquoted name, double square brackets [[ "name" ]] with the quoted name inside, or indices in double brackets, like so:

# all of these return the same list element
my_list$chr_vector
## [1] "huhu" "ciao"
my_list[["chr_vector"]]
## [1] "huhu" "ciao"
my_list[[2]]
## [1] "huhu" "ciao"
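
Single square brackets also work on lists, but they behave differently: they return a (shorter) list, not the element itself. A minimal sketch, which also shows access to nested elements and how to add a new entry (the name new_entry is made up for illustration):

```r
my_list <- list(
  single_number = 42,
  chr_vector    = c("huhu", "ciao"),
  nested_list   = list(x = 1, y = 2, z = 3)
)

typeof(my_list["chr_vector"])     # "list": a list with one element
typeof(my_list[["chr_vector"]])   # "character": the element itself
my_list$nested_list$y             # 2: access inside a nested list

my_list$new_entry <- TRUE         # assigning to a new name adds an element
names(my_list)
```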

Lists are very important in R because almost all structured data that belongs together is stored as lists. Many objects (e.g., the fitted models returned by lm) are special kinds of lists. Data is stored in special kinds of lists, so-called data frames or so-called tibbles.

A data frame is base R’s standard format to store data in. A data frame is a list of vectors of equal length. Data frames are created with the function data.frame:

# fake experimental data
exp_data <- data.frame(
  trial = 1:5,
  condition = factor(
    c("C1", "C2", "C1", "C3", "C2"),
    ordered = T
  ),
  response = c(121, 133, 119, 102, 156)
)
exp_data
##   trial condition response
## 1     1        C1      121
## 2     2        C2      133
## 3     3        C1      119
## 4     4        C3      102
## 5     5        C2      156

Exercise 2.7

Create a vector a that contains the names of three of your best (imaginary) friends and a vector b with their (imaginary) age. Create a data frame that represents this information (one column with names and one with respective age). Notice that column names should represent the information they contain!

a <- c("M", "N", "H")
b <- c(23, 41, 13)
best_friends <- data.frame(name = a, age = b)
best_friends
##   name age
## 1    M  23
## 2    N  41
## 3    H  13

We can access columns of a data frame, just like we access elements in a list. Additionally, we can also use index notation, like in a matrix:

# gives the value of the cell in row 2, column 3
exp_data[2, 3] # returns 133
## [1] 133
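
The different ways of accessing data frames can be combined. The following sketch repeats the fake experimental data from above, extracts a column in three equivalent ways, and selects rows with a Boolean vector (the variable name c1_rows is illustrative):

```r
exp_data <- data.frame(
  trial = 1:5,
  condition = factor(c("C1", "C2", "C1", "C3", "C2"), ordered = TRUE),
  response = c(121, 133, 119, 102, 156)
)

# three ways to get the column `response` as a vector
exp_data$response
exp_data[["response"]]
exp_data[, "response"]

# all rows from condition C1 (column index left empty = all columns)
c1_rows <- exp_data[exp_data$condition == "C1", ]
c1_rows$trial            # 1 3
mean(c1_rows$response)   # 120
```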

Exercise 2.8

Display the column of names of your (imaginary) friends from the best_friends data frame.

best_friends["name"] 
##   name
## 1    M
## 2    N
## 3    H
best_friends[1] 
##   name
## 1    M
## 2    N
## 3    H

Now show only the names of friends who are not older than 22 (or some other age that makes sense for your friends and their ages). [Hint: you can write x <= 22 to get a Boolean vector of the same length as x with an entry TRUE at all indices where x is no bigger than 22.]

best_friends[best_friends$age <= 22, "name"]
## [1] "H"

In RStudio, you can inspect data in data frames (and tibbles, see below) with the function View.

Tibbles are the tidyverse counterpart of data frames. We can cast a data frame into a tibble, using as_tibble. Notice that the information shown for a tibble is much richer than what is provided when printing the content of a data frame.

as_tibble(exp_data)
## # A tibble: 5 × 3
##   trial condition response
##   <int> <ord>        <dbl>
## 1     1 C1             121
## 2     2 C2             133
## 3     3 C1             119
## 4     4 C3             102
## 5     5 C2             156

We can also create a tibble directly with the function tibble. Indeed, the creation of tibbles is conveniently more flexible than the creation of data frames: the former allows dynamic look-up of previously defined elements.

my_tibble    <- tibble(x = 1:10, y = x^2)      # dynamic construction possible
my_dataframe <- data.frame(x = 1:10, y = x^2)  # ERROR :/
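
If a data frame is needed nonetheless, the dependent column can be computed in a separate step. Two possible workarounds in base R, as a sketch (the variable names x_values, df_1 and df_2 are made up for illustration):

```r
# option 1: define the vector first, then use it twice
x_values <- 1:10
df_1 <- data.frame(x = x_values, y = x_values^2)

# option 2: create the data frame first, then add a column
# that is computed from an existing column
df_2 <- transform(data.frame(x = 1:10), y = x^2)

identical(df_1, df_2)   # TRUE: both routes give the same data frame
```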

Another important difference between data frames and tibbles concerns the default treatment of character (= string) vectors. When reading in data from a CSV file as a data frame (using function read.csv), R versions before 4.0.0 treated each character vector as a factor by default; since R 4.0.0, the default is stringsAsFactors = FALSE, so that character vectors stay character vectors unless we ask for factors. When using read_csv to read CSV data into a tibble, character vectors are not treated as factors by default either.
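
The conversion performed by read.csv is controlled by its argument stringsAsFactors. The following sketch sets this argument explicitly, so that the result does not depend on the default of the installed R version. The CSV content is made up and passed in as text instead of being read from a file:

```r
# hypothetical CSV content, supplied as a string instead of a file
csv_text <- "name,age\nM,23\nN,41"

df_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
df_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(df_factors$name)   # "factor"
class(df_strings$name)   # "character"
```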

There is also a very convenient function, called tribble, which allows you to create a tibble by explicitly writing out the information in the rows.

hw_points <- tribble(
  ~hw_nr,       ~Jax,   ~Jamie,   ~Jason,
  "HW1",        33,     24,       17,
  "HW2",        41,     23,       8
)
hw_points
## # A tibble: 2 × 4
##   hw_nr   Jax Jamie Jason
##   <chr> <dbl> <dbl> <dbl>
## 1 HW1      33    24    17
## 2 HW2      41    23     8

Exercise 2.9

Assign to the variable bff a tibble with the following columns (with reasonable names): at least four names of your (imaginary) best friends, their current country of residence, their age, and a Boolean column storing whether they are not older than 23. Ideally, use dynamic construction and the <= operator as in previous exercises.

bff <- tibble(
  name = c("A", "B", "C", "D"),
  residence = c("UK", "JP", "CH", "JA"),
  age = c(24, 45, 72, 12),
  young = age <= 23
)
bff
## # A tibble: 4 × 4
##   name  residence   age young
##   <chr> <chr>     <dbl> <lgl>
## 1 A     UK           24 FALSE
## 2 B     JP           45 FALSE
## 3 C     CH           72 FALSE
## 4 D     JA           12 TRUE

  7. If you are familiar with Python’s scipy and numpy packages: the vectorized treatment of numerical information that you know from there is R’s default mode.

  8. Python’s numpy, on the other hand, uses the reverse row-major mode by default.

  9. It is in this sense that the “first index moves fastest” in column-major mode, which is another frequently given explanation of column-major mode.

  10. Notice that we can create strings (actually called ‘characters’ in R) with double quotes.