Agenda



1. REVIEW: Types of variables


Dependent variable –> ? Independent variable –> ?



2. REVIEW: Factorial Designs

Read more stuff about factorial designs: here.



3. Defining “Messy Data”


Data and how it arrives at our doorstep is not always easily accessible or interpretable. It often needs to be transformed into a representation in which we can analyze (plotting, statistical analyses) it better. Data can also be full of superflous (to our purposes) information, which then needs to be “cleaned”. All of this happens in the domain of “preprocessing”: what happens to our data pre-analysis.

In the lecture, the notion of “tidy data” was introduced. What does this mean?

Following Wickham (2014), a tidy representation of (rectangular) data is defined as one where

- each variable forms a column, - each observation forms a row, and - each type of observational unit forms a table.

- bus also no redundancy

Any data set that is not tidy, we will call “messy data”.



4. First steps: Read in Data


data import cheat sheet

How to manually read in data: Import Dataset -> From Text (readr) -> Browse -> Adjust Parameters -> Import

Exercise:

Let’s try it once together: Download the data set “berlin_airbnb.csv” containing airbnb data from: StudIP -> files -> scroll all the way down!

# Loading a NEW DATASET
# checking the working directory   
getwd()
## [1] "/home/tah/Documents/IDA2019/tutorials"
# you can also use "setwd(dir)" to point your R path torwards your desired working directory or 
# speficy a path in the "read_csv()" function directly 

# Reading the file       
berlin_bnb <- read_csv("./berlin_airbnb.csv")
glimpse(berlin_bnb)
## Observations: 24,422
## Variables: 15
## $ id                             <dbl> 1944, 2015, 3176, 3309, 6883, 707…
## $ name                           <chr> "cafeheaven Pberg/Mitte/Wed for t…
## $ host_id                        <dbl> 2164, 2217, 3718, 4108, 16149, 17…
## $ host_name                      <chr> "Lulah", "Ion", "Britta", "Jana",…
## $ Neighbourhood                  <chr> "Brunnenstr. Nord, Mitte", "Brunn…
## $ latitude                       <dbl> 52.54425, 52.53454, 52.53500, 52.…
## $ longitude                      <dbl> 13.39749, 13.40256, 13.41758, 13.…
## $ room_type                      <chr> "Private room", "Entire home/apt"…
## $ price                          <dbl> 21, 60, 90, 28, 125, 33, 180, 70,…
## $ minimum_nights                 <dbl> 120, 4, 62, 7, 3, 2, 6, 90, 30, 6…
## $ number_of_reviews              <dbl> 18, 127, 145, 27, 126, 253, 7, 24…
## $ last_review                    <date> 2018-11-11, 2019-09-05, 2019-06-…
## $ reviews_per_month              <dbl> 0.25, 3.03, 1.16, 0.36, 1.08, 2.0…
## $ calculated_host_listings_count <dbl> 1, 5, 1, 1, 1, 2, 1, 4, 1, 2, 1, …
## $ availability_365               <dbl> 364, 170, 349, 262, 7, 33, 37, 30…



5. Data Wrangling


If a data set contains too much information for your current purposes, you can discard irrelevant (or unhelpful) rows and columns.

5.1. Selecting columns

The function select allows to pick out a subset of columns. Interestingly, it can also be used to reorder columns, because the order in which column names are specified matches the order in the returned tibble.

By Selecting columns that we want, in the order that we want them:

id, name, host_id, host_name, room_type, price,Neighbourhood, number_of_reviews, last_review, reviews_per_month, availability_365.

berlin_bnb_proc <- berlin_bnb %>% 
  select(id, name, host_id, host_name, room_type, price, Neighbourhood, number_of_reviews, last_review, reviews_per_month, availability_365)

By selecting columns that we do not want:

Suppose we do not want columns that have the ID info of the place and the host, i.e. id, host_id.

berlin_bnb_proc <- berlin_bnb %>% 
  select(-id, -host_id) 

5.2. Tidy selection of column names

To select the colums in several functions within the tidyverse, such as pivot_longer() or select(), there are other useful helper functions from the tidyselect package. Here are some examples.

To select columns using their indices:

berlin_bnb_proc %>% 
  select(2:5)
## # A tibble: 24,422 x 4
##    name                                  host_id host_name    room_type    
##    <chr>                                   <dbl> <chr>        <chr>        
##  1 cafeheaven Pberg/Mitte/Wed for the s…    2164 Lulah        Private room 
##  2 Berlin-Mitte Value! Quiet courtyard/…    2217 Ion          Entire home/…
##  3 Fabulous Flat in great Location          3718 Britta       Entire home/…
##  4 BerlinSpot Schöneberg near KaDeWe        4108 Jana         Private room 
##  5 Stylish East Side Loft in Center wit…   16149 Steffen      Entire home/…
##  6 BrightRoom with sunny greenview!        17391 BrightRoom   Private room 
##  7 Geourgeous flat - outstanding views     33852 Philipp      Entire home/…
##  8 Apartment in Prenzlauer Berg            55531 Chris + Oli… Entire home/…
##  9 APARTMENT TO RENT                       59666 Melanie      Private room 
## 10 In the Heart of Berlin - Kreuzberg      64696 Rene         Entire home/…
## # … with 24,412 more rows

To select columns starting with some “String”:

berlin_bnb_proc %>% 
  select(starts_with("host"))
## # A tibble: 24,422 x 2
##    host_id host_name     
##      <dbl> <chr>         
##  1    2164 Lulah         
##  2    2217 Ion           
##  3    3718 Britta        
##  4    4108 Jana          
##  5   16149 Steffen       
##  6   17391 BrightRoom    
##  7   33852 Philipp       
##  8   55531 Chris + Oliver
##  9   59666 Melanie       
## 10   64696 Rene          
## # … with 24,412 more rows

To select columns ending with some “String”:

berlin_bnb_proc %>% 
  select(ends_with("id"))
## # A tibble: 24,422 x 2
##       id host_id
##    <dbl>   <dbl>
##  1  1944    2164
##  2  2015    2217
##  3  3176    3718
##  4  3309    4108
##  5  6883   16149
##  6  7071   17391
##  7  9991   33852
##  8 14325   55531
##  9 16401   59666
## 10 16644   64696
## # … with 24,412 more rows

To select columns containing some “String”:

berlin_bnb_proc %>% 
  select(matches('review'))
## # A tibble: 24,422 x 3
##    number_of_reviews last_review reviews_per_month
##                <dbl> <date>                  <dbl>
##  1                18 2018-11-11               0.25
##  2               127 2019-09-05               3.03
##  3               145 2019-06-27               1.16
##  4                27 2019-05-31               0.36
##  5               126 2019-09-08               1.08
##  6               253 2019-09-06               2.06
##  7                 7 2019-07-15               0.14
##  8                24 2019-07-01               0.21
##  9                 0 NA                      NA   
## 10                48 2017-12-14               0.42
## # … with 24,412 more rows

You can also use this select(matches()) command to select columns with multiple string matches. To select all columns containing ‘host’ or ‘review’ in their names:

berlin_bnb_proc %>% 
  select(matches('host|review'))
## # A tibble: 24,422 x 5
##    host_id host_name      number_of_reviews last_review reviews_per_month
##      <dbl> <chr>                      <dbl> <date>                  <dbl>
##  1    2164 Lulah                         18 2018-11-11               0.25
##  2    2217 Ion                          127 2019-09-05               3.03
##  3    3718 Britta                       145 2019-06-27               1.16
##  4    4108 Jana                          27 2019-05-31               0.36
##  5   16149 Steffen                      126 2019-09-08               1.08
##  6   17391 BrightRoom                   253 2019-09-06               2.06
##  7   33852 Philipp                        7 2019-07-15               0.14
##  8   55531 Chris + Oliver                24 2019-07-01               0.21
##  9   59666 Melanie                        0 NA                      NA   
## 10   64696 Rene                          48 2017-12-14               0.42
## # … with 24,412 more rows

If you have columns in your dataset that begin with a certain string, say ‘Week_’ and have integers following it, like: ‘Week_01’, ‘Week_02’, and so on, you may use:

your_dataset %>% 
  select(num_range("Week_0", 1:5))

5.3. Slicing

To select rows by an index or a vector of indices, use the slice function:

berlin_bnb_proc %>% 
  # keep only entries from rows with an even index from 2 to 6
  slice(c(2,4,6))
## # A tibble: 3 x 11
##      id name  host_id host_name room_type price Neighbourhood
##   <dbl> <chr>   <dbl> <chr>     <chr>     <dbl> <chr>        
## 1  2015 Berl…    2217 Ion       Entire h…    60 Brunnenstr. …
## 2  3309 Berl…    4108 Jana      Private …    28 Schöneberg-N…
## 3  7071 Brig…   17391 BrightRo… Private …    33 Helmholtzpla…
## # … with 4 more variables: number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, availability_365 <dbl>
berlin_bnb_proc %>% 
  # for getting the last entry in the dataset
  slice(n())
## # A tibble: 1 x 11
##       id name  host_id host_name room_type price Neighbourhood
##    <dbl> <chr>   <dbl> <chr>     <chr>     <dbl> <chr>        
## 1 3.87e7 Good…  6.04e7 Aline     Entire h…    70 Friedrichsfe…
## # … with 4 more variables: number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, availability_365 <dbl>

5.4. Pulling a variable

One can use the ‘$’ operator to pull one column/variable out of the dataset like:

# pulling the variable 'name' from the dataset that contains the names of all the places in the berlin airbnb database
berlin_bnb_proc$name

Equivalently, one can also use the pull function, like:

# pull out the 'name' variable from the dataset
berlin_bnb_proc%>% pull(name)
# pull out the 'name' variable from the dataset and compute its length
berlin_bnb_proc %>% pull(name) %>% length
## [1] 24422
# pull out the 'name' variable from the dataset and compute the number of unique entries
berlin_bnb_proc %>% pull(name) %>% unique %>% length
## [1] 23719

5.5. Filtering

The function filter takes a Boolean expression and returns only those rows of which the Boolean expression is true.

Suppose we want to have some info only about rooms that are of the shared type:

unique(berlin_bnb_proc$room_type)
## [1] "Private room"    "Entire home/apt" "Hotel room"      "Shared room"
berlin_bnb_proc %>% 
  # keep only entries where the rooms are shared
  filter(room_type=="Shared room") %>% 
  # show only the host_id and host_name columns for the filtered query
  select(matches('host')) %>% 
  # If you also want the total count of the shared rooms in your database
   print
## # A tibble: 300 x 2
##     host_id host_name           
##       <dbl> <chr>               
##  1  1391772 Sunflower Hostel    
##  2  7345295 Marleen             
##  3  8250486 Singer Hostel Berlin
##  4  8250486 Singer Hostel Berlin
##  5  8250486 Singer Hostel Berlin
##  6  8250486 Singer Hostel Berlin
##  7 10402919 Uwe                 
##  8 15374065 Zahari              
##  9 18700789 Lena                
## 10 21239469 Stefan              
## # … with 290 more rows
Exercise:

Filter out all the places (names and ids) in ‘berlin_bnb_proc’ which have 0 reviews and store the new dataset in ‘berlin_bnb_notrecomm’.

5.6. Adding new columns

To add a new column, or to change an existing one use mutate().

Suppose you want to label the places if they are expensive or not. One way to do it could be to create a new (logical) column Expensive that stores TRUE if the place’s price is more than 100 and FALSE otherwise.

berlin_bnb_proc %>% 
  mutate(
    # add a new column called 'Expensive' based on the price
    'expensive' = price >= 100
  )
## # A tibble: 24,422 x 12
##       id name  host_id host_name room_type price Neighbourhood
##    <dbl> <chr>   <dbl> <chr>     <chr>     <dbl> <chr>        
##  1  1944 cafe…    2164 Lulah     Private …    21 Brunnenstr. …
##  2  2015 Berl…    2217 Ion       Entire h…    60 Brunnenstr. …
##  3  3176 Fabu…    3718 Britta    Entire h…    90 Prenzlauer B…
##  4  3309 Berl…    4108 Jana      Private …    28 Schöneberg-N…
##  5  6883 Styl…   16149 Steffen   Entire h…   125 Frankfurter …
##  6  7071 Brig…   17391 BrightRo… Private …    33 Helmholtzpla…
##  7  9991 Geou…   33852 Philipp   Entire h…   180 Prenzlauer B…
##  8 14325 Apar…   55531 Chris + … Entire h…    70 Prenzlauer B…
##  9 16401 APAR…   59666 Melanie   Private …   120 Frankfurter …
## 10 16644 In t…   64696 Rene      Entire h…    90 nördliche Lu…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## #   expensive <lgl>

Now suppose you want to create a new column Price_range that contains keywords:

  • cheap if the price is less than or equal to 50
  • moderate if the price is greater than 50 but less than or equal to 100
  • expensive if the price is greater than 100

You can use the function case_when() that allows you to vectorise multiple if_else() statements as below:

berlin_bnb_proc %>% 
  mutate(
    # add a new column called 'Expensive' based on the price
    Price_range = case_when(
      price<=50 ~ 'cheap',
      price>50&price<=100 ~ 'moderate',
      price>100 ~ 'expensive'
    )
  )
## # A tibble: 24,422 x 12
##       id name  host_id host_name room_type price Neighbourhood
##    <dbl> <chr>   <dbl> <chr>     <chr>     <dbl> <chr>        
##  1  1944 cafe…    2164 Lulah     Private …    21 Brunnenstr. …
##  2  2015 Berl…    2217 Ion       Entire h…    60 Brunnenstr. …
##  3  3176 Fabu…    3718 Britta    Entire h…    90 Prenzlauer B…
##  4  3309 Berl…    4108 Jana      Private …    28 Schöneberg-N…
##  5  6883 Styl…   16149 Steffen   Entire h…   125 Frankfurter …
##  6  7071 Brig…   17391 BrightRo… Private …    33 Helmholtzpla…
##  7  9991 Geou…   33852 Philipp   Entire h…   180 Prenzlauer B…
##  8 14325 Apar…   55531 Chris + … Entire h…    70 Prenzlauer B…
##  9 16401 APAR…   59666 Melanie   Private …   120 Frankfurter …
## 10 16644 In t…   64696 Rene      Entire h…    90 nördliche Lu…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## #   Price_range <chr>

Note: Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general. For example, this won’t work:

berlin_bnb_proc %>% 
  mutate(
    # add a new column called 'Expensive' based on the price
    Price_range = case_when(
      price<=50 ~ 'cheap',
      price>50 ~ 'expensive',
      price>50&price<=100 ~ 'moderate'
    )
  )
## # A tibble: 24,422 x 12
##       id name  host_id host_name room_type price Neighbourhood
##    <dbl> <chr>   <dbl> <chr>     <chr>     <dbl> <chr>        
##  1  1944 cafe…    2164 Lulah     Private …    21 Brunnenstr. …
##  2  2015 Berl…    2217 Ion       Entire h…    60 Brunnenstr. …
##  3  3176 Fabu…    3718 Britta    Entire h…    90 Prenzlauer B…
##  4  3309 Berl…    4108 Jana      Private …    28 Schöneberg-N…
##  5  6883 Styl…   16149 Steffen   Entire h…   125 Frankfurter …
##  6  7071 Brig…   17391 BrightRo… Private …    33 Helmholtzpla…
##  7  9991 Geou…   33852 Philipp   Entire h…   180 Prenzlauer B…
##  8 14325 Apar…   55531 Chris + … Entire h…    70 Prenzlauer B…
##  9 16401 APAR…   59666 Melanie   Private …   120 Frankfurter …
## 10 16644 In t…   64696 Rene      Entire h…    90 nördliche Lu…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## #   Price_range <chr>

Note: NA values in the column do not get special treatment. If you want to explicitly handle NA values you can use the is.na function.

Exercise:

Add a column ‘popularity’ with keywords: ‘very popular’, ‘moderately popular’, ‘less popular’ and ‘Not recommended’ based on number of reviews :

  • very popular: >200 reviews
  • moderately: 50-200 reviews
  • less popular: <50
  • not recommended: =0 or NA.

5.7. Changing, Renaming an existing column

The strings for the types of rooms in the database are rather messy (e.g., ‘Entire room/apt’) and contain the word ‘room’ in every entry of the column room_type. So let us make it more elegant and reader-friendly, using mutate(), by deleting the word ‘room’ from all the entries and changing ‘Entire room/apt’ to ‘Apartment’.

Suppose you also wish to change the name of this column to type_of_room and give an ordering (ordered factor) to the type of room as: Apartment < Private < Hotel < Shared.

Note: If you want to indicate a fixed order of the recurring elements in a (character) vector, e.g., for plotting in a particular order or an experimental variable that is a factorial design, you should make this column an ordered factor.

unique(berlin_bnb_proc$room_type)
## [1] "Private room"    "Entire home/apt" "Hotel room"      "Shared room"
berlin_bnb_proc <- berlin_bnb_proc %>% 
  mutate(
    # change the entries in the column 'room_type'
    room_type = case_when(
      room_type=='Private room' ~ 'Private',
      room_type=='Entire home/apt' ~ 'Apartment',
      room_type=='Hotel room' ~ 'Hotel',
      room_type=='Shared room' ~ 'Shared'
    )
  ) %>% 
  # rename the column
  rename(type_of_room=room_type) %>% 
  # turn the column into an ordered factor
  mutate(
    type_of_room=factor(
      type_of_room,
      ordered=T,
      levels=c("Apartment","Private", "Hotel","Shared")
    )
  )
  
unique(berlin_bnb_proc$type_of_room)
## [1] Private   Apartment Hotel     Shared   
## Levels: Apartment < Private < Hotel < Shared
Exercise:

Change the name of the ‘price’ column to ‘price_per_night(in€)’.

5.8. Sorting

If you want to order a data set along a column, e.g., for inspection or printing as a table, then you can do that using the arrange() function. You can specify several columns to sort alpha-numerically in ascending order, and also indicate a descending order using the desc() function:

berlin_bnb_proc %>% 
  # arranges by price from lowest to highest, and equal price entries are decided by
  # the number of reviews: highest to lowest (desc())
  arrange(price, desc(number_of_reviews))
## # A tibble: 24,422 x 11
##        id name  host_id host_name type_of_room price Neighbourhood
##     <dbl> <chr>   <dbl> <chr>     <ord>        <dbl> <chr>        
##  1 1.86e7 Doub…  4.94e7 Svenja    Private          0 Marienfelde,…
##  2 2.07e7 Mode…  1.21e6 TRIO Apa… Apartment        0 Marzahn-Süd,…
##  3 2.04e7 Brig…  3.44e7 René      Private          0 Frankfurter …
##  4 2.82e7 Hote…  2.13e8 Hotel Th… Hotel            0 Südliche Fri…
##  5 2.04e7 *Nic…  8.10e7 Isabell   Private          0 Tempelhofer …
##  6 2.12e7 One …  1.02e7 Jeanne    Private          0 Frankfurter …
##  7 2.27e7 Schö…  1.67e8 Raafat    Private          1 Moabit Ost, …
##  8 1.63e7 Comf…  8.09e7 Paolo     Private          8 Südliche Fri…
##  9 2.45e7 Nice…  8.09e7 Paolo     Private          8 Südliche Fri…
## 10 2.19e7 Sing…  1.60e8 Dhia      Shared           8 Albrechtstr.…
## # … with 24,412 more rows, and 4 more variables: number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>
Exercise:

Arrange all the entries with ‘Pankow’ neighbourhood group and ‘Private’ type of room based on the number of reviews from highest to lowest and and the availability_365, also from highest to lowest. Display only the place names and the price.

5.9. Splitting and uniting columns

Suppose you have a dataset like below:

bnb_posting_ratings_messy <- 
  tribble(
    ~airbnb_posting,                                   ~rating,
    "Nice room near S-Bahnhof Schöneberg",              "4,4.5,3,4.5",
    "Spacious flat in the middle of Wedding",           "3.5,4,4,5",
    "Apartment for two people on Kudamm",               "5,4.5,5,4.5"
  ) %>% 
  print
## # A tibble: 3 x 2
##   airbnb_posting                         rating     
##   <chr>                                  <chr>      
## 1 Nice room near S-Bahnhof Schöneberg    4,4.5,3,4.5
## 2 Spacious flat in the middle of Wedding 3.5,4,4,5  
## 3 Apartment for two people on Kudamm     5,4.5,5,4.5

This is not a useful representation. The review scores for each apartment are stored together in a single column, separated by comma and furthermore stored as a character vector.

To disentangle information in a single column, you could use the separate() function:

bnb_posting_ratings_messy  %>% 
  separate(
    col = rating,                           # which column to split up
    into = str_c("guest_rating", 1:4),      # names of the new column to store results
    sep = ",",                              # separate by which character / reg-exp
    convert = T                           # convert = TRUE can automatically detect integer/numeric values
    ) %>%                           
  print
## # A tibble: 3 x 5
##   airbnb_posting    guest_rating1 guest_rating2 guest_rating3 guest_rating4
##   <chr>                     <dbl>         <dbl>         <int>         <dbl>
## 1 Nice room near S…           4             4.5             3           4.5
## 2 Spacious flat in…           3.5           4               4           5  
## 3 Apartment for tw…           5             4.5             5           4.5

If you have reason to perform the reverse operation, i.e., join together several columns, use the unite() function.

Exercise:

Look at the variable “Neighbourhood” in your dataset “berlin_bnb_proc”. Looks like we could split it into the two variables “address” and “district”.

5.10. Grouped operations

A frequently occurring problem in data analysis is to obtain a summary statistic. You may be interested in calculating the average price of lodging in a certain district. Also, arrange it in the ascending order of the price.

grouped_avg_price <-berlin_bnb_proc %>%
  # grouping by 'district'
  group_by(district) %>% 
  # summarizing based on the grouping
  summarize(                          # summarise() is typically used on grouped data created by
                                      # group_by().The output will have one row for each group.
                                       
    avg_price = mean(price)
  ) %>% 
  # arranging in ascending order of 'avg_price'
  arrange(avg_price)  %>% 
  print
## # A tibble: 13 x 2
##    district                  avg_price
##    <chr>                         <dbl>
##  1 Reinickendorf                  49.1
##  2 Neukölln                       50.1
##  3 Spandau                        57.7
##  4 Wartenberg und Falkenberg      57.8
##  5 Lichtenberg                    58.9
##  6 Treptow - Köpenick             59  
##  7 Marzahn - Hellersdorf          59.4
##  8 Steglitz - Zehlendorf          62.1
##  9 Friedrichshain-Kreuzberg       66.0
## 10 Pankow                         71.6
## 11 Mitte                          83.1
## 12 Tempelhof - Schöneberg         97.3
## 13 Charlottenburg-Wilm.          112.

We can also use “mutate()” with grouping.

berlin_bnb_proc %>% 
  group_by(district) %>% 
  # generating a new column "avg_price" containing mean prices "district"-wise.
  mutate(
    avg_price = mean(price)
  )
## # A tibble: 24,422 x 13
## # Groups:   district [13]
##       id name  host_id host_name type_of_room price address district
##    <dbl> <chr>   <dbl> <chr>     <ord>        <dbl> <chr>   <chr>   
##  1  1944 cafe…    2164 Lulah     Private         21 Brunne… Mitte   
##  2  2015 Berl…    2217 Ion       Apartment       60 Brunne… Mitte   
##  3  3176 Fabu…    3718 Britta    Apartment       90 Prenzl… Pankow  
##  4  3309 Berl…    4108 Jana      Private         28 Schöne… Tempelh…
##  5  6883 Styl…   16149 Steffen   Apartment      125 Frankf… Friedri…
##  6  7071 Brig…   17391 BrightRo… Private         33 Helmho… Pankow  
##  7  9991 Geou…   33852 Philipp   Apartment      180 Prenzl… Pankow  
##  8 14325 Apar…   55531 Chris + … Apartment       70 Prenzl… Pankow  
##  9 16401 APAR…   59666 Melanie   Private        120 Frankf… Friedri…
## 10 16644 In t…   64696 Rene      Apartment       90 nördli… Friedri…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## #   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## #   avg_price <dbl>

If you are looking for a short summary of the average price by district as computed above, then you just need to select the “avg_price” column from this new tibble and call summary(). This command will give you information on the average price in each district of Berlin like: the max.-min. value of the average price, etc.

berlin_bnb_proc %>% 
  group_by(district) %>% 
  summarize(
    avg_price = mean(price)
  ) %>% 
  select(avg_price) %>% summary() 
##    avg_price     
##  Min.   : 49.10  
##  1st Qu.: 57.80  
##  Median : 59.43  
##  Mean   : 67.99  
##  3rd Qu.: 71.64  
##  Max.   :111.72

Note: It is important to remember that after a call of group_by, the resulting tibbles retains the grouping information for all subsequent operations.

Exercise:

Filter out all places in ‘Pankow’ which are ‘expensive’, group them by the ‘Type_of_room’ and produce a summary of the average prices.

5.11. Combining tibbles

Imagine you have a set of data. You then obtain another set of data that shares one variable with the former but also additional variables and values.

In the examplatory mini data sets below, we see that the two tribbles share the column “religion”. To conjoin these two sets, we may use the function “full_join()”.

“full_join()” will retain all information from both columns, even if not all values within the two columns to be conjoined is perfectly congruent.

  • You can also check out subtypes of this function, such as “right_join()”, “semi_join()” etc.
religion <- tribble(
  ~religion,    ~income,      ~count,
  "Atheist",    ">50k",       58,
  "Buddhist",    ">50k",      21,
  "Muslim",      ">50k",      3,
  "Atheist",      ">20k",    70,
  "Buddhist",    ">20k",     234,
  "Muslim",  ">20k",         34)

religion2 <- tribble(               # additional table with religion as a variable
  ~religion,    ~savings,
  "Atheist",      ">30k",   
  "Buddhist",    ">20k",
  "Muslim",      ">40k"
  
)

full_join(religion,religion2, by = "religion")
## # A tibble: 6 x 4
##   religion income count savings
##   <chr>    <chr>  <dbl> <chr>  
## 1 Atheist  >50k      58 >30k   
## 2 Buddhist >50k      21 >20k   
## 3 Muslim   >50k       3 >40k   
## 4 Atheist  >20k      70 >30k   
## 5 Buddhist >20k     234 >20k   
## 6 Muslim   >20k      34 >40k

Exercise: Full join the two datasets “berlin_bnb_proc”, “grouped_avg_price” by “district”.

Note: This gives the same output as when you computed the avg_price variable in 5.10. using the mutate() command. This is because mutate() adds a new column to the dataset containing the summary statistic such that the computed statistic (avg_price in this case) is added (repeatedly for each group) to each entry of the dataset.

5.12. Pivoting

What if your data is too wide or too long? Pivoting can go two ways: Making your data longer or making it wider.

5.12.1. pivot_longer() function

  • “pivot_longer()” takes several columns and gathers the values of all cells in these columns into a single novel column, the so-called “value column” (the column with the values of the cells to be gathered)
  • To not lose information, “pivot_longer()” also creates a second new column, the so-called “name column”, the column with the names of the original columns that we gathered together
  • pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns

  • “pivot_longer()” needs three pieces of information:
    1. which columns to spin around (function argument cols)
    2. the name of the to-be-created new value column (function argument “values_to”")
    3. the name of the to-be-created new name column (function argument “names_to”")

Look at the dataset “religin_messy” (or alternatively, look at the inbuilt dataset “relig_income”, it is the exact same data). How could this data possibly be improved?

Exercise: Fill in the function for the arguments “cols”, “names_to” and “values_to” in the code below:

# data() loads all inbuild datasets 
relig_income
## # A tibble: 18 x 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Agnostic      27        34        60        81        76       137
##  2 Atheist       12        27        37        52        35        70
##  3 Buddhist      27        21        30        34        33        58
##  4 Catholic     418       617       732       670       638      1116
##  5 Don’t k…      15        14        15        11        10        35
##  6 Evangel…     575       869      1064       982       881      1486
##  7 Hindu          1         9         7         9        11        34
##  8 Histori…     228       244       236       238       197       223
##  9 Jehovah…      20        27        24        24        21        30
## 10 Jewish        19        19        25        25        30        95
## 11 Mainlin…     289       495       619       655       651      1107
## 12 Mormon        29        40        48        51        56       112
## 13 Muslim         6         7         9        10         9        23
## 14 Orthodox      13        17        23        32        32        47
## 15 Other C…       9         7        11        13        13        14
## 16 Other F…      20        33        40        46        49        63
## 17 Other W…       5         2         3         4         2         7
## 18 Unaffil…     217       299       374       365       341       528
## # … with 4 more variables: `$75-100k` <dbl>, `$100-150k` <dbl>,
## #   `>150k` <dbl>, `Don't know/refused` <dbl>
religin_longtidy <-relig_income %>% 
  pivot_longer(
    cols = #your code here,                              
    names_to = #your code here,
    values_to = #your code here
  ) %>% print

5.12.2. pivot_wider() function

  • The inverse transformation to “pivot_longer()” is “pivot_wider()”

  • “pivot_wider()” widens data, increasing the number of columns and decreasing the number of rows

  • “pivot_wider()” picks out two columns, one column of values to distribute into new to-be-created columns, and one vector of names or groups which contains the information about the names of the to-be-created new columns

I have artificially created a flawed dataset. How could we possibly improve it?

relig_income_toowide <- 
  tibble(religion = rep(c('Agnostic', 'Buddhist', 'Catholic'), times = 2),
         income =    rep(c('income/year', 'savings'), each = 3),
         count = c('20-30k','30-40k','50-75k','>40k','>20k','>10k'))

relig_income_toowide
## # A tibble: 6 x 3
##   religion income      count 
##   <chr>    <chr>       <chr> 
## 1 Agnostic income/year 20-30k
## 2 Buddhist income/year 30-40k
## 3 Catholic income/year 50-75k
## 4 Agnostic savings     >40k  
## 5 Buddhist savings     >20k  
## 6 Catholic savings     >10k

Exercise: Fill in the function for “names_from” and “values_from” in the code below:

religin_widetidy <-relig_income_toowide %>% 
    pivot_wider(
    names_from = #your code here,          
    values_from = #your code here)  %>%    
    print



6. Some Data Cleaning


We clean the data in two consecutive steps:

6.1. Removing Irrelevant Stuff

Suppose we do not want to have entries in our dataset of places with 0 days availability. Perhaps, these are hosts that have stopped offering the airbnb services and instead of removing their listings from the airbnb website, they chose to change their availability to 0. Hence:

Remove all data of the places who have 0 in their availability_365 and also ‘NA’ in their last_review column.

berlin_bnb_proc<-berlin_bnb_proc %>% 
  filter(is.na(last_review)==FALSE,
         availability_365!=0)

6.2. Cleaning by Popularity/Activity

Suppose you wish to classify all hosts/places from the dataset which have a low ‘reviews_per_month’ value (see below) and which haven’t been reviewed since 01-September-2019 as ‘NOT ACTIVE’ while the rest as ‘ACTIVE’ hosts.

But first, look at the possible values of ‘reviews_per_month’ in the dataset, using:

berlin_bnb_proc %>% 
  select(reviews_per_month) %>% summary()
##  reviews_per_month
##  Min.   : 0.02    
##  1st Qu.: 0.46    
##  Median : 1.08    
##  Mean   : 1.72    
##  3rd Qu.: 2.38    
##  Max.   :39.51

The range of ‘reviews_per_month’ varies between 0.02 and 39.51. Notice that if a place has less than 0.5 ‘reviews_per_month’, it means that on average this place was reviewed once every two months (indicating less activity of the host or less popularity of the place).

So, using this threshold value of 0.5 for the ‘reviews_per_month’, you can carry out the above filtering process!

Create a separate column in your dataset ‘Activity’ that stores this information. And then remove all ‘NOT ACTIVE’ places from the database and store this new dataset with all the information, except: id, host_id and Activity columns.

berlin_bnb_proc<- berlin_bnb_proc %>% 
  mutate(
    Activity = case_when(
      reviews_per_month<0.5 & last_review<as.Date("2019-09-01") ~ 'NOT ACTIVE',
      TRUE ~ 'ACTIVE'
    )
  ) %>% 
  filter(
    Activity=='ACTIVE'
  ) %>% 
  select(-id, -host_id, -Activity)

Glimpse at your dataset once again, before writing it to the system:

glimpse(berlin_bnb_proc)
## Observations: 8,204
## Variables: 10
## $ name              <chr> "Berlin-Mitte Value! Quiet courtyard/very cent…
## $ host_name         <chr> "Ion", "Britta", "Steffen", "BrightRoom", "Mat…
## $ type_of_room      <ord> Apartment, Apartment, Apartment, Private, Apar…
## $ price             <dbl> 60, 90, 125, 33, 49, 129, 70, 98, 160, 65, 209…
## $ address           <chr> "Brunnenstr. Süd", "Prenzlauer Berg Südwest", …
## $ district          <chr> "Mitte", "Pankow", "Friedrichshain-Kreuzberg",…
## $ number_of_reviews <dbl> 127, 145, 126, 253, 247, 74, 60, 76, 276, 129,…
## $ last_review       <date> 2019-09-05, 2019-06-27, 2019-09-08, 2019-09-0…
## $ reviews_per_month <dbl> 3.03, 1.16, 1.08, 2.06, 2.12, 0.87, 0.55, 0.67…
## $ availability_365  <dbl> 170, 349, 7, 33, 269, 148, 109, 107, 233, 194,…



7. Writing File back to the System


getwd()
## [1] "/home/tah/Documents/IDA2019/tutorials"
write_csv(berlin_bnb_proc, "./berlin_airbnb_processed.csv")



Questions? Please ask!