read_csv()
and write_csv()
functions in tidyverse.Dependent variable –> ? Independent variable –> ?
most common approach to include multiple independent variables in one experiment
independent variable in the context of factorial design can also be called a factor
in a factorial design, each level of one independent variable is combined with each level of the others to produce all possible combinations
each combination then becomes a condition in the experiment
a factorial design can include any number of independent variables with any number of levels. For example, an experiment could include the type of psychotherapy (cognitive vs. behavioral), the length of the psychotherapy (2 weeks vs. 2 months), and the sex of the psychotherapist (female vs. male)
This would be a 2 x 2 x 2 factorial design and would have eight conditions
In practice, it is unusual for there to be more than three independent variables with more than two or three levels each: strong increase of conditions with each added factor and need for participants to fill them
Read more stuff about factorial designs: here.
Data and how it arrives at our doorstep is not always easily accessible or interpretable. It often needs to be transformed into a representation in which we can analyze (plotting, statistical analyses) it better. Data can also be full of superflous (to our purposes) information, which then needs to be “cleaned”. All of this happens in the domain of “preprocessing”: what happens to our data pre-analysis.
In the lecture, the notion of “tidy data” was introduced. What does this mean?
Following Wickham (2014), a tidy representation of (rectangular) data is defined as one where
- each variable forms a column, - each observation forms a row, and - each type of observational unit forms a table.
- bus also no redundancy
Any data set that is not tidy, we will call “messy data”.
How to manually read in data: Import Dataset -> From Text (readr) -> Browse -> Adjust Parameters -> Import
Let’s try it once together: Download the data set “berlin_airbnb.csv” containing airbnb data from: StudIP -> files -> scroll all the way down!
# Loading a NEW DATASET
# checking the working directory
getwd()
## [1] "/home/tah/Documents/IDA2019/tutorials"
# you can also use "setwd(dir)" to point your R path torwards your desired working directory or
# speficy a path in the "read_csv()" function directly
# Reading the file
berlin_bnb <- read_csv("./berlin_airbnb.csv")
glimpse(berlin_bnb)
## Observations: 24,422
## Variables: 15
## $ id <dbl> 1944, 2015, 3176, 3309, 6883, 707…
## $ name <chr> "cafeheaven Pberg/Mitte/Wed for t…
## $ host_id <dbl> 2164, 2217, 3718, 4108, 16149, 17…
## $ host_name <chr> "Lulah", "Ion", "Britta", "Jana",…
## $ Neighbourhood <chr> "Brunnenstr. Nord, Mitte", "Brunn…
## $ latitude <dbl> 52.54425, 52.53454, 52.53500, 52.…
## $ longitude <dbl> 13.39749, 13.40256, 13.41758, 13.…
## $ room_type <chr> "Private room", "Entire home/apt"…
## $ price <dbl> 21, 60, 90, 28, 125, 33, 180, 70,…
## $ minimum_nights <dbl> 120, 4, 62, 7, 3, 2, 6, 90, 30, 6…
## $ number_of_reviews <dbl> 18, 127, 145, 27, 126, 253, 7, 24…
## $ last_review <date> 2018-11-11, 2019-09-05, 2019-06-…
## $ reviews_per_month <dbl> 0.25, 3.03, 1.16, 0.36, 1.08, 2.0…
## $ calculated_host_listings_count <dbl> 1, 5, 1, 1, 1, 2, 1, 4, 1, 2, 1, …
## $ availability_365 <dbl> 364, 170, 349, 262, 7, 33, 37, 30…
If a data set contains too much information for your current purposes, you can discard irrelevant (or unhelpful) rows and columns.
The function select allows to pick out a subset of columns. Interestingly, it can also be used to reorder columns, because the order in which column names are specified matches the order in the returned tibble.
By Selecting columns that we want, in the order that we want them:
id
, name
, host_id
, host_name
, room_type
, price
,Neighbourhood
, number_of_reviews
, last_review
, reviews_per_month
, availability_365
.
berlin_bnb_proc <- berlin_bnb %>%
select(id, name, host_id, host_name, room_type, price, Neighbourhood, number_of_reviews, last_review, reviews_per_month, availability_365)
By selecting columns that we do not want:
Suppose we do not want columns that have the ID info of the place and the host, i.e. id
, host_id
.
berlin_bnb_proc <- berlin_bnb %>%
select(-id, -host_id)
To select the colums in several functions within the tidyverse, such as pivot_longer() or select(), there are other useful helper functions from the tidyselect package. Here are some examples.
To select columns using their indices:
berlin_bnb_proc %>%
select(2:5)
## # A tibble: 24,422 x 4
## name host_id host_name room_type
## <chr> <dbl> <chr> <chr>
## 1 cafeheaven Pberg/Mitte/Wed for the s… 2164 Lulah Private room
## 2 Berlin-Mitte Value! Quiet courtyard/… 2217 Ion Entire home/…
## 3 Fabulous Flat in great Location 3718 Britta Entire home/…
## 4 BerlinSpot Schöneberg near KaDeWe 4108 Jana Private room
## 5 Stylish East Side Loft in Center wit… 16149 Steffen Entire home/…
## 6 BrightRoom with sunny greenview! 17391 BrightRoom Private room
## 7 Geourgeous flat - outstanding views 33852 Philipp Entire home/…
## 8 Apartment in Prenzlauer Berg 55531 Chris + Oli… Entire home/…
## 9 APARTMENT TO RENT 59666 Melanie Private room
## 10 In the Heart of Berlin - Kreuzberg 64696 Rene Entire home/…
## # … with 24,412 more rows
To select columns starting with some “String”:
berlin_bnb_proc %>%
select(starts_with("host"))
## # A tibble: 24,422 x 2
## host_id host_name
## <dbl> <chr>
## 1 2164 Lulah
## 2 2217 Ion
## 3 3718 Britta
## 4 4108 Jana
## 5 16149 Steffen
## 6 17391 BrightRoom
## 7 33852 Philipp
## 8 55531 Chris + Oliver
## 9 59666 Melanie
## 10 64696 Rene
## # … with 24,412 more rows
To select columns ending with some “String”:
berlin_bnb_proc %>%
select(ends_with("id"))
## # A tibble: 24,422 x 2
## id host_id
## <dbl> <dbl>
## 1 1944 2164
## 2 2015 2217
## 3 3176 3718
## 4 3309 4108
## 5 6883 16149
## 6 7071 17391
## 7 9991 33852
## 8 14325 55531
## 9 16401 59666
## 10 16644 64696
## # … with 24,412 more rows
To select columns containing some “String”:
berlin_bnb_proc %>%
select(matches('review'))
## # A tibble: 24,422 x 3
## number_of_reviews last_review reviews_per_month
## <dbl> <date> <dbl>
## 1 18 2018-11-11 0.25
## 2 127 2019-09-05 3.03
## 3 145 2019-06-27 1.16
## 4 27 2019-05-31 0.36
## 5 126 2019-09-08 1.08
## 6 253 2019-09-06 2.06
## 7 7 2019-07-15 0.14
## 8 24 2019-07-01 0.21
## 9 0 NA NA
## 10 48 2017-12-14 0.42
## # … with 24,412 more rows
You can also use this select(matches())
command to select columns with multiple string matches. To select all columns containing ‘host’ or ‘review’ in their names:
berlin_bnb_proc %>%
select(matches('host|review'))
## # A tibble: 24,422 x 5
## host_id host_name number_of_reviews last_review reviews_per_month
## <dbl> <chr> <dbl> <date> <dbl>
## 1 2164 Lulah 18 2018-11-11 0.25
## 2 2217 Ion 127 2019-09-05 3.03
## 3 3718 Britta 145 2019-06-27 1.16
## 4 4108 Jana 27 2019-05-31 0.36
## 5 16149 Steffen 126 2019-09-08 1.08
## 6 17391 BrightRoom 253 2019-09-06 2.06
## 7 33852 Philipp 7 2019-07-15 0.14
## 8 55531 Chris + Oliver 24 2019-07-01 0.21
## 9 59666 Melanie 0 NA NA
## 10 64696 Rene 48 2017-12-14 0.42
## # … with 24,412 more rows
If you have columns in your dataset that begin with a certain string, say ‘Week_’ and have integers following it, like: ‘Week_01’, ‘Week_02’, and so on, you may use:
your_dataset %>%
select(num_range("Week_0", 1:5))
To select rows by an index or a vector of indices, use the slice function:
berlin_bnb_proc %>%
# keep only entries from rows with an even index from 2 to 6
slice(c(2,4,6))
## # A tibble: 3 x 11
## id name host_id host_name room_type price Neighbourhood
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 2015 Berl… 2217 Ion Entire h… 60 Brunnenstr. …
## 2 3309 Berl… 4108 Jana Private … 28 Schöneberg-N…
## 3 7071 Brig… 17391 BrightRo… Private … 33 Helmholtzpla…
## # … with 4 more variables: number_of_reviews <dbl>, last_review <date>,
## # reviews_per_month <dbl>, availability_365 <dbl>
berlin_bnb_proc %>%
# for getting the last entry in the dataset
slice(n())
## # A tibble: 1 x 11
## id name host_id host_name room_type price Neighbourhood
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 3.87e7 Good… 6.04e7 Aline Entire h… 70 Friedrichsfe…
## # … with 4 more variables: number_of_reviews <dbl>, last_review <date>,
## # reviews_per_month <dbl>, availability_365 <dbl>
One can use the ‘$’ operator to pull one column/variable out of the dataset like:
# pulling the variable 'name' from the dataset that contains the names of all the places in the berlin airbnb database
berlin_bnb_proc$name
Equivalently, one can also use the pull function, like:
# pull out the 'name' variable from the dataset
berlin_bnb_proc%>% pull(name)
# pull out the 'name' variable from the dataset and compute its length
berlin_bnb_proc %>% pull(name) %>% length
## [1] 24422
# pull out the 'name' variable from the dataset and compute the number of unique entries
berlin_bnb_proc %>% pull(name) %>% unique %>% length
## [1] 23719
The function filter takes a Boolean expression and returns only those rows of which the Boolean expression is true.
Suppose we want to have some info only about rooms that are of the shared type:
unique(berlin_bnb_proc$room_type)
## [1] "Private room" "Entire home/apt" "Hotel room" "Shared room"
berlin_bnb_proc %>%
# keep only entries where the rooms are shared
filter(room_type=="Shared room") %>%
# show only the host_id and host_name columns for the filtered query
select(matches('host')) %>%
# If you also want the total count of the shared rooms in your database
print
## # A tibble: 300 x 2
## host_id host_name
## <dbl> <chr>
## 1 1391772 Sunflower Hostel
## 2 7345295 Marleen
## 3 8250486 Singer Hostel Berlin
## 4 8250486 Singer Hostel Berlin
## 5 8250486 Singer Hostel Berlin
## 6 8250486 Singer Hostel Berlin
## 7 10402919 Uwe
## 8 15374065 Zahari
## 9 18700789 Lena
## 10 21239469 Stefan
## # … with 290 more rows
Filter out all the places (names and ids) in ‘berlin_bnb_proc’ which have 0 reviews and store the new dataset in ‘berlin_bnb_notrecomm’.
To add a new column, or to change an existing one use mutate()
.
Suppose you want to label the places if they are expensive or not. One way to do it could be to create a new (logical) column Expensive
that stores TRUE if the place’s price
is more than 100 and FALSE otherwise.
berlin_bnb_proc %>%
mutate(
# add a new column called 'Expensive' based on the price
'expensive' = price >= 100
)
## # A tibble: 24,422 x 12
## id name host_id host_name room_type price Neighbourhood
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 1944 cafe… 2164 Lulah Private … 21 Brunnenstr. …
## 2 2015 Berl… 2217 Ion Entire h… 60 Brunnenstr. …
## 3 3176 Fabu… 3718 Britta Entire h… 90 Prenzlauer B…
## 4 3309 Berl… 4108 Jana Private … 28 Schöneberg-N…
## 5 6883 Styl… 16149 Steffen Entire h… 125 Frankfurter …
## 6 7071 Brig… 17391 BrightRo… Private … 33 Helmholtzpla…
## 7 9991 Geou… 33852 Philipp Entire h… 180 Prenzlauer B…
## 8 14325 Apar… 55531 Chris + … Entire h… 70 Prenzlauer B…
## 9 16401 APAR… 59666 Melanie Private … 120 Frankfurter …
## 10 16644 In t… 64696 Rene Entire h… 90 nördliche Lu…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## # last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## # expensive <lgl>
Now suppose you want to create a new column Price_range
that contains keywords:
cheap
if the price is less than or equal to 50moderate
if the price is greater than 50 but less than or equal to 100expensive
if the price is greater than 100You can use the function case_when()
that allows you to vectorise multiple if_else() statements as below:
berlin_bnb_proc %>%
mutate(
# add a new column called 'Expensive' based on the price
Price_range = case_when(
price<=50 ~ 'cheap',
price>50&price<=100 ~ 'moderate',
price>100 ~ 'expensive'
)
)
## # A tibble: 24,422 x 12
## id name host_id host_name room_type price Neighbourhood
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 1944 cafe… 2164 Lulah Private … 21 Brunnenstr. …
## 2 2015 Berl… 2217 Ion Entire h… 60 Brunnenstr. …
## 3 3176 Fabu… 3718 Britta Entire h… 90 Prenzlauer B…
## 4 3309 Berl… 4108 Jana Private … 28 Schöneberg-N…
## 5 6883 Styl… 16149 Steffen Entire h… 125 Frankfurter …
## 6 7071 Brig… 17391 BrightRo… Private … 33 Helmholtzpla…
## 7 9991 Geou… 33852 Philipp Entire h… 180 Prenzlauer B…
## 8 14325 Apar… 55531 Chris + … Entire h… 70 Prenzlauer B…
## 9 16401 APAR… 59666 Melanie Private … 120 Frankfurter …
## 10 16644 In t… 64696 Rene Entire h… 90 nördliche Lu…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## # last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## # Price_range <chr>
Note: Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general. For example, this won’t work:
berlin_bnb_proc %>%
mutate(
# add a new column called 'Expensive' based on the price
Price_range = case_when(
price<=50 ~ 'cheap',
price>50 ~ 'expensive',
price>50&price<=100 ~ 'moderate'
)
)
## # A tibble: 24,422 x 12
## id name host_id host_name room_type price Neighbourhood
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 1944 cafe… 2164 Lulah Private … 21 Brunnenstr. …
## 2 2015 Berl… 2217 Ion Entire h… 60 Brunnenstr. …
## 3 3176 Fabu… 3718 Britta Entire h… 90 Prenzlauer B…
## 4 3309 Berl… 4108 Jana Private … 28 Schöneberg-N…
## 5 6883 Styl… 16149 Steffen Entire h… 125 Frankfurter …
## 6 7071 Brig… 17391 BrightRo… Private … 33 Helmholtzpla…
## 7 9991 Geou… 33852 Philipp Entire h… 180 Prenzlauer B…
## 8 14325 Apar… 55531 Chris + … Entire h… 70 Prenzlauer B…
## 9 16401 APAR… 59666 Melanie Private … 120 Frankfurter …
## 10 16644 In t… 64696 Rene Entire h… 90 nördliche Lu…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## # last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## # Price_range <chr>
Note: NA values in the column do not get special treatment. If you want to explicitly handle NA values you can use the is.na
function.
Add a column ‘popularity’ with keywords: ‘very popular’, ‘moderately popular’, ‘less popular’ and ‘Not recommended’ based on number of reviews :
The strings for the types of rooms in the database are rather messy (e.g., ‘Entire room/apt’) and contain the word ‘room’ in every entry of the column room_type
. So let us make it more elegant and reader-friendly, using mutate()
, by deleting the word ‘room’ from all the entries and changing ‘Entire room/apt’ to ‘Apartment’.
Suppose you also wish to change the name of this column to type_of_room
and give an ordering (ordered factor) to the type of room as: Apartment < Private < Hotel < Shared.
Note: If you want to indicate a fixed order of the recurring elements in a (character) vector, e.g., for plotting in a particular order or an experimental variable that is a factorial design, you should make this column an ordered factor.
unique(berlin_bnb_proc$room_type)
## [1] "Private room" "Entire home/apt" "Hotel room" "Shared room"
berlin_bnb_proc <- berlin_bnb_proc %>%
mutate(
# change the entries in the column 'room_type'
room_type = case_when(
room_type=='Private room' ~ 'Private',
room_type=='Entire home/apt' ~ 'Apartment',
room_type=='Hotel room' ~ 'Hotel',
room_type=='Shared room' ~ 'Shared'
)
) %>%
# rename the column
rename(type_of_room=room_type) %>%
# turn the column into an ordered factor
mutate(
type_of_room=factor(
type_of_room,
ordered=T,
levels=c("Apartment","Private", "Hotel","Shared")
)
)
unique(berlin_bnb_proc$type_of_room)
## [1] Private Apartment Hotel Shared
## Levels: Apartment < Private < Hotel < Shared
Change the name of the ‘price’ column to ‘price_per_night(in€)’.
If you want to order a data set along a column, e.g., for inspection or printing as a table, then you can do that using the arrange()
function. You can specify several columns to sort alpha-numerically in ascending order, and also indicate a descending order using the desc()
function:
berlin_bnb_proc %>%
# arranges by price from lowest to highest, and equal price entries are decided by
# the number of reviews: highest to lowest (desc())
arrange(price, desc(number_of_reviews))
## # A tibble: 24,422 x 11
## id name host_id host_name type_of_room price Neighbourhood
## <dbl> <chr> <dbl> <chr> <ord> <dbl> <chr>
## 1 1.86e7 Doub… 4.94e7 Svenja Private 0 Marienfelde,…
## 2 2.07e7 Mode… 1.21e6 TRIO Apa… Apartment 0 Marzahn-Süd,…
## 3 2.04e7 Brig… 3.44e7 René Private 0 Frankfurter …
## 4 2.82e7 Hote… 2.13e8 Hotel Th… Hotel 0 Südliche Fri…
## 5 2.04e7 *Nic… 8.10e7 Isabell Private 0 Tempelhofer …
## 6 2.12e7 One … 1.02e7 Jeanne Private 0 Frankfurter …
## 7 2.27e7 Schö… 1.67e8 Raafat Private 1 Moabit Ost, …
## 8 1.63e7 Comf… 8.09e7 Paolo Private 8 Südliche Fri…
## 9 2.45e7 Nice… 8.09e7 Paolo Private 8 Südliche Fri…
## 10 2.19e7 Sing… 1.60e8 Dhia Shared 8 Albrechtstr.…
## # … with 24,412 more rows, and 4 more variables: number_of_reviews <dbl>,
## # last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>
Arrange all the entries with ‘Pankow’ neighbourhood group and ‘Private’ type of room based on the number of reviews from highest to lowest and and the availability_365, also from highest to lowest. Display only the place names and the price.
Suppose you have a dataset like below:
bnb_posting_ratings_messy <-
tribble(
~airbnb_posting, ~rating,
"Nice room near S-Bahnhof Schöneberg", "4,4.5,3,4.5",
"Spacious flat in the middle of Wedding", "3.5,4,4,5",
"Apartment for two people on Kudamm", "5,4.5,5,4.5"
) %>%
print
## # A tibble: 3 x 2
## airbnb_posting rating
## <chr> <chr>
## 1 Nice room near S-Bahnhof Schöneberg 4,4.5,3,4.5
## 2 Spacious flat in the middle of Wedding 3.5,4,4,5
## 3 Apartment for two people on Kudamm 5,4.5,5,4.5
This is not a useful representation. The review scores for each apartment are stored together in a single column, separated by comma and furthermore stored as a character vector.
To disentangle information in a single column, you could use the separate()
function:
bnb_posting_ratings_messy %>%
separate(
col = rating, # which column to split up
into = str_c("guest_rating", 1:4), # names of the new column to store results
sep = ",", # separate by which character / reg-exp
convert = T # convert = TRUE can automatically detect integer/numeric values
) %>%
print
## # A tibble: 3 x 5
## airbnb_posting guest_rating1 guest_rating2 guest_rating3 guest_rating4
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Nice room near S… 4 4.5 3 4.5
## 2 Spacious flat in… 3.5 4 4 5
## 3 Apartment for tw… 5 4.5 5 4.5
If you have reason to perform the reverse operation, i.e., join together several columns, use the unite()
function.
Look at the variable “Neighbourhood” in your dataset “berlin_bnb_proc”. Looks like we could split it into the two variables “address” and “district”.
A frequently occurring problem in data analysis is to obtain a summary statistic. You may be interested in calculating the average price of lodging in a certain district. Also, arrange it in the ascending order of the price.
grouped_avg_price <-berlin_bnb_proc %>%
# grouping by 'district'
group_by(district) %>%
# summarizing based on the grouping
summarize( # summarise() is typically used on grouped data created by
# group_by().The output will have one row for each group.
avg_price = mean(price)
) %>%
# arranging in ascending order of 'avg_price'
arrange(avg_price) %>%
print
## # A tibble: 13 x 2
## district avg_price
## <chr> <dbl>
## 1 Reinickendorf 49.1
## 2 Neukölln 50.1
## 3 Spandau 57.7
## 4 Wartenberg und Falkenberg 57.8
## 5 Lichtenberg 58.9
## 6 Treptow - Köpenick 59
## 7 Marzahn - Hellersdorf 59.4
## 8 Steglitz - Zehlendorf 62.1
## 9 Friedrichshain-Kreuzberg 66.0
## 10 Pankow 71.6
## 11 Mitte 83.1
## 12 Tempelhof - Schöneberg 97.3
## 13 Charlottenburg-Wilm. 112.
We can also use “mutate()” with grouping.
berlin_bnb_proc %>%
group_by(district) %>%
# generating a new column "avg_price" containing mean prices "district"-wise.
mutate(
avg_price = mean(price)
)
## # A tibble: 24,422 x 13
## # Groups: district [13]
## id name host_id host_name type_of_room price address district
## <dbl> <chr> <dbl> <chr> <ord> <dbl> <chr> <chr>
## 1 1944 cafe… 2164 Lulah Private 21 Brunne… Mitte
## 2 2015 Berl… 2217 Ion Apartment 60 Brunne… Mitte
## 3 3176 Fabu… 3718 Britta Apartment 90 Prenzl… Pankow
## 4 3309 Berl… 4108 Jana Private 28 Schöne… Tempelh…
## 5 6883 Styl… 16149 Steffen Apartment 125 Frankf… Friedri…
## 6 7071 Brig… 17391 BrightRo… Private 33 Helmho… Pankow
## 7 9991 Geou… 33852 Philipp Apartment 180 Prenzl… Pankow
## 8 14325 Apar… 55531 Chris + … Apartment 70 Prenzl… Pankow
## 9 16401 APAR… 59666 Melanie Private 120 Frankf… Friedri…
## 10 16644 In t… 64696 Rene Apartment 90 nördli… Friedri…
## # … with 24,412 more rows, and 5 more variables: number_of_reviews <dbl>,
## # last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
## # avg_price <dbl>
If you are looking for a short summary of the average price by district as computed above, then you just need to select the “avg_price” column from this new tibble and call summary()
. This command will give you information on the average price in each district of Berlin like: the max.-min. value of the average price, etc.
berlin_bnb_proc %>%
group_by(district) %>%
summarize(
avg_price = mean(price)
) %>%
select(avg_price) %>% summary()
## avg_price
## Min. : 49.10
## 1st Qu.: 57.80
## Median : 59.43
## Mean : 67.99
## 3rd Qu.: 71.64
## Max. :111.72
Note: It is important to remember that after a call of group_by, the resulting tibbles retains the grouping information for all subsequent operations.
Filter out all places in ‘Pankow’ which are ‘expensive’, group them by the ‘Type_of_room’ and produce a summary of the average prices.
Imagine you have a set of data. You then obtain another set of data that shares one variable with the former but also additional variables and values.
In the examplatory mini data sets below, we see that the two tribbles share the column “religion”. To conjoin these two sets, we may use the function “full_join()”.
“full_join()” will retain all information from both columns, even if not all values within the two columns to be conjoined is perfectly congruent.
religion <- tribble(
~religion, ~income, ~count,
"Atheist", ">50k", 58,
"Buddhist", ">50k", 21,
"Muslim", ">50k", 3,
"Atheist", ">20k", 70,
"Buddhist", ">20k", 234,
"Muslim", ">20k", 34)
religion2 <- tribble( # additional table with religion as a variable
~religion, ~savings,
"Atheist", ">30k",
"Buddhist", ">20k",
"Muslim", ">40k"
)
full_join(religion,religion2, by = "religion")
## # A tibble: 6 x 4
## religion income count savings
## <chr> <chr> <dbl> <chr>
## 1 Atheist >50k 58 >30k
## 2 Buddhist >50k 21 >20k
## 3 Muslim >50k 3 >40k
## 4 Atheist >20k 70 >30k
## 5 Buddhist >20k 234 >20k
## 6 Muslim >20k 34 >40k
Exercise: Full join the two datasets “berlin_bnb_proc”, “grouped_avg_price” by “district”.
Note: This gives the same output as when you computed the avg_price
variable in 5.10. using the mutate() command. This is because mutate() adds a new column to the dataset containing the summary statistic such that the computed statistic (avg_price
in this case) is added (repeatedly for each group) to each entry of the dataset.
What if your data is too wide or too long? Pivoting can go two ways: Making your data longer or making it wider.
pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns
Look at the dataset “religin_messy” (or alternatively, look at the inbuilt dataset “relig_income”, it is the exact same data). How could this data possibly be improved?
Exercise: Fill in the function for the arguments “cols”, “names_to” and “values_to” in the code below:
# data() loads all inbuild datasets
relig_income
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137
## 2 Atheist 12 27 37 52 35 70
## 3 Buddhist 27 21 30 34 33 58
## 4 Catholic 418 617 732 670 638 1116
## 5 Don’t k… 15 14 15 11 10 35
## 6 Evangel… 575 869 1064 982 881 1486
## 7 Hindu 1 9 7 9 11 34
## 8 Histori… 228 244 236 238 197 223
## 9 Jehovah… 20 27 24 24 21 30
## 10 Jewish 19 19 25 25 30 95
## 11 Mainlin… 289 495 619 655 651 1107
## 12 Mormon 29 40 48 51 56 112
## 13 Muslim 6 7 9 10 9 23
## 14 Orthodox 13 17 23 32 32 47
## 15 Other C… 9 7 11 13 13 14
## 16 Other F… 20 33 40 46 49 63
## 17 Other W… 5 2 3 4 2 7
## 18 Unaffil… 217 299 374 365 341 528
## # … with 4 more variables: `$75-100k` <dbl>, `$100-150k` <dbl>,
## # `>150k` <dbl>, `Don't know/refused` <dbl>
religin_longtidy <-relig_income %>%
pivot_longer(
cols = #your code here,
names_to = #your code here,
values_to = #your code here
) %>% print
The inverse transformation to “pivot_longer()” is “pivot_wider()”
“pivot_wider()” widens data, increasing the number of columns and decreasing the number of rows
“pivot_wider()” picks out two columns, one column of values to distribute into new to-be-created columns, and one vector of names or groups which contains the information about the names of the to-be-created new columns
I have artificially created a flawed dataset. How could we possibly improve it?
relig_income_toowide <-
tibble(religion = rep(c('Agnostic', 'Buddhist', 'Catholic'), times = 2),
income = rep(c('income/year', 'savings'), each = 3),
count = c('20-30k','30-40k','50-75k','>40k','>20k','>10k'))
relig_income_toowide
## # A tibble: 6 x 3
## religion income count
## <chr> <chr> <chr>
## 1 Agnostic income/year 20-30k
## 2 Buddhist income/year 30-40k
## 3 Catholic income/year 50-75k
## 4 Agnostic savings >40k
## 5 Buddhist savings >20k
## 6 Catholic savings >10k
Exercise: Fill in the function for “names_from” and “values_from” in the code below:
religin_widetidy <-relig_income_toowide %>%
pivot_wider(
names_from = #your code here,
values_from = #your code here) %>%
print
We clean the data in two consecutive steps:
Suppose we do not want to have entries in our dataset of places with 0 days availability. Perhaps, these are hosts that have stopped offering the airbnb services and instead of removing their listings from the airbnb website, they chose to change their availability to 0. Hence:
Remove all data of the places who have 0 in their availability_365 and also ‘NA’ in their last_review column.
berlin_bnb_proc<-berlin_bnb_proc %>%
filter(is.na(last_review)==FALSE,
availability_365!=0)
Suppose you wish to classify all hosts/places from the dataset which have a low ‘reviews_per_month’ value (see below) and which haven’t been reviewed since 01-September-2019 as ‘NOT ACTIVE’ while the rest as ‘ACTIVE’ hosts.
But first, look at the possible values of ‘reviews_per_month’ in the dataset, using:
berlin_bnb_proc %>%
select(reviews_per_month) %>% summary()
## reviews_per_month
## Min. : 0.02
## 1st Qu.: 0.46
## Median : 1.08
## Mean : 1.72
## 3rd Qu.: 2.38
## Max. :39.51
The range of ‘reviews_per_month’ varies between 0.02 and 39.51. Notice that if a place has less than 0.5 ‘reviews_per_month’, it means that on average this place was reviewed once every two months (indicating less activity of the host or less popularity of the place).
So, using this threshold value of 0.5 for the ‘reviews_per_month’, you can carry out the above filtering process!
Create a separate column in your dataset ‘Activity’ that stores this information. And then remove all ‘NOT ACTIVE’ places from the database and store this new dataset with all the information, except: id
, host_id
and Activity
columns.
berlin_bnb_proc<- berlin_bnb_proc %>%
mutate(
Activity = case_when(
reviews_per_month<0.5 & last_review<as.Date("2019-09-01") ~ 'NOT ACTIVE',
TRUE ~ 'ACTIVE'
)
) %>%
filter(
Activity=='ACTIVE'
) %>%
select(-id, -host_id, -Activity)
Glimpse at your dataset once again, before writing it to the system:
glimpse(berlin_bnb_proc)
## Observations: 8,204
## Variables: 10
## $ name <chr> "Berlin-Mitte Value! Quiet courtyard/very cent…
## $ host_name <chr> "Ion", "Britta", "Steffen", "BrightRoom", "Mat…
## $ type_of_room <ord> Apartment, Apartment, Apartment, Private, Apar…
## $ price <dbl> 60, 90, 125, 33, 49, 129, 70, 98, 160, 65, 209…
## $ address <chr> "Brunnenstr. Süd", "Prenzlauer Berg Südwest", …
## $ district <chr> "Mitte", "Pankow", "Friedrichshain-Kreuzberg",…
## $ number_of_reviews <dbl> 127, 145, 126, 253, 247, 74, 60, 76, 276, 129,…
## $ last_review <date> 2019-09-05, 2019-06-27, 2019-09-08, 2019-09-0…
## $ reviews_per_month <dbl> 3.03, 1.16, 1.08, 2.06, 2.12, 0.87, 0.55, 0.67…
## $ availability_365 <dbl> 170, 349, 7, 33, 269, 148, 109, 107, 233, 194,…
getwd()
## [1] "/home/tah/Documents/IDA2019/tutorials"
write_csv(berlin_bnb_proc, "./berlin_airbnb_processed.csv")