When we look at a situation in terms of probability, we first need to define a few terms. Let’s say I have a bag of M&M’s:
And I want to know the distribution of colours within the bag, i.e. the probability of drawing each colour at random. This act of randomly drawing a single M&M and looking at its colour is an example of a random process. Any random process is defined by its elementary outcomes, i.e. the set of all of its possible “results”. For our M&M example this would be \[\Omega_{MnM} = \{brown,\ blue,\ orange,\ red\}\]
Any subset \(A\) of \(\Omega\) (\(A \subseteq \Omega\)) is called an event, e.g. picking a red or orange M&M is defined as the event \[A = \{orange,\ red\}\] We can directly see that the probability of that event is 0.5 (5/10), whereas the probability of picking a blue M&M is 0.2 (2/10). When probabilities cannot be inferred easily with the naked eye, we use probability distributions. These are functions that assign a probability to any event you throw at them.
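As a small illustration, here is how event probabilities could be computed in R from colour counts. Only the totals used above (10 M&M's, of which 2 are blue and 5 are orange or red) are given; the exact split between orange, red, and brown below is an assumption made purely for this sketch.
# Assumed colour counts for a bag of 10 M&M's (the orange/red/brown split is made up)
bag <- c(brown = 3, blue = 2, orange = 2, red = 3)
# Probability of an elementary outcome, e.g. drawing a blue M&M
bag["blue"] / sum(bag)      # 0.2
# Probability of an event = sum of the probabilities of its elementary outcomes
A <- c("orange", "red")
sum(bag[A]) / sum(bag)      # 0.5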
Let’s look at a standard card deck with 52 cards and consider the random process of picking one card out of the deck.
What is \(\Omega\) and what does it mean when I say that \(P(\Omega) = 1\)? Why is the statement true?
# YOUR ANSWER HERE
Name a few events that could happen as a result of our random process.
# YOUR ANSWER HERE
What is the probability of picking a black card?
# YOUR ANSWER HERE
What is the probability of picking a Queen?
# YOUR ANSWER HERE
What is the probability of picking a spade or a red King?
# YOUR ANSWER HERE
With many problems in the real world, the corresponding probability distribution cannot be described in such a discrete and straightforward way as we have done for the deck of cards. This is why we need a way to approximate probability distributions, which we can do by working with a function that returns a representative sample from the distribution.
sample_size <- 100
x <- seq(-5, 5, length.out = sample_size)
# Returns the true densities for each value (not actual samples), stays constant across different trials
# Invariant to changes in sample size
y_dist <- dnorm(x, mean = 0, sd = 1)
qplot(x, y_dist, geom = "line")
y_dist <- dnorm(x, mean = 0, sd = 1)
qplot(x, y_dist, geom = "line")
y_dist <- dnorm(x, mean = 0, sd = 1)
qplot(x, y_dist, geom = "line")
y_dist <- dnorm(x, mean = 0, sd = 1)
qplot(x, y_dist, geom = "line")
# Returns random samples from the true distribution, the resulting density plot looks slightly different with each new sample
# Sensitive to changes in sample_size
y_sample <- rnorm(sample_size, mean = 0, sd = 1)
qplot(y_sample, geom = "density")
y_sample <- rnorm(sample_size, mean = 0, sd = 1)
qplot(y_sample, geom = "density")
y_sample <- rnorm(sample_size, mean = 0, sd = 1)
qplot(y_sample, geom = "density")
y_sample <- rnorm(sample_size, mean = 0, sd = 1)
qplot(y_sample, geom = "density")
# Binomial distribution
x_axis <- 1:50
binom_dist_05 <- tibble(value = dbinom(x_axis, size = length(x_axis), prob = 0.5))
binom_dist_07 <- tibble(value = dbinom(x_axis, size = length(x_axis), prob = 0.7))
binom_dist_01 <- tibble(value = dbinom(x_axis, size = length(x_axis), prob = 0.1))
# binom_dist_30_033 <- tibble(value = dbinom(x_axis, size = 30, prob = 0.33))
ggplot(mapping = aes(x = x_axis, y = value)) +
geom_line(data = binom_dist_05, color = "red") +
geom_line(data = binom_dist_07, color = "blue") +
geom_line(data = binom_dist_01, color = "green")
# Poisson distribution
x_axis <- 1:50
pois_dist_1 <- tibble(value = dpois(x_axis, lambda = 1))
pois_dist_4 <- tibble(value = dpois(x_axis, lambda = 4))
pois_dist_10 <- tibble(value = dpois(x_axis, lambda = 10))
ggplot(mapping = aes(x = x_axis, y = value)) +
geom_line(data = pois_dist_1, color = "red") +
geom_line(data = pois_dist_4, color = "blue") +
geom_line(data = pois_dist_10, color = "green")
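Just as rnorm draws random samples from the normal distribution above, rbinom and rpois draw samples from the binomial and Poisson distributions. A short sketch (the sample size of 1000 is an arbitrary choice for illustration):
# Random samples instead of true densities; the plots vary from run to run
binom_sample <- rbinom(1000, size = 50, prob = 0.5)
pois_sample <- rpois(1000, lambda = 4)
qplot(binom_sample, geom = "bar")
qplot(pois_sample, geom = "bar")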
“Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.”
Rank the following statements in order of their probability:
1. Linda is active in the feminist movement
2. Linda is a bank teller
3. Linda is a bank teller and is active in the feminist movement
# YOUR ANSWER HERE
If you rated statement 3 as more likely than statement 2, you are unfortunately wrong. But you are definitely not alone: 85% of participants in Tversky and Kahneman’s original study (1983) did the same thing. It is much easier to recognise this so-called Conjunction Fallacy once you visualise the 3 statements in a Venn diagram:
The third statement can never be more likely than the second one because the third places two restrictions on Linda (she is a bank teller and a feminist at the same time), whereas the second only places one constraint (she is a bank teller, and she may or may not be a feminist).
Marginal probabilities refer to the probability of an event occurring on its own, without considering any other possible events, e.g. the marginal probability of Linda being a bank teller is \[P(bank) = 0.2\] and the marginal probability of Linda being active in the feminist movement is \[P(feminist) = 0.9\]
Joint probabilities refer to the probability of two (or more) events occurring at the same time. If both events are independent from one another, i.e. the occurrence of one does not influence the occurrence of the other in any way, the joint probability is simply the product of both marginal probabilities, e.g. the probability of Linda being both a bank teller and active in the feminist movement is \[P(bank, feminist) = P(bank) \cdot P(feminist) = 0.2 \cdot 0.9 = 0.18\]
Conditional probabilities refer to the probability of an event given that you already know the outcome of another event. This is actually how we have been calculating the marginal and joint probabilities so far. We implicitly had Linda’s description in mind when defining them. In reality, the probability of being a bank teller or a feminist should be much lower because we need to account for the whole population, not just for people that match Linda’s description. Put in numbers: \[P(feminist) = 0.15, \ P(feminist \ | \ description) = 0.9\]
Possible explanation for the Conjunction Fallacy: many people might have interpreted the first two statements as implicitly excluding the event that is not mentioned:
1. Linda is active in the feminist movement
2. Linda is a bank teller
3. Linda is a bank teller and is active in the feminist movement
Let’s just assume \(P(feminist) = 0.9, \ P(\neg feminist) = 0.1, \ P(bank) = 0.2, \ P(\neg bank) = 0.8\) to make the calculation easier:
\[1.\ P(feminist, \neg bank) = P(feminist) \cdot P(\neg bank) = 0.9 \cdot 0.8 = 0.72\] \[2.\ P(bank, \neg feminist) = P(bank) \cdot P(\neg feminist) = 0.2 \cdot 0.1 = 0.02\] \[3.\ P(bank, feminist) = P(bank) \cdot P(feminist) = 0.2 \cdot 0.9 = 0.18\]
As you can see, if you assume most participants interpreted the 3 statements in the way shown above, it does make sense to rate 3 as more likely than 2 (even though we have shown the ranking to be wrong under the literal reading). But in any case, we mention the Conjunction Fallacy here for illustration purposes only; it is not relevant to the homework or the exam.
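For completeness, the same arithmetic in R, using the marginal probabilities assumed above:
p_feminist <- 0.9
p_bank <- 0.2
# Joint probabilities under independence and the "exclusive" reading of statements 1 and 2
p_feminist * (1 - p_bank)   # statement 1: feminist and not a bank teller -> 0.72
p_bank * (1 - p_feminist)   # statement 2: bank teller and not a feminist -> 0.02
p_bank * p_feminist         # statement 3: bank teller and feminist       -> 0.18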
Let’s get back to our standard card deck and consider the structured event of rolling a die and drawing a card from a different suit depending on the number we rolled. If we roll a 1 or a 2, we’ll draw a heart.
If we roll a 3 or a 4, we’ll draw a diamond.
If we roll a 5, we’ll draw a club.
If we roll a 6 we’ll draw a spade.
What are the marginal probabilities?
# YOUR ANSWER HERE
What is the conditional probability P(face | hearts)?
# YOUR ANSWER HERE
What is the conditional probability P(face | spades)? Why is this equal to P(face | hearts)?
# YOUR ANSWER HERE
For those on the subjective-interpretation side of probabilities, Bayes’ Rule is an indispensable tool! For Bayesians, probabilities quantify degrees of belief (assumptions), and based on these degrees of belief, Bayesians perform inference. By the definition of conditional probability, \[P(A \mid B) = \frac{P(A, B)}{P(B)}\]
But, \[P(A, B) = P(B \mid A) \cdot P(A)\]
Thus, \[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}\]
In the field of data analysis, one of the primary goals is to analyze and then explain the data. We work with models (potential “causes”) to explain our data (“observables”): what we would like to know is \(P(model \mid data)\).
But, what we can usually specify directly is the likelihood \(P(data \mid model)\) and a prior \(P(model)\).
Finally, Bayes’ rule lets us get from the quantities we can specify to the quantity we want.
When thinking in terms of models to explain our data, Bayes’ rule becomes: \[P(model \mid data) = \frac{P(data \mid model) \cdot P(model)}{P(data)}\]
We get to update our beliefs as more evidence/data accumulates! In other words, with the accumulation of each data point, the posterior distribution becomes the prior in an iterative process.
The essence of Bayesian inference is to be “less wrong” as more data accumulates.
The result of a medical test could be positive or negative (something we observe). Let’s say it is positive.
The result may be caused by the presence of the disease (a true positive) or, since the testing device is inherently error-prone, by its absence (a false positive).
Our Goal: To determine the degree of belief that the disease is actually present when the test is positive, i.e. \(P(disease \mid positive)\).
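A minimal sketch of this computation in R. None of the numbers below are given above; the prevalence, sensitivity, and false-positive rate are assumptions chosen purely for illustration.
# Assumed numbers, for illustration only
p_disease <- 0.01                      # prior: prevalence of the disease
p_pos_given_disease <- 0.95            # P(positive | disease), sensitivity
p_pos_given_healthy <- 0.05            # P(positive | no disease), false-positive rate
# Marginal probability of a positive test (law of total probability)
p_pos <- p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
# Bayes' rule: posterior probability of disease given a positive test
p_disease_given_pos <- p_pos_given_disease * p_disease / p_pos
p_disease_given_pos                    # roughly 0.16 with these made-up numbers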
Suppose you wish to classify emails as Spam or Not Spam based on a finite set of keywords appearing in the email.
Let us suppose, for simplicity, that we only need to check for the presence or absence of two keywords \(\{K_1, K_2\}\) in order to classify the email.
Our Goal: To determine whether an email should be classified as Spam or Not Spam after looking for the presence of the two keywords. Let us say: \(K_1 = present (1), K_2 = absent (0)\) in the email.
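A sketch of how this classification could be done with Bayes’ rule. All probabilities below are assumptions for illustration (they are not given above), and the two keywords are treated as conditionally independent given the class, i.e. a naive Bayes assumption:
# All numbers below are assumptions for illustration only
p_spam <- 0.4                                   # prior: P(spam)
p_k_given_spam <- c(K1 = 0.7, K2 = 0.5)         # P(keyword present | spam)
p_k_given_ham  <- c(K1 = 0.1, K2 = 0.4)         # P(keyword present | not spam)
# Observation: K1 present (1), K2 absent (0)
obs <- c(K1 = 1, K2 = 0)
# Likelihood of the observation under each class (conditional independence assumed)
lik_spam <- prod(ifelse(obs == 1, p_k_given_spam, 1 - p_k_given_spam))
lik_ham  <- prod(ifelse(obs == 1, p_k_given_ham, 1 - p_k_given_ham))
# Posterior P(spam | K1 = 1, K2 = 0) via Bayes' rule
lik_spam * p_spam / (lik_spam * p_spam + lik_ham * (1 - p_spam))   # about 0.8 here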
The entire output of a factory is produced on three machines M1, M2, M3. The three machines account for 20%, 30%, and 50% of the factory output. The fraction of defective items produced is 5% for the first machine; 3% for the second machine; and 1% for the third machine.
If an item is chosen at random from the total output and is found to be defective, what is the probability that it was produced by the third machine?
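This is a direct application of Bayes’ rule; a short sketch of the computation in R:
# Factory output shares and defect rates per machine (from the problem statement)
p_machine <- c(M1 = 0.20, M2 = 0.30, M3 = 0.50)
p_def_given_machine <- c(M1 = 0.05, M2 = 0.03, M3 = 0.01)
# Marginal probability of a defective item (law of total probability)
p_def <- sum(p_def_given_machine * p_machine)          # 0.024
# Bayes' rule: P(M3 | defective)
p_def_given_machine["M3"] * p_machine["M3"] / p_def    # about 0.208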
Simply put, a Random Variable is a variable whose values depend on a random/stochastic process.
A Random Variable \(X\) must be conceptualized as a function whose:
domain is a sample space \(\Omega\) of possible outcomes of a random phenomenon, and
range is a set of numerical values (typically a subset of the real numbers) that \(X\) can take.
Since a random variable’s values are determined by a stochastic process, every random variable is characterized by a probability distribution.
If the Random Variable \(X\) is:
1. discrete, its probability distribution is described by a probability mass function (PMF),
2. continuous, its probability distribution is described by a probability density function (PDF).
NOTE: whenever we talk about random variables, there are always two probability distributions involved. One is the distribution of the random/stochastic process that generates the values of the random variable (the not so interesting one), and the other is the distribution of the random variable itself (the one we are more interested in).
The Expectation \(\mathbb{E}_X\) of a random variable \(X\) represents the average of a large number of independent realizations of \(X\).
For random variables which are:
1. discrete: \(\mathbb{E}_X = \sum_{i} x_i \cdot f_X(x_i)\), where \(f_X\) is the probability mass function of \(X\),
2. continuous: \(\mathbb{E}_X = \int x \cdot f_X(x) \, dx\), where \(f_X\) is the probability density function of \(X\).
Suppose \(X\) is a discrete random variable that takes values from the finite set \(S = \{0,1,2,3,4\}\) with the following probabilities:
## # A tibble: 5 x 2
## x px
## <dbl> <dbl>
## 1 0 0.2
## 2 1 0.2
## 3 2 0.2
## 4 3 0.2
## 5 4 0.2
## [1] 1 4 0 1 1 3 2 4 2 3
## [1] 2.1
Above, 10 realizations of \(X\) were drawn and their average was 2.1. As the number of samples increases to 10000, the average of these realizations of \(X\) gets closer and closer to the mean of \(\{0,1,2,3,4\}\), i.e. 2.
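The code that produced the samples above is not shown here, but a sketch along the following lines would do the same thing (the exact call used is an assumption):
# Draw realizations of X and average them; with 10000 draws the mean approaches 2
x_vals <- 0:4
px <- rep(0.2, 5)
realizations <- sample(x_vals, size = 10000, replace = TRUE, prob = px)
mean(realizations)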
If \(X\) were biased (more likely) towards taking the value 4 such that:
## # A tibble: 5 x 2
## x px_biased
## <dbl> <dbl>
## 1 0 0.05
## 2 1 0.05
## 3 2 0.05
## 4 3 0.05
## 5 4 0.8
and if now you were to sample 10000 realizations of \(X_{biased}\), what would your average value be close to?
## [1] 3.5045
It seems like the average of \(X_{biased}\) now has a tendency to converge to 3.5.
In fact, 3.5 is the expected value of the biased random variable \(X_{biased}\), as calculated below:
\(\mathbb{E}_{X_{biased}} = \sum_{i} x_i \cdot f_{X_{biased}}(x_i) = 0\times0.05 + 1\times0.05 + 2\times0.05 + 3\times0.05 + 4\times0.80\) \(\mathbb{E}_{X_{biased}} = 0 + 0.05 + 0.10 + 0.15 + 3.2 = 3.5\)
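The same value also falls out of a direct probability-weighted sum in R:
# Expected value of X_biased as a probability-weighted sum
sum(c(0, 1, 2, 3, 4) * c(0.05, 0.05, 0.05, 0.05, 0.8))   # 3.5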
The Variance \(Var(X)\) of a random variable \(X\) represents how far a large number of independent realizations of \(X\) are spread out from its expected value.
For random variables which are:
1. discrete: \(Var(X) = \sum_{i} (x_i - \mathbb{E}_X)^2 \cdot f_X(x_i)\),
2. continuous: \(Var(X) = \int (x - \mathbb{E}_X)^2 \cdot f_X(x) \, dx\).
Consider again \(X\) as a discrete random variable such that it takes values from the finite set \(S\) = {0,1,2,3,4}.
## [1] 2.009856
The sample variance is close to 2, which is the true variance of \(X\): \(Var(X) = 0.2\times(0-2)^2 + 0.2\times(1-2)^2 + 0.2\times(2-2)^2 + 0.2\times(3-2)^2 + 0.2\times(4-2)^2\) \(Var(X) = 0.8 + 0.2 + 0 + 0.2 + 0.8 = 2.0\)
When \(X\) is biased towards 4 as above (\(X_{biased}\)) and you take the variance of 10000 sampled values of \(X_{biased}\):
## [1] 1.241248
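For comparison, the true variance of \(X_{biased}\), computed with the same formula, is 1.25, which the sample variance above is approaching: \(Var(X_{biased}) = 0.05\times(0-3.5)^2 + 0.05\times(1-3.5)^2 + 0.05\times(2-3.5)^2 + 0.05\times(3-3.5)^2 + 0.8\times(4-3.5)^2\) \(Var(X_{biased}) = 0.6125 + 0.3125 + 0.1125 + 0.0125 + 0.2 = 1.25\)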
Rolling two dice: there are 36 possible outcomes, each equally likely (probability 1/36).
The possible values of the sum of the two dice are \(\{2,3,4,5,6,7,8,9,10,11,12\}\).
Probability distribution of the sum of numbers:
## # A tibble: 11 x 2
## sum probability
## <dbl> <dbl>
## 1 2 0.0278
## 2 3 0.0556
## 3 4 0.0833
## 4 5 0.111
## 5 6 0.139
## 6 7 0.167
## 7 8 0.139
## 8 9 0.111
## 9 10 0.0833
## 10 11 0.0556
## 11 12 0.0278
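The table above is used below through an object called diceroll_sum. The code that builds it is not shown; a sketch of how it could be constructed (the column names sum and probability are taken from the calls below, the construction itself is an assumption):
# Enumerate all 36 equally likely outcomes of two dice and tabulate their sums
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)
diceroll_sum <- outcomes %>%
  mutate(sum = die1 + die2) %>%
  count(sum) %>%
  mutate(probability = n / 36) %>%
  select(sum, probability)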
# Expected value: probability-weighted sum of the possible values
expectation <- function(values, prob){
  E <- sum(values * prob)
  return(E)
}
expectation(diceroll_sum$sum, diceroll_sum$probability)
## [1] 7
# Variance: probability-weighted squared deviations from the expected value
variance <- function(values, prob){
  v <- sum((values - sum(values * prob))^2 * prob)
  return(v)
}
variance(diceroll_sum$sum, diceroll_sum$probability)
## [1] 5.833333
#variance(rep(1,36), diceroll$probability)
The Cumulative Distribution Function \(F_X(x)\) gives the probability that the value of a random variable \(X\) is less than or equal to \(x\).
For random variables which are:
1. discrete: \(F_X(x) = \sum_{x_i \leq x} f_X(x_i)\),
2. continuous: \(F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt\).
diceroll_sum %>%
  ggplot(mapping = aes(x = sum, y = cumsum(probability))) +
  geom_col(fill = "#0072B2") +
  scale_x_continuous(name = "sum of the outcomes of the two dice", breaks = 2:12) +
  ylab("cumulative probability") +
  ggtitle("Cumulative Distribution of the Sum of Numbers on a Roll of Two Dice")
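Reading the plot: for example, the probability that the sum of the two dice is at most 7 can be computed directly from the table:
# F_X(7) = P(sum <= 7)
sum(diceroll_sum$probability[diceroll_sum$sum <= 7])   # 21/36, about 0.583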