
So far, we have defined a probability distribution as a function that assigns a probability to each subset of the space \(\Omega\) of elementary outcomes. We saw that rational beliefs should conform to certain axioms, reflecting a “logic of rational beliefs”. But in data analysis, we are often interested in a space of numeric outcomes. You are probably familiar with the “normal distribution”, which is a distribution over the real numbers. In keeping with our previous definition of probability as defined on a measurable set \(\Omega\), we introduce what we could sloppily call “probability distributions over numbers” using the concept of random variables. Caveat: random variables are very useful concepts and offer highly versatile notation, but both concept and notation can be elusive in the beginning.

Formally, a **random variable** is a function \(X \ \colon \ \Omega \rightarrow \mathbb{R}\) that assigns to each elementary outcome a numerical value.
It is reasonable to think of this number as a **summary statistic**: a number that captures one aspect of relevance of what is actually a much more complex chunk of reality.

Traditionally, random variables are represented by capital letters, like \(X\). The numeric values they take on are written as small letters, like \(x\).

We write \(P(X = x)\) as a shorthand for the probability \(P(\left \{ \omega \in \Omega \mid X(\omega) = x \right \})\) that an elementary outcome \(\omega\) occurs which is mapped onto \(x\) by the random variable \(X\). For example, if our coin is fair, then \(P(X_{\text{two flips}} = x) = 0.5\) for \(x=1\) and \(0.25\) for \(x \in \{0,2\}\). Similarly, we can also write \(P(X \le x)\) for the probability of observing any outcome that \(X\) maps to a number no greater than \(x\).
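The shorthand can be unpacked in a few lines of R. This is a minimal sketch, assuming that \(X_{\text{two flips}}\) counts the number of heads in two flips of a fair coin (which matches the probabilities stated above); the names `omega`, `prob`, `X`, `p_X_equals` and `p_X_at_most` are purely illustrative.

```r
# elementary outcomes of two flips of a fair coin (H = heads, T = tails)
omega <- c("HH", "HT", "TH", "TT")
# each elementary outcome is equally likely
prob <- rep(1 / 4, times = 4)
names(prob) <- omega
# the random variable X maps each elementary outcome to its number of heads
X <- c(HH = 2, HT = 1, TH = 1, TT = 0)
# P(X = x): total probability of all outcomes that X maps onto x
p_X_equals <- function(x) sum(prob[X == x])
# P(X <= x): total probability of all outcomes that X maps to at most x
p_X_at_most <- function(x) sum(prob[X <= x])
p_X_equals(1)
p_X_at_most(1)
```

The two calls at the end return the probability of seeing exactly one head and of seeing at most one head, respectively.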

If the range of \(X\) is countable (not necessarily finite), we say that \(X\) is **discrete**. For ease of exposition, we may say that if the range of \(X\) is an interval of real numbers, \(X\) is called **continuous**.

For a discrete random variable \(X\), the **cumulative distribution function** \(F_X\) associated with \(X\) is defined as:
\[
F_X(x) = P(X \le x) = \sum_{x' \in \left \{ x'' \in \text{range}(X) \mid x'' \le x \right \}} P(X = x')
\]
The **probability mass function** \(f_X\) associated with \(X\) is defined as:
\[
f_X(x) = P(X = x)
\]

**Example.** Suppose we flip a coin with a bias of \(\theta\) towards heads \(n\) times. What is the probability that we will see heads \(k\) times? If we map the outcome of heads to 1 and tails to 0, this probability is given by the Binomial distribution, as follows:
\[
\text{Binom}(K = k ; n, \theta) = \binom{n}{k} \, \theta^{k} \, (1-\theta)^{n-k}
\]
Here \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) is the binomial coefficient, which gives the number of possibilities of drawing an unordered subset with \(k\) elements from a set with a total of \(n\) elements. Figure 7.3 gives examples of the Binomial distribution, concretely its probability mass functions, for two values of the coin’s bias, \(\theta = 0.25\) or \(\theta = 0.5\), when flipping the coin \(n=24\) times. Figure 7.4 gives the corresponding cumulative distributions.
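To connect the formula to R, the following sketch writes the Binomial probability mass function by hand and compares it with the built-in functions `dbinom()` and `pbinom()`. The helper name `binom_pmf` is made up for this illustration.

```r
# hand-written Binomial probability mass function, following the formula above
binom_pmf <- function(k, n, theta) {
  choose(n, k) * theta^k * (1 - theta)^(n - k)
}
n <- 24
theta <- 0.25
k <- 0:n
pmf_by_hand <- binom_pmf(k, n, theta)
# cumulative distribution function: running sum of the probability mass
cdf_by_hand <- cumsum(pmf_by_hand)
# both comparisons should agree up to numerical tolerance
all.equal(pmf_by_hand, dbinom(k, size = n, prob = theta))
all.equal(cdf_by_hand, pbinom(k, size = n, prob = theta))
```

Taking the running sum with `cumsum()` mirrors the definition of \(F_X\) for discrete random variables: the cumulative probability at \(k\) adds up the mass of all values up to and including \(k\).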

For a continuous random variable \(X\), the probability \(P(X = x)\) will usually be zero: it is virtually impossible that we will see precisely the value \(x\) realized in a random event that can realize uncountably many numerical values of \(X\). However, \(P(X \le x)\) does usually take non-zero values and so we define the cumulative distribution function \(F_X\) associated with \(X\) as:
\[
F_X(x) = P(X \le x)
\]
Instead of a probability **mass** function, we derive a **probability density function** from the cumulative distribution function as its derivative:
\[
f_X(x) = F_X'(x)
\]
A probability density function can take values greater than one, unlike a probability mass
function.

**Example.** The Gaussian (Normal) distribution characterizes many natural distributions of measurements which are symmetrically spread around a central tendency. It is defined as:
\[
\mathcal{N}(X = x ; \mu, \sigma) = \frac{1}{\sqrt{2 \sigma^2 \pi}} \exp \left ( -
\frac{(x-\mu)^2}{2 \sigma^2} \right)
\]
where parameter \(\mu\) is the *mean*, the central tendency, and parameter \(\sigma\) is the *standard deviation*. Figure 7.5 gives examples of the probability density function of two normal distributions. Figure 7.6 gives the corresponding cumulative distribution functions.
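The relation between density and cumulative distribution function can be checked numerically. The sketch below writes the Gaussian density by hand, compares it with R's `dnorm()`, approximates the slope of `pnorm()` with a central difference, and shows a density value above one. The helper name `gaussian_density` and the chosen values of `x0`, `h` and the parameters are assumptions made for illustration only.

```r
# Gaussian density written out by hand, following the formula above
gaussian_density <- function(x, mu, sigma) {
  1 / sqrt(2 * sigma^2 * pi) * exp(-(x - mu)^2 / (2 * sigma^2))
}
x <- seq(-3, 3, by = 0.5)
# the hand-written density should agree with the built-in dnorm()
all.equal(gaussian_density(x, mu = 0, sigma = 1), dnorm(x, mean = 0, sd = 1))

# the density is the slope of the cumulative distribution function,
# here approximated numerically with a central difference
x0 <- 0.05
h <- 1e-5
slope <- (pnorm(x0 + h, mean = 0, sd = 0.1) -
  pnorm(x0 - h, mean = 0, sd = 0.1)) / (2 * h)
c(slope = slope, density = dnorm(x0, mean = 0, sd = 0.1))

# with a small standard deviation, the density at the mean exceeds 1
gaussian_density(0, mu = 0, sigma = 0.1)
```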

The **expected value** of a random variable \(X\) is a measure of central tendency. It tells us, like the name suggests, which average value of \(X\) we can expect when repeatedly sampling from \(X\). If \(X\) is discrete, the expected value is:
\[
\mathbb{E}_X = \sum_{x} x \times f_X(x)
\]
If \(X\) is continuous, it is:
\[
\mathbb{E}_X = \int x \times f_X(x) \ \text{d}x
\]
The expected value is also frequently called the **mean**.
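As a small illustration of the discrete case, the following sketch computes the expected number of heads in two flips of a fair coin and compares it with the average of many samples. The variable names, the seed and the sample size are arbitrary choices for this example.

```r
# values and probability mass of a discrete random variable:
# the number of heads in two flips of a fair coin
x <- c(0, 1, 2)
f_x <- c(0.25, 0.5, 0.25)
# expected value as the probability-weighted sum of the values
expected_value <- sum(x * f_x)
expected_value
# the average of many samples should lie close to the expected value
set.seed(123)
samples <- sample(x, size = 1e5, replace = TRUE, prob = f_x)
mean(samples)
```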

The **variance** of a random variable \(X\) is a measure of how much likely values of \(X\) are spread or clustered around the expected value. If \(X\) is discrete, the variance is:
\[
\text{Var}(X) = \sum_x (\mathbb{E}_X - x)^2 \times f_X(x) = \mathbb{E}_{X^2} -\mathbb{E}_X^2
\]
If \(X\) is continuous, it is:
\[
\text{Var}(X) = \int (\mathbb{E}_X - x)^2 \times f_X(x) \ \text{d}x = \mathbb{E}_{X^2} -\mathbb{E}_X^2
\]
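Both ways of writing the variance can be checked against each other on a small discrete example. The sketch below again uses the number of heads in two flips of a fair coin; the variable names are illustrative.

```r
# number of heads in two flips of a fair coin
x <- c(0, 1, 2)
f_x <- c(0.25, 0.5, 0.25)
# expected value of X
e_x <- sum(x * f_x)
# variance as the expected squared deviation from the expected value
var_definition <- sum((e_x - x)^2 * f_x)
# variance as the expected value of X^2 minus the squared expected value of X
var_shortcut <- sum(x^2 * f_x) - e_x^2
c(definition = var_definition, shortcut = var_shortcut)
```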

**Example.** If we flip a coin with bias \(\theta = 0.25\) a total of \(n=24\) times, we expect on average to see \(n \times\theta = 24 \times 0.25 = 6\) outcomes showing heads.^{42} The variance of a binomially distributed variable is \(n \times\theta \times(1-\theta) = 24 \times 0.25 \times 0.75 = \frac{24 \times 3}{16} = \frac{18}{4} = 4.5\).
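These closed-form results can be cross-checked by applying the general definitions of expected value and variance to the Binomial probability mass function, here obtained from R's `dbinom()`. The variable names are chosen for this sketch only.

```r
n <- 24
theta <- 0.25
k <- 0:n
# probability mass of each possible number of heads
f_k <- dbinom(k, size = n, prob = theta)
# expected value and variance from the general definitions
e_k <- sum(k * f_k)
var_k <- sum(k^2 * f_k) - e_k^2
# compare with the closed-form expressions n * theta and n * theta * (1 - theta)
c(expected_value = e_k, closed_form = n * theta)
c(variance = var_k, closed_form = n * theta * (1 - theta))
```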

The expected value of a normal distribution is just its mean \(\mu\) and its variance is \(\sigma^2\).
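For the continuous case, the integrals can be approximated numerically with R's `integrate()`. This is only a sketch: the parameter values \(\mu = 2\) and \(\sigma = 1.5\) are arbitrary choices, and numerical integration returns approximations rather than exact values.

```r
mu <- 2
sigma <- 1.5
# expected value: integral of x * f_X(x) over the real line
e_x <- integrate(
  function(x) x * dnorm(x, mean = mu, sd = sigma),
  lower = -Inf, upper = Inf
)$value
# variance: integral of (E_X - x)^2 * f_X(x) over the real line
var_x <- integrate(
  function(x) (e_x - x)^2 * dnorm(x, mean = mu, sd = sigma),
  lower = -Inf, upper = Inf
)$value
# should be close to mu and sigma^2, respectively
c(expected_value = e_x, variance = var_x)
```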

**Exercise 7.5**

- Compute the expected value and variance of a fair die.

```
expected_value <- 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6)
variance <- 1^2*(1/6) + 2^2*(1/6) + 3^2*(1/6) + 4^2*(1/6) + 5^2*(1/6) + 6^2*(1/6) - expected_value^2
print(expected_value)
```

`## [1] 3.5`

`variance`

`## [1] 2.916667`

- Below, you see several normal distributions with differing means \(\mu\) and standard deviations \(\sigma\). The red, unnumbered distribution is the so-called standard normal distribution; it has a mean of 0 and a standard deviation of 1. Compare each distribution below (1-4) to the standard normal distribution and think about how the parameters of the standard normal were changed. Also, think about which distribution (1-4) has the smallest/largest mean and the smallest/largest standard deviation.

Distribution 1 (\(\mu\) = 5, \(\sigma\) = 1): larger mean, same standard deviation

Distribution 2 (\(\mu\) = 0, \(\sigma\) = 3): same mean, larger standard deviation

Distribution 3 (\(\mu\) = 6, \(\sigma\) = 2): larger mean, larger standard deviation

Distribution 4 (\(\mu\) = -6, \(\sigma\) = 0.5): smaller mean, smaller standard deviation

Composite random variables are random variables generated by mathematical operations conjoining other random variables. For example, if \(X\) and \(Y\) are random variables, then we can define a new derived random variable \(Z\) using notation like:

\[Z = X + Y\]

This notation looks innocuous but is conceptually tricky, yet ultimately very powerful. On the face of it, it looks as if we are using `+` to add two functions. But a sampling-based perspective makes this quite intuitive. We can think of \(X\) and \(Y\) as large samples, representing the probability distributions in question. Then we build a sample of \(Z\) by just adding the elements of \(X\) and \(Y\) pairwise. (If the samples are of different size, just add a random element of \(Y\) to each element of \(X\).)

Consider the following concrete example. \(X\) is the probability distribution of rolling a fair die with six sides. \(Y\) is the probability distribution of flipping a biased coin that lands heads (represented as number 1) with probability 0.75. The derived probability distribution \(Z = X + Y\) can be approximately represented by samples derived as follows:

```
library(tidyverse) # provides tibble(), %>%, count(), mutate() and ggplot()

n_samples <- 1e6
# `n_samples` rolls of a fair die
samples_x <- sample(
  1:6,
  size = n_samples,
  replace = TRUE
)
# `n_samples` flips of a biased coin
samples_y <- sample(
  c(0, 1),
  prob = c(0.25, 0.75),
  size = n_samples,
  replace = TRUE
)
# element-wise sum gives samples of Z = X + Y
samples_z <- samples_x + samples_y

# plot the proportion of each outcome among the samples of Z
tibble(outcome = samples_z) %>%
  dplyr::count(outcome) %>%
  mutate(n = n / sum(n)) %>%
  ggplot(aes(x = outcome, y = n)) +
  geom_col() +
  labs(y = "proportion")
```
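For comparison with the sampled proportions, the exact distribution of \(Z\) can also be computed by enumerating all combinations of values. This sketch assumes that the die roll and the coin flip are independent, as they are in the sampling code; the names `combos` and `f_z` are illustrative.

```r
# exact probability mass functions of X (fair die) and Y (biased coin)
x <- 1:6
f_x <- rep(1 / 6, times = 6)
y <- c(0, 1)
f_y <- c(0.25, 0.75)
# all combinations of values of X and Y
combos <- expand.grid(x = x, y = y)
# joint probability of each combination, assuming independence
combos$prob <- f_x[match(combos$x, x)] * f_y[match(combos$y, y)]
# P(Z = z): sum the joint probabilities of all combinations with x + y = z
f_z <- tapply(combos$prob, combos$x + combos$y, sum)
f_z
```

The outcomes 2 to 6 each receive probability 1/6, while the extreme outcomes 1 and 7 receive 1/24 and 1/8, respectively, since each of them can arise in only one way.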

^{42} This is not immediately obvious from our definition, but it is intuitive and you can derive it.