7.3 Conditional probability

Let us assume that a probability distribution \(P \in \Delta(\Omega)\) and events \(A,B \subseteq \Omega\) are given. The conditional probability of \(A\) given \(B\), written as \(P(A \mid B)\), gives the probability of \(A\) on the assumption that \(B\) is true.³⁹ It is defined as:

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]

Conditional probabilities are only defined when \(P(B) > 0\).⁴⁰

Example. If a die is unbiased, each of its six faces is equally likely to come up after a toss. The event \(B = \{\) ⚀, ⚂, ⚄ \(\}\) that the tossed number is odd has probability \(P(B) = \frac{1}{2}\). The event \(A = \{\) ⚂, ⚃, ⚄, ⚅ \(\}\) that the tossed number is bigger than two has probability \(P(A) = \frac{2}{3}\). The probability that the tossed number is bigger than two and odd is \(P(A \cap B) = P(\{\) ⚂, ⚄ \(\}) = \frac{1}{3}\). The conditional probability of tossing a number that is bigger than two, given that the toss is odd, is \(P(A \mid B) = \frac{1 / 3}{1 / 2} = \frac{2}{3}\).

Algorithmically, conditioning on \(B\) first rules out all outcomes in which \(B\) is not true and then renormalizes the probabilities assigned to the remaining outcomes in such a way that their relative probabilities remain unchanged. Given this, another way of interpreting conditional probability is that \(P(A \mid B)\) is what a rational agent should believe about \(A\) after observing (nothing more than) that \(B\) is true. The agent rules out, possibly hypothetically, that \(B\) is false, but otherwise does not change opinion about the relative probabilities of anything that is compatible with \(B\).
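The rule-out-and-renormalize reading can be sketched in a few lines of plain JavaScript (not WebPPL). This is only an illustration using the die example from above. The representation of a distribution as an object, and the helper names `prob` and `conditionOn`, are assumptions made for this sketch.

```javascript
// Assumed representation: an object mapping each outcome to its probability.
// Here: the uniform distribution over the six faces of an unbiased die.
const die = { 1: 1 / 6, 2: 1 / 6, 3: 1 / 6, 4: 1 / 6, 5: 1 / 6, 6: 1 / 6 };

// Probability of an event, where the event is given as a predicate on outcomes.
function prob(dist, event) {
  return Object.keys(dist)
    .filter((outcome) => event(Number(outcome)))
    .reduce((sum, outcome) => sum + dist[outcome], 0);
}

// Condition on event B: drop all outcomes outside B, then divide the
// remaining probabilities by P(B) so that they sum to one again.
function conditionOn(dist, eventB) {
  const pB = prob(dist, eventB);
  if (pB === 0) {
    throw new Error("conditional probability is undefined when P(B) = 0");
  }
  const conditioned = {};
  for (const outcome of Object.keys(dist)) {
    if (eventB(Number(outcome))) {
      conditioned[outcome] = dist[outcome] / pB;
    }
  }
  return conditioned;
}

const isOdd = (n) => n % 2 === 1;        // event B
const biggerThanTwo = (n) => n > 2;      // event A

const dieGivenOdd = conditionOn(die, isOdd);
const pAGivenB = prob(dieGivenOdd, biggerThanTwo);

console.log(dieGivenOdd); // only the odd faces remain, each with probability close to 1/3
console.log(pAGivenB);    // close to 2/3, up to floating-point rounding
```

Because every surviving outcome is divided by the same number \(P(B)\), the ratios between their probabilities are the same before and after conditioning.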

7.3.1 Bayes rule

Looking back at the joint-probability distribution in Table 7.1, the conditional probability \(P(\text{black} \mid \text{heads})\) of drawing a black ball, given that the initial coin flip showed heads, can be calculated as follows:

\[ P(\text{black} \mid \text{heads}) = \frac{P(\text{black} , \text{heads})}{P(\text{heads})} = \frac{0.1}{0.5} = 0.2 \]

This calculation, however, is more work than necessary. We can read off the conditional probability directly from the way the flip-and-draw scenario was set up. After flipping heads, we draw from urn 1, which has \(k=2\) out of \(N=10\) black balls, so clearly: if the initial flip comes up heads, then the probability of a black ball is \(0.2\). Indeed, in a step-wise random generative process like the flip-and-draw scenario, some conditional probabilities are very clear, and sometimes given by definition. These are, usually, the conditional probabilities that define how the process unfolds forward in time, so to speak.

Bayes rule expresses a conditional probability in terms of the “reversed” conditional probability:

\[P(B \mid A) = \frac{P(A \mid B) \times P(B)}{P(A)}\]

Bayes rule is a straightforward corollary of the definition of conditional probabilities, according to which \(P(A \cap B) = P(A \mid B) \times P(B)\), so that:

\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B) \times P(B)}{P(A)} \]

Bayes rule allows for reasoning backward from observed effects to their likely underlying causes. When we have a feed-forward model of how unobservable causes probabilistically constrain observable outcomes, Bayes rule allows us to draw inferences about latent/unobservable variables based on the observation of their downstream effects.

Consider yet again the flip-and-draw scenario. But now assume that Jones flipped the coin and drew a ball. We see that it is black. What is the probability that it was drawn from urn 1, or equivalently, that the coin landed heads? It is not \(P(\text{heads}) = 0.5\), the so-called prior probability of the coin landing heads. It is a conditional probability, also called the posterior probability,⁴¹ namely \(P(\text{heads} \mid \text{black})\). But it is not as easy and straightforward to write down as the reverse probability \(P(\text{black} \mid \text{heads})\), which, as we saw above, is an almost trivial part of the setup of the flip-and-draw scenario. It is here that Bayes rule has its purpose:

\[ P(\text{heads} \mid \text{black}) = \frac{P(\text{black} \mid \text{heads}) \times P(\text{heads})}{P(\text{black})} = \frac{0.2 \times 0.5}{0.3} = \frac{1}{3} \]

This result is quite intuitive. Drawing a black ball from urn 2 (i.e., after seeing tails) is twice as likely as drawing a black ball from urn 1 (i.e., after seeing heads). Consequently, after seeing a black ball drawn, and given equal prior probabilities of heads and tails, the probability that the coin landed tails is also twice as large as the probability that it landed heads.
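The denominator \(P(\text{black}) = 0.3\) in this calculation is the marginal probability of drawing a black ball. One way to obtain it, assuming that urn 2 yields a black ball with probability \(0.4\) (which is what "twice as likely" amounts to here), is to sum over both outcomes of the coin flip:

\[ P(\text{black}) = P(\text{black} \mid \text{heads}) \times P(\text{heads}) + P(\text{black} \mid \text{tails}) \times P(\text{tails}) = 0.2 \times 0.5 + 0.4 \times 0.5 = 0.3 \]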

Exercise 7.4

Play around with the following WebPPL implementation of the flip-and-draw scenario, which calculates the posterior distribution over coin flip outcomes given that we observed the draw of a black ball. Change the parameters of the scenario and try to build intuitions about how your changes will affect the resulting posterior distribution.

// you can play around with the values of these variables
var coin_bias = 0.5          // coin bias
var prob_black_urn_1 = 0.2   // probability of drawing "black" from urn 1 
var prob_black_urn_2 = 0.4   // probability of drawing "black" from urn 2

///fold:
// flip-and-draw scenario model
var model = function() {
  var coin_flip = flip(coin_bias) ? "heads" : "tails"  // flip returns true or false
  var prob_black_selected_urn = coin_flip == "heads" ?
    prob_black_urn_1 : prob_black_urn_2
  var ball_color = flip(prob_black_selected_urn) ? "black" : "white"
  condition(ball_color == "black")
  return({coin: coin_flip})
}
// infer model and display as (custom-made) table
var inferred_model = Infer({method: 'enumerate'}, model)
viz(inferred_model)
///

Three possibilities for obtaining a value of 0.7 for the marginal probability of “black”:

prob_black_urn_1 = prob_black_urn_2 = 0.7
coin_bias = 1 and prob_black_urn_1 = 0.7
coin_bias = 0.5, prob_black_urn_1 = 0.8 and prob_black_urn_2 = 0.6
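To check such parameter settings without running inference, the marginal and the posterior can also be computed exactly. The following is a minimal sketch in plain JavaScript (not WebPPL). The function `flipAndDraw` is a hypothetical helper written for this illustration, with parameters that mirror the variables of the model above. In the second setting the value for urn 2 is arbitrary, since that urn is never used when the coin always lands heads.

```javascript
// Exact marginal P(black) and posterior P(heads | black) for the
// flip-and-draw scenario. Assumes the marginal is non-zero.
function flipAndDraw(coinBias, probBlackUrn1, probBlackUrn2) {
  const jointHeadsBlack = coinBias * probBlackUrn1;        // P(black, heads)
  const jointTailsBlack = (1 - coinBias) * probBlackUrn2;  // P(black, tails)
  const marginalBlack = jointHeadsBlack + jointTailsBlack; // P(black)
  return {
    marginalBlack: marginalBlack,
    posteriorHeads: jointHeadsBlack / marginalBlack,       // Bayes rule
  };
}

// [coin_bias, prob_black_urn_1, prob_black_urn_2]
const settings = [
  [0.5, 0.7, 0.7],
  [1.0, 0.7, 0.2], // urn 2 value chosen arbitrarily; it has no influence here
  [0.5, 0.8, 0.6],
];

const results = settings.map(([bias, urn1, urn2]) => flipAndDraw(bias, urn1, urn2));
results.forEach((result, i) => {
  console.log(settings[i], result.marginalBlack.toFixed(3), result.posteriorHeads.toFixed(3));
});
```

All three settings give the same marginal probability of “black”, but the posterior probability of heads differs between them: the marginal alone does not determine what a black ball tells us about the coin.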

Suppose that we know that around 6% of the population has statisticositis, a rare disease that makes you allergic to fallacious statistical reasoning. A new test has been developed to diagnose statisticositis but it is not infallible. The specificity of the test (the probability that the test result is negative when the subject really does not have statisticositis) is 98%. The sensitivity of the test (the probability that the test result is positive when the subject really does have statisticositis) is 95%. When you take this test and it gives a negative test result, how likely is it that you do not have statisticositis?

First, let’s abbreviate the test result being negative or positive as \(\overline{T}\) and \(T\) and actual statisticositis as \(\overline{S}\) and \(S\). We want to calculate \(P(\overline{S} \mid \overline{T})\). According to Bayes rule, \(P(\overline{S} \mid \overline{T}) = \frac{P(\overline{T} \mid \overline{S}) P(\overline{S})} {P(\overline{T})}\). We are given that \(P(\overline{T} \mid \overline{S}) = 0.98\), \(P(\overline{T} \mid S) = 1 - P(T \mid S) = 0.05\) and \(P(\overline{S}) = 1 - P(S) = 0.94\). Furthermore, \(P(\overline{T}) = P(\overline{T},S) + P(\overline{T},\overline{S}) = P(\overline{T} \mid S) P(S) + P(\overline{T} \mid \overline{S}) P(\overline{S}) = 0.9242\). Putting this all together, we get \(P(\overline{S} \mid \overline{T}) \approx 99.7 \%\). So, given a negative test result, you can be pretty certain that you do not have statisticositis.
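The arithmetic of this solution can be retraced step by step. The sketch below is plain JavaScript (not WebPPL); the variable names are chosen for this illustration only, and the numbers are the ones given in the exercise.

```javascript
const prevalence = 0.06;   // P(S)
const specificity = 0.98;  // P(not T | not S)
const sensitivity = 0.95;  // P(T | S)

const pHealthy = 1 - prevalence;       // P(not S)
const pNegGivenSick = 1 - sensitivity; // P(not T | S)

// Marginal probability of a negative test result: sum over both disease states.
const pNegative = pNegGivenSick * prevalence + specificity * pHealthy;

// Bayes rule: P(not S | not T)
const pHealthyGivenNegative = (specificity * pHealthy) / pNegative;

console.log(pNegative.toFixed(4));             // 0.9242
console.log(pHealthyGivenNegative.toFixed(4)); // 0.9968
```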

Check out this website for more details on these calculations in the context of a more serious application.

Excursion: Bayes rule for data analysis

In later chapters, we will use Bayes rule for data analysis. The flip-and-draw scenario structurally “reflects” what will happen later. Think of the color of the ball drawn as the data \(D\) which we observe. Think of the coin as a latent parameter \(\theta\) of a statistical model. Bayes rule for data analysis then looks like this:

\[P(\theta \mid D) = \frac{P(D \mid \theta) \times P(\theta)}{P(D)}\]

We will discuss this at length in Chapter 8 and thereafter.

7.3.2 Stochastic (in-)dependence

Event \(A\) is stochastically independent of \(B\) if, intuitively speaking, learning \(B\) does not change one’s beliefs about \(A\), i.e., \(P(A \mid B) = P(A)\). If \(A\) is stochastically independent of \(B\), then \(B\) is stochastically independent of \(A\) because:

\[ \begin{aligned} P(B \mid A) & = \frac{P(A \mid B) \ P(B)}{P(A)} && \text{[Bayes rule]} \\ & = \frac{P(A) \ P(B)}{P(A)} && \text{[by assumption of independence]} \\ & = P(B) && \text{[cancellation]} \\ \end{aligned} \]

For example, imagine a flip-and-draw scenario where the initial coin flip has a bias of \(0.8\) towards heads, but each of the two urns has the same number of black balls, namely \(3\) black and \(7\) white balls. Intuitively and formally, the probability of drawing a black ball is then independent of the outcome of the coin flip; learning that the coin landed heads does not change our beliefs about how likely it is that the subsequent draw will result in a black ball. The probability table for this example is in Table 7.2.

Table 7.2: Joint probability table for a flip-and-draw scenario where the coin has a bias of \(0.8\) towards heads and where each of the two urns holds \(3\) black and \(7\) white balls.
	heads	tails	\(\Sigma\) rows
black	\(0.8 \times 0.3 = 0.24\)	\(0.2 \times 0.3 = 0.06\)	0.3
white	\(0.8 \times 0.7 = 0.56\)	\(0.2 \times 0.7 = 0.14\)	0.7
\(\Sigma\) columns	0.8	0.2	1.0

Independence shows in Table 7.2 in the fact that the probability in each cell is the product of the two marginal probabilities. This is a direct consequence of stochastic independence:

Proposition 7.1 (Probability of conjunction of stochastically independent events) For any pair of events \(A\) and \(B\) with non-zero probability:

\[P(A \cap B) = P(A) \ P(B) \, \ \ \ \ \text{[if } A \text{ and } B \text{ are stoch. independent]} \]

Proof. By assumption of independence, it holds that \(P(A \mid B) = P(A)\). But then:

\[ \begin{aligned} P(A \cap B) & = P(A \mid B) \ P(B) && \text{[def. of conditional probability]} \\ & = P(A) \ P(B) && \text{[by assumption of independence]} \end{aligned} \]
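The product criterion gives a simple numerical check for independence in a joint probability table. The sketch below is plain JavaScript (not WebPPL); the helper `isIndependent` and the nested-object layout of the tables are assumptions made for this illustration. The first table holds the values of Table 7.2. The second is reconstructed from the parameters of the original flip-and-draw scenario (fair coin, black-ball probabilities \(0.2\) and \(0.4\)) and serves as a contrast case.

```javascript
// Returns true if every cell of the joint table equals the product of its
// row marginal and column marginal (up to a small numerical tolerance).
function isIndependent(joint, tolerance = 1e-9) {
  const rows = Object.keys(joint);
  const cols = Object.keys(joint[rows[0]]);
  const rowMarginal = (r) => cols.reduce((sum, c) => sum + joint[r][c], 0);
  const colMarginal = (c) => rows.reduce((sum, r) => sum + joint[r][c], 0);
  return rows.every((r) =>
    cols.every((c) => Math.abs(joint[r][c] - rowMarginal(r) * colMarginal(c)) < tolerance)
  );
}

// Values from Table 7.2 (coin bias 0.8, both urns with 3 black and 7 white balls).
const table72 = {
  black: { heads: 0.24, tails: 0.06 },
  white: { heads: 0.56, tails: 0.14 },
};

// Reconstructed from the original scenario: fair coin, urn 1 with
// probability 0.2 of black, urn 2 with probability 0.4 of black.
const originalScenario = {
  black: { heads: 0.1, tails: 0.2 },
  white: { heads: 0.4, tails: 0.3 },
};

console.log(isIndependent(table72));          // true
console.log(isIndependent(originalScenario)); // false
```

In the original scenario the check fails already in the first cell: \(P(\text{black}, \text{heads}) = 0.1\), whereas the product of the marginals is \(0.3 \times 0.5 = 0.15\).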

References

Halpern, Joseph Y. 2003. Reasoning about Uncertainty. MIT Press.

³⁹ We also verbalize this as “the conditional probability of \(A\) conditioned on \(B\).”
⁴⁰ Updating with events that have probability zero entails far more severe adjustments of the underlying belief system than just ruling out information hitherto considered possible. Formal systems that capture such belief revision are studied in formal epistemology. Halpern (2003) gives a good comprehensive treatment.
⁴¹ The terms prior and posterior make sense when we think about an agent’s belief state before (prior to) and after (posterior to) an observation.