C.2 The Maximum Entropy Principle
C.2.1 Information Entropy
Entropy is a measure of the information content of an outcome of \(X\). A less probable outcome conveys more information than a more probable one. Thus, entropy can also be viewed as a measure of uncertainty. When the goal is to find a distribution that is as ignorant as possible, entropy should consequently be maximal. Formally, entropy is defined as follows: If \(X\) is a discrete random variable with distribution \(P(X=x_i)=p_i\), then the entropy of \(X\) is \[H(X)=-\sum_{i} p_i \log p_i.\]
If \(X\) is a continuous random variable with probability density \(p(x)\) then the differential entropy of \(X\) is \[H(X)=-\int_{-\infty}^{+\infty} p(x) \log p(x) dx.\]
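A minimal numerical illustration of both definitions may help; the following Python sketch (using the natural logarithm, so entropy is measured in nats; the concrete distributions are arbitrary examples, not taken from the text) computes the entropy of a few discrete distributions and the differential entropy of a uniform density on a grid.

```python
import numpy as np

def entropy(p):
    # H(X) = -sum_i p_i * log(p_i), natural logarithm (entropy in nats)
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))     # fair coin: log(2) ≈ 0.693
print(entropy([0.9, 0.1]))     # biased coin: ≈ 0.325, less uncertainty
print(entropy([0.25] * 4))     # four equally likely outcomes: log(4) ≈ 1.386

# differential entropy of a uniform density on [a, b] equals log(b - a)
a, b, k = 0.0, 2.0, 1000
p_x = np.full(k, 1.0 / (b - a))                     # constant density on a grid of k cells
print(-np.sum(p_x * np.log(p_x)) * (b - a) / k)     # ≈ log(2) ≈ 0.693
```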
From which considerations is this definition of entropy derived? There exist various approaches that all arrive at the same answer: the definition of entropy stated above. The most frequently cited derivation is Shannon’s theorem; another, perhaps more intuitive one is the Wallis derivation. Jaynes (2003) describes both approaches in detail. The following short account of both derivations is taken from Jaynes (2003).
C.2.1.1 Shannon’s theorem
Shannon’s approach starts by stating conditions that a measure of the amount of uncertainty \(H_n\) has to satisfy.
- It is possible to set up some kind of association between the amount of uncertainty and real numbers.
- \(H_n\) is a continuous function of \(p_i\). Otherwise, an arbitrarily small change in the probability distribution would lead to a big change in the amount of uncertainty.
- \(H_n\) should correspond to common sense in that, when there are many possibilities, we are more uncertain than when there are only a few. This condition has the effect that, in case the \(p_i\) are all equal, the quantity \(h(n)=H_n(\frac{1}{n},...,\frac{1}{n})\) is a monotonically increasing function of \(n\).
- \(H_n\) is consistent in that, when there is more than one way of working out its value, we must get the same answer.
Under these assumptions, the resulting unique measure of uncertainty of a probability distribution \(p\) turns out to be just the negative of the average log-probability:
\[H(p)=-\sum_i p_i \log(p_i).\] Accepting this interpretation of entropy, it follows that the distribution \((p_1,...,p_n)\) which maximizes the above equation, subject to constraints imposed by the available information, will represent the most honest description of what the model knows about the propositions \((A_1,...,A_n)\).
The function \(H\) is called the entropy, or the information entropy of the distribution \(\{p_i\}\).
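For equal probabilities \(p_i=\frac{1}{n}\), the formula reduces to \(h(n)=\log n\), which is indeed a monotonically increasing function of \(n\), as required by the third condition above. A minimal check (a Python sketch, not part of the original derivation):

```python
import numpy as np

def h(n):
    # entropy of the uniform distribution over n equally likely outcomes
    p = np.full(n, 1.0 / n)
    return -np.sum(p * np.log(p))

for n in (2, 4, 8, 16):
    print(n, h(n), np.log(n))   # h(n) coincides with log(n) and grows with n
```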
C.2.1.2 The Wallis derivation
A second and perhaps more intuitive approach to deriving entropy was suggested by G. Wallis. We are given information \(I\), which is to be used in assigning probabilities \(\{p_1,...,p_m\}\) to \(m\) different propositions. We have a total amount of probability
\[\sum_{i=1}^{m} p_i =1\]
to allocate among them.
The problem can be stated as follows. Choose some integer \(n\gg m\), and imagine that we have \(n\) little quanta of probability, each of magnitude \(\delta=\frac{1}{n}\), to distribute in any way we see fit.
Suppose we were to scatter these quanta at random among the \(m\) choices (penny-pitch game into \(m\) equal boxes). If we simply toss these quanta of probability at random, so that each box has an equal probability of getting them, nobody can claim that any box is being unfairly favored over any other.
If we do this and the first box receives exactly \(n_1\) quanta, the second \(n_2\) quanta etc., we will say the random experiment has generated the probability assignment:
\[p_i=n_i\delta=\frac{n_i}{n}, \textrm{ with } i=1,2,...,m.\]
The probability that this will happen is the multinomial distribution:
\[m^{-n} \frac{n!}{n_1!\cdot...\cdot n_m!}.\]
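As a quick numerical illustration (a hypothetical small case, not from the source): for \(n=6\) quanta scattered over \(m=3\) boxes, the probability of the particular assignment \((n_1,n_2,n_3)=(3,2,1)\) can be evaluated directly. The sketch below computes the formula in log-space via `math.lgamma` to avoid large factorials.

```python
import math

def multinomial_probability(counts, m):
    # P = m^(-n) * n! / (n_1! * ... * n_m!), computed in log-space via lgamma
    n = sum(counts)
    log_p = (-n * math.log(m)
             + math.lgamma(n + 1)
             - sum(math.lgamma(c + 1) for c in counts))
    return math.exp(log_p)

# 6 quanta scattered over m = 3 boxes, landing as (3, 2, 1)
print(multinomial_probability([3, 2, 1], m=3))   # 60 / 3**6 ≈ 0.0823
```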
Now imagine that we repeatedly scatter the \(n\) quanta at random among the \(m\) boxes. Each time we do this we examine the resulting probability assignment. If it happens to conform to the information \(I\), we accept it; otherwise, we reject it and try again. We continue until some probability assignment \(\{p_1,...,p_m\}\) is accepted.
What is the most likely probability distribution to result from this game? It is the one which maximizes
\[W=\frac{n!}{n_1! \cdot ... \cdot n_m!}\]
subject to whatever constraints are imposed by the information \(I\).
We can refine this procedure by using smaller quanta, i.e., large \(n\). By using Stirling’s approximation
\[n!\sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n,\] and taking the logarithm of it,
\[\log(n!) \sim \log\sqrt{2\pi n}+n\log\left(\frac{n}{e}\right),\] we have
\[\log(n!) \sim \log\sqrt{2\pi n}+n\log(n) - n.\]
Taking the logarithm of \(W\) as well, substituting each factorial by Stirling’s approximation, and letting \(n\rightarrow\infty\) finally gives the definition of information entropy, as derived by Shannon’s theorem:
\[\frac{1}{n} \log(W) \rightarrow -\sum_{i=1}^{m}p_i\log(p_i)=H(p_1,...,p_m).\]
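The limit \(\frac{1}{n}\log(W) \rightarrow H\) can also be checked numerically. The following sketch (an illustration for an arbitrarily chosen assignment \(p=(0.5,0.3,0.2)\)) evaluates \(\log(W)\) via `math.lgamma` for increasing \(n\) and compares the result with \(H(p)\).

```python
import math

def log_W(counts):
    # log of the multinomial coefficient n! / (n_1! * ... * n_m!), via lgamma
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

p = [0.5, 0.3, 0.2]                          # a fixed probability assignment
for n in (10**2, 10**4, 10**6):
    counts = [round(p_i * n) for p_i in p]   # n_i quanta per box
    print(n, log_W(counts) / n)              # approaches H(p) as n grows

print("H =", -sum(p_i * math.log(p_i) for p_i in p))   # ≈ 1.0297
```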
To sum it up: Entropy is a measure of uncertainty. The higher the entropy of a random variable \(X\), the more uncertainty it incorporates. When the goal is to find a maximal ignorance distribution, this goal can consequently be translated into a maximization problem: find the distribution with maximal entropy subject to the existing constraints. This will be the topic of the next section.
C.2.2 Deriving Probability Distributions using the Maximum Entropy Principle
The maximum entropy principle is a means of deriving probability distributions given certain constraints and the assumption of maximizing entropy. One technique for solving this maximization problem is the Lagrange multiplier technique.
C.2.2.1 Lagrangian multiplier technique
Given a multivariable function \(f(x,y,...)\) and constraints of the form \(g(x,y,...)=c\), where \(g\) is another multivariable function with the same input space as \(f\) and \(c\) is a constant:
In order to minimize (or maximize) the function \(f\), consider the following steps (written here for a single variable, i.e., \(f=f(x)\)):
- Introduce a new variable \(\lambda\), called the Lagrange multiplier, and define a new function \(\mathcal{L}\) of the form:
\[\mathcal{L}(x,\lambda)=f(x)+\lambda (g(x)-c).\]
- Set all partial derivatives of the function \(\mathcal{L}\) equal to zero:
\[\frac{\partial \mathcal{L}}{\partial x}=0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda}=0,\]
in order to find the critical points of \(\mathcal{L}\).
- Check each resulting critical point against the stated constraints and evaluate \(f\) at it; this yields the minimum (or maximum) one is searching for.
For more details, see Academy (2019).
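To make the three steps concrete, the following sketch applies them to a hypothetical toy problem, maximizing \(f(x,y)=xy\) subject to \(x+y=8\) (neither the example nor the sympy code comes from the cited reference):

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x * y                       # function to maximize (toy example)
g = x + y                       # constraint function, with c = 8
L = f + lam * (g - 8)           # step 1: define the Lagrangian

# step 2: set all partial derivatives of L to zero and solve
equations = [sp.diff(L, v) for v in (x, y, lam)]
critical_points = sp.solve(equations, (x, y, lam), dict=True)
print(critical_points)          # [{x: 4, y: 4, lambda: -4}]

# step 3: evaluate f at the critical point(s) to identify the maximum
print([f.subs(s) for s in critical_points])   # [16]
```

The same mechanics carry over to the entropy problems below; the only complication there is that the optimization runs over a whole function \(p(x)\) instead of a finite set of variables.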
C.2.2.2 Example 1: Derivation of maximum entropy pdf with no other constraints
For more details, see Finlayson (2017) and Keng (2017).
Suppose a random variable about whose probability distribution we have absolutely no information, besides the fact that its density should be a pdf and thus integrate to 1. We ask the following:
What type of probability density distribution gives maximum entropy when the random variable is bounded by a finite interval, say \(a\leq X \leq b\)? (Reza 1994)
We assume that the maximum ignorance distribution is the one with maximum entropy. It minimizes the prior information in a distribution and is therefore the most conservative choice.
For the continuous case, entropy, the measure of uncertainty, is defined as
\[H(x)=-\int_{a}^{b}p(x) \log(p(x))dx,\]
subject to the mentioned constraint that the total probability is one (as \(p(x)\) is a pdf):
\[\int_{a}^{b}p(x)dx =1.\]
Rewriting this into the form of a Lagrangian equation gives
\[\mathcal{L}=-\int_{a}^{b}p(x) \log(p(x))dx + \lambda \left(\int_{a}^{b}p(x)dx-1 \right).\]
The next step is to find the function \(p(x)\) that makes the Lagrangian stationary. Because we are optimizing over a function rather than a finite number of variables, this requires the calculus of variations (Keng 2017).
First, we differentiate \(\mathcal{L}\) with respect to \(p(x)\) and set the result to zero:
\[\frac{\partial \mathcal{L}}{\partial p(x)}=0,\] \[-1-\log(p(x))+\lambda=0,\] \[p(x)=e^{(\lambda-1)}.\]
Second, the result for \(p(x)\) has to satisfy the stated constraint:
\[\int_{a}^{b} p(x)dx=1,\]
\[\int_{a}^{b} e^{\lambda-1} dx=1.\]
Solving this equation with respect to \(\lambda\) gives:
\[\lambda=1+\log\left(\frac{1}{b-a}\right).\]
Taking both solutions together, we get the following probability density function:
\[p(x)=e^{\lambda-1}=e^{\left(1+\log\left(\frac{1}{b-a}\right)\right)-1}=e^{\log\left(\frac{1}{b-a}\right)},\]
\[p(x)= \frac{1}{b-a}.\]
And this is the uniform distribution on the interval \([a,b]\). The answer to the above question is:
The maximum entropy distribution is associated with a random variable that is distributed as a uniform probability density distribution between \(a\) and \(b\).
This should not be too unexpected. It is quite intuitive that a uniform distribution is the maximal ignorance distribution (when no other constraints are imposed). The next example will be more exciting.
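The analytic result can also be checked numerically. The sketch below (a minimal illustration, assuming numpy and scipy are available; the interval \([a,b]=[0,2]\) and the grid size are arbitrary choices) discretizes the interval, minimizes the negative entropy subject only to the normalization constraint, and recovers an approximately constant density of \(\frac{1}{b-a}=0.5\).

```python
import numpy as np
from scipy.optimize import minimize

a, b, k = 0.0, 2.0, 20                 # interval [a, b], discretized into k cells
dx = (b - a) / k

def neg_entropy(p):
    # discretized version of -H = integral of p(x) * log(p(x)) dx
    return np.sum(p * np.log(p)) * dx

# the only constraint: the density integrates to one
constraints = [{"type": "eq", "fun": lambda p: np.sum(p) * dx - 1.0}]

rng = np.random.default_rng(0)
p0 = rng.uniform(0.1, 1.0, k)          # arbitrary positive starting density
p0 /= p0.sum() * dx                    # normalized starting point

result = minimize(neg_entropy, p0, bounds=[(1e-9, None)] * k, constraints=constraints)
print(result.x.round(3))               # every cell ≈ 1 / (b - a) = 0.5
```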
C.2.2.3 Example 2: Derivation of maximum entropy pdf with given mean \(\mu\) and variance \(\sigma^2\)
Suppose a random variable \(X\) with preassigned mean \(\mu\) and standard deviation \(\sigma\). Again the question is: Which function \(p(x)\) maximizes the entropy \(H(X)\)?
In the current case, the entropy to be maximized is
\[H(X)=-\int_{-\infty}^{\infty} p(x) \log p(x)dx,\]
subject to the constraint that \(p(x)\) should be a pdf,
\[\int_{-\infty}^{\infty} p(x)dx = 1,\] and that \(\mu\) and \(\sigma\) are given (whereby only one further constraint is needed, as \(\mu\) already appears in the definition of \(\sigma^2\)):
\[\int_{-\infty}^{\infty}(x-\mu)^2 p(x) dx = \sigma^2.\]
According to the above-mentioned technique, these formulas are combined into the Lagrangian equation:
\[\mathcal{L}= -\int_{-\infty}^{\infty} p(x) \log p(x)dx + \lambda_0\left(\int_{-\infty}^{\infty} p(x)dx - 1 \right) + \lambda_1\left(\int_{-\infty}^{\infty}(x-\mu)^2 p(x) dx - \sigma^2 \right).\]
Next, \(\mathcal{L}\) will be partially differentiated with respect to \(p(x)\):
\[\frac{\partial \mathcal{L}}{\partial p(x)}=0,\] \[-(1+\log p(x))+\lambda_0+\lambda_1 (x-\mu)^2=0,\]
\[p(x)=e^{\lambda_0+\lambda_1 (x-\mu)^2-1}.\]
Further, we have to make sure that the result satisfies the stated constraints:
\[\int_{-\infty}^{\infty} e^{\lambda_0+\lambda_1 (x-\mu)^2-1} dx = 1,\]
and
\[\int_{-\infty}^{\infty}(x-\mu)^2 e^{\lambda_0+\lambda_1 (x-\mu)^2-1} dx = \sigma^2.\]
For the first constraint, we get
\[e^{\lambda_0-1} \sqrt{-\frac{\pi}{\lambda_1}} = 1,\]
and for the second constraint
\[e^{\lambda_0-1}\, \frac{1}{2(-\lambda_1)} \sqrt{-\frac{\pi}{\lambda_1}} = \sigma^2.\]
Dividing the second equation by the first yields
\[\lambda_1=\frac{-1}{2\sigma^2},\]
and substituting this back into the first constraint gives
\[e^{\lambda_0-1} = \sqrt{\frac{1}{2\pi}} \frac{1}{\sigma}.\]
Putting it all together, we can write:
\[p(x)=e^{\lambda_0+\lambda_1 (x-\mu)^2-1}=e^{\lambda_0-1}e^{\lambda_1 (x-\mu)^2},\]
substituting the solutions for \(e^{\lambda_0-1}\) and \(\lambda_1\):
\[p(x)= \sqrt{\frac{1}{2\pi}} \frac{1}{\sigma} e^{\frac{-1}{2\sigma^2}(x-\mu)^2},\]
finally we can rearrange the terms a bit and get:
\[p(x)= \frac{1}{\sigma\sqrt{2\pi}}\exp{\left(\frac{-1}{2}\left(\frac{(x-\mu)^2}{\sigma^2}\right)\right)},\]
the Gaussian probability density distribution.
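This result, too, can be verified numerically. Rather than optimizing over \(p(x)\) directly, the sketch below (a minimal illustration, assuming numpy and scipy; the grid and the starting values are arbitrary choices) inserts the derived form \(p(x)=e^{\lambda_0+\lambda_1 (x-\mu)^2-1}\) into the two constraints and solves for \(\lambda_0\) and \(\lambda_1\) with `scipy.optimize.fsolve`. The solution reproduces \(\lambda_1=-\frac{1}{2\sigma^2}\) and the Gaussian density.

```python
import numpy as np
from scipy.optimize import fsolve

mu, sigma = 0.0, 1.0
x = np.linspace(-8.0, 8.0, 401)         # grid wide enough to stand in for (-inf, inf)
dx = x[1] - x[0]

def p(lambdas):
    lam0, lam1 = lambdas
    # functional form obtained from differentiating the Lagrangian
    return np.exp(lam0 + lam1 * (x - mu) ** 2 - 1)

def residuals(lambdas):
    density = p(lambdas)
    return [np.sum(density) * dx - 1.0,                           # pdf constraint
            np.sum((x - mu) ** 2 * density) * dx - sigma ** 2]    # variance constraint

lam0, lam1 = fsolve(residuals, x0=[0.0, -1.0])
print(lam1)                             # ≈ -1 / (2 * sigma**2) = -0.5

gaussian = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
print(np.max(np.abs(p((lam0, lam1)) - gaussian)))   # ≈ 0: the MaxEnt density is Gaussian
```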
To sum it up:
If one is to infer a probability distribution given certain constraints, out of all distributions \(\{p_i\}\) compatible with them, one should pick the distribution \(\{p_i^*\}\) having the largest value of \(H\) (De Martino and De Martino 2018). In other words, a maximum entropy distribution remains completely noncommittal with respect to any feature that does not appear explicitly in the constraints subject to which it has been computed.
An overview of Maximum Entropy distributions can be found on Wikipedia.