In probability and statistics, an exponential family is a class of probability distributions sharing a certain form, specified below. It is said that such distributions belong to the exponential class of density functions. This special form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural distributions to consider. The concept of exponential families is credited to[1] E. J. G. Pitman,[2] G. Darmois,[3] and B. O. Koopman[4] in 1935–6.
The following is a sequence of increasingly general definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.
A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, in the discrete case) can be expressed in the form
$$f_X(x \mid \theta) = h(x) \exp\bigl(\eta(\theta)\, T(x) - A(\theta)\bigr),$$
where T(x), h(x), η(θ), and A(θ) are known functions.
The value θ is called the parameter of the family.
Note that x is often a vector of measurements, in which case T(x) is a function from the space of possible values of x to the real numbers.
If η(θ) = θ, then the exponential family is said to be in canonical form. By defining a transformed parameter η = η(θ), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(x) is multiplied by the reciprocal of that constant.
Further down the page is the example of a normal distribution with unknown mean and known variance.
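As a quick illustration of the single-parameter form, the sketch below writes the Poisson distribution as h(x) exp(η(θ) T(x) − A(θ)) and compares it with the textbook probability mass function. The choice of the Poisson distribution, the function names, and the rate 3.5 are assumptions made for this example only; the decomposition used is h(x) = 1/x!, T(x) = x, η(λ) = log λ, and A(λ) = λ.

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """Textbook Poisson probability mass function."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def expfam_pmf(x: int, lam: float) -> float:
    """The same pmf written as h(x) * exp(eta(theta) * T(x) - A(theta))."""
    h = 1.0 / math.factorial(x)  # base measure h(x)
    eta = math.log(lam)          # natural parameter eta(theta)
    t = x                        # sufficient statistic T(x)
    a = lam                      # log-normalizer A(theta)
    return h * math.exp(eta * t - a)


# Largest discrepancy between the two forms over x = 0, ..., 29 (illustrative rate 3.5).
max_gap = max(abs(poisson_pmf(x, 3.5) - expfam_pmf(x, 3.5)) for x in range(30))
print(max_gap)
```

Because η(λ) = log λ is not the identity, this parametrization is not in canonical form; re-indexing the family by η = log λ, so that A becomes e^η, would make it canonical.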
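The single-parameter definition can be extended to a vector parameter $\boldsymbol\theta = (\theta_1, \theta_2, \ldots, \theta_d)^{\mathsf T}$. A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as
$$f_X(x \mid \boldsymbol\theta) = h(x) \exp\Bigl(\sum_{i=1}^{s} \eta_i(\boldsymbol\theta)\, T_i(x) - A(\boldsymbol\theta)\Bigr).$$
As in the scalar-valued case, the exponential family is said to be in canonical form if $\eta_i(\boldsymbol\theta) = \theta_i$ for all i.
A vector exponential family is said to be curved if the dimension of $\boldsymbol\theta = (\theta_1, \ldots, \theta_d)^{\mathsf T}$ is less than the dimension of the vector $\boldsymbol\eta(\boldsymbol\theta) = (\eta_1(\boldsymbol\theta), \ldots, \eta_s(\boldsymbol\theta))^{\mathsf T}$. That is, if the dimension of the parameter vector is less than the number of functions of the parameter vector in the above representation of the probability density function.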
Further down the page is the example of a normal distribution with unknown mean and variance.
We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.
Suppose H is a non-decreasing function of a real variable and H(x) approaches 0 as x approaches −∞. Then Lebesgue–Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.
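Any member of that exponential family has a cumulative distribution function F satisfying
$$dF(x \mid \boldsymbol\eta) = e^{\boldsymbol\eta^{\mathsf T} T(x) - A(\boldsymbol\eta)}\, dH(x).$$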
If F is a continuous distribution with a density, one can write dF(x) = f(x) dx.
H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then H is a step function (with steps on the support of F).
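In the definitions above, the functions T(x), η(θ), and A(θ) were introduced without interpretation. However, each plays a significant role in the resulting probability distribution: T(x) is a sufficient statistic of the distribution, η is called the natural parameter, and A is the normalization term (the log-partition function) that makes the density integrate to one.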
The normal, exponential, gamma, chi-square, beta, Weibull (if the shape parameter is known), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known parameter r), and geometric distributions are all exponential families. The family of Pareto distributions with a fixed minimum bound also forms an exponential family.
The Cauchy and uniform families of distributions are not exponential families. The Weibull distribution is not an exponential family unless the shape parameter is known. The Laplace family is not an exponential family unless the location parameter (the mean) is fixed, for example at zero.
The following are detailed examples of how some useful distributions can be represented as exponential families.
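As a first example, consider a random variable distributed normally with unknown mean μ and variance 1. The probability density function is then
$$f(x; \mu) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2}.$$
This is a scalar exponential family in canonical form, as can be seen by setting
$$h(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}, \qquad T(x) = x, \qquad A(\mu) = \frac{\mu^2}{2}, \qquad \eta(\mu) = \mu.$$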
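Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / (2\sigma^2)}.$$
This is an exponential family which can be written in canonical form by defining
$$h(x) = \frac{1}{\sqrt{2\pi}}, \qquad T(x) = \left(x, x^2\right)^{\mathsf T}, \qquad \boldsymbol\eta = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)^{\mathsf T},$$
$$A(\boldsymbol\eta) = \frac{\mu^2}{2\sigma^2} + \ln\sigma = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\ln\left(-\frac{1}{2\eta_2}\right).$$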
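The canonical form of the normal family can be checked numerically. The sketch below assumes the standard choice h(x) = 1/√(2π), T(x) = (x, x²), η = (μ/σ², −1/(2σ²)), and A(η) = −η₁²/(4η₂) + ½ ln(−1/(2η₂)); the function names and test values are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Usual normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def natural_params(mu: float, sigma: float) -> tuple[float, float]:
    """Map (mu, sigma) to the natural parameters (eta1, eta2)."""
    return mu / sigma ** 2, -1.0 / (2 * sigma ** 2)


def log_normalizer(eta1: float, eta2: float) -> float:
    """A(eta) = -eta1^2 / (4 eta2) + 0.5 * ln(-1 / (2 eta2)); requires eta2 < 0."""
    return -eta1 ** 2 / (4 * eta2) + 0.5 * math.log(-1.0 / (2 * eta2))


def expfam_pdf(x: float, mu: float, sigma: float) -> float:
    """Same density as h(x) * exp(eta . T(x) - A(eta)) with T(x) = (x, x^2)."""
    eta1, eta2 = natural_params(mu, sigma)
    h = 1.0 / math.sqrt(2 * math.pi)
    return h * math.exp(eta1 * x + eta2 * x ** 2 - log_normalizer(eta1, eta2))


for x in (-2.0, 0.0, 1.5, 3.0):
    print(x, normal_pdf(x, 1.5, 0.7), expfam_pdf(x, 1.5, 0.7))
```

The two columns printed for each x should agree up to floating-point rounding.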
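As an example of a discrete exponential family, consider the binomial distribution with a known number of trials n. The probability mass function for this distribution is
$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x \in \{0, 1, 2, \ldots, n\}.$$
This can equivalently be written as
$$f(x) = \binom{n}{x} \exp\left(x \log\left(\frac{p}{1 - p}\right) + n \log(1 - p)\right),$$
which shows that the binomial distribution is an exponential family, whose natural parameter is
$$\eta = \log\frac{p}{1 - p}.$$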
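The equivalence of the two binomial expressions can be verified directly. This is a minimal sketch under the assumption that the number of trials n is known, with h(x) equal to the binomial coefficient, T(x) = x, η = log(p/(1 − p)), and A(η) = n log(1 + e^η) = −n log(1 − p); the function names and the values n = 10, p = 0.3 are illustrative.

```python
import math


def binom_pmf(x: int, n: int, p: float) -> float:
    """Textbook binomial probability mass function."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_expfam(x: int, n: int, p: float) -> float:
    """Same pmf as h(x) * exp(eta * x - A(eta)) with eta the log-odds."""
    eta = math.log(p / (1 - p))           # natural parameter
    a = n * math.log1p(math.exp(eta))     # log-normalizer, equal to -n * log(1 - p)
    return math.comb(n, x) * math.exp(eta * x - a)


n, p = 10, 0.3
for x in range(n + 1):
    print(x, binom_pmf(x, n, p), binom_expfam(x, n, p))
```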
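For a family in canonical form, $A(\boldsymbol\eta)$ acts as the cumulant generating function for the sufficient statistic $\mathbf{T}$: the cumulant generating function of $\mathbf{T}$ is $A(\boldsymbol\eta + \mathbf{t}) - A(\boldsymbol\eta)$. A consequence of this is that one can fully understand the mean and covariance structure of $\mathbf{T} = (T_1, T_2, \ldots, T_p)$ by differentiating $A(\boldsymbol\eta)$:
$$E(T_j) = \frac{\partial A(\boldsymbol\eta)}{\partial \eta_j}$$
and
$$\operatorname{cov}(T_i, T_j) = \frac{\partial^2 A(\boldsymbol\eta)}{\partial \eta_i\, \partial \eta_j}.$$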
The first two raw moments and all mixed second moments can be recovered from these two identities. This is often useful when T is a complicated function of the data whose moments are difficult to calculate by integration. As an example, consider a real-valued random variable X with density
$$p_\theta(x) = \frac{\theta e^{-x}}{(1 + e^{-x})^{\theta + 1}}$$
indexed by shape parameter θ ∈ (0, ∞) (this distribution is called the skew-logistic). The density can be rewritten as
$$\frac{e^{-x}}{1 + e^{-x}} \exp\bigl(-\theta \log(1 + e^{-x}) + \log\theta\bigr).$$
Notice this is an exponential family with canonical parameter η = −θ, sufficient statistic T = log(1 + e^{−x}), and normalizing factor A(η) = −log(θ) = −log(−η). So using the first identity,
$$E\bigl(\log(1 + e^{-X})\bigr) = E(T) = \frac{\partial A(\eta)}{\partial \eta} = \frac{\partial}{\partial \eta}\bigl[-\log(-\eta)\bigr] = \frac{1}{-\eta} = \frac{1}{\theta},$$
and using the second identity
$$\operatorname{var}\bigl(\log(1 + e^{-X})\bigr) = \frac{\partial^2 A(\eta)}{\partial \eta^2} = \frac{\partial}{\partial \eta}\left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.$$
This example illustrates a case where the method is very simple, whereas computing the same moments by direct integration against the density takes considerably more work.
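The two skew-logistic moments above can be checked by brute-force numerical integration. The sketch below is not part of the original derivation: it assumes a midpoint rule on the truncated interval [−60, 60] is accurate enough for moderate θ, and the helper names and the value θ = 2 are hypothetical.

```python
import math


def softplus_neg(x: float) -> float:
    """T(x) = log(1 + exp(-x)), computed in a numerically stable way."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def skew_logistic_pdf(x: float, theta: float) -> float:
    """p(x) = theta * exp(-x) / (1 + exp(-x))**(theta + 1), evaluated via logs."""
    return theta * math.exp(-x - (theta + 1.0) * softplus_neg(x))


def moments_of_T(theta: float, lo: float = -60.0, hi: float = 60.0, n: int = 100_000):
    """Midpoint-rule estimates of the total mass, E[T] and var(T)."""
    step = (hi - lo) / n
    mass = m1 = m2 = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        w = skew_logistic_pdf(x, theta) * step  # probability weight of this cell
        t = softplus_neg(x)
        mass += w
        m1 += w * t
        m2 += w * t * t
    mean = m1 / mass
    var = m2 / mass - mean ** 2
    return mass, mean, var


theta = 2.0
mass, mean, var = moments_of_T(theta)
print(mass, mean, 1.0 / theta, var, 1.0 / theta ** 2)
```

The estimated mean and variance should agree closely with 1/θ and 1/θ², the values obtained by differentiating A(η).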
The exponential family arises naturally as the answer to the following question: what is the maximum entropy distribution consistent with given constraints on expected values?
The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, frequentists need to realize that this is a largely arbitrary choice, while Bayesians can just make this choice part of their prior probability distribution.
The entropy of dF(x) relative to dH(x) is
$$S[dF \mid dH] = -\int \frac{dF}{dH} \ln\frac{dF}{dH}\, dH$$
or
$$S[dF \mid dH] = \int \ln\frac{dH}{dF}\, dF$$
where dF/dH and dH/dF are Radon–Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely
$$S = -\sum_{i \in I} p_i \ln p_i,$$
assumes (though this is seldom pointed out) that dH is chosen to be counting measure on I.
Consider now a collection of observable quantities (random variables) $T_i$. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of $T_i$ be equal to $t_i$, is a member of the exponential family with dH as reference measure and $(T_1, \ldots, T_n)$ as sufficient statistic.
The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting $T_0 = 1$ be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is determined by the Lagrange multiplier associated with $T_0$.
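The calculation can be sketched as follows, assuming dF has a density f = dF/dH with respect to the reference measure. The symbols f, the functional written as a calligraphic L, and the multipliers λ_0, ..., λ_n are notation introduced here for illustration; they do not appear in the text above.

```latex
\begin{align*}
\mathcal{L}[f] &= -\int f \ln f \, dH
  + \sum_{i=0}^{n} \lambda_i \left( \int T_i \, f \, dH - t_i \right),
  \qquad T_0 = 1,\ t_0 = 1, \\
0 = \frac{\delta \mathcal{L}}{\delta f}
  &= -\ln f - 1 + \sum_{i=0}^{n} \lambda_i T_i, \\
f(x) &= \exp\Bigl( \sum_{i=1}^{n} \lambda_i T_i(x) - (1 - \lambda_0) \Bigr).
\end{align*}
```

Identifying η_i = λ_i for i ≥ 1 and A(η) = 1 − λ_0, with λ_0 fixed by the normalization constraint so that A(η) = ln ∫ exp(Σ η_i T_i) dH, recovers the exponential family form with reference measure dH.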
For examples of such derivations, see Maximum entropy probability distribution.
According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Less tersely, suppose $X_n$, n = 1, 2, 3, ..., are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic $T(X_1, \ldots, X_n)$ whose number of scalar components does not increase as the sample size n increases.
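To make the bounded-dimension property concrete, the sketch below uses the normal family as an assumed example: the three numbers (n, Σx, Σx²) are enough to evaluate the log-likelihood at any (μ, σ), however many observations there are. The function names, the seed, and the simulated data are hypothetical.

```python
import math
import random


def loglik_full(data: list[float], mu: float, sigma: float) -> float:
    """Normal log-likelihood computed from every observation."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def summarize(data: list[float]) -> tuple[int, float, float]:
    """Fixed-size sufficient statistic: (n, sum of x, sum of x^2)."""
    return len(data), sum(data), sum(x * x for x in data)


def loglik_from_summary(summary: tuple[int, float, float], mu: float, sigma: float) -> float:
    """Same log-likelihood, using only the three summary numbers."""
    n, s1, s2 = summary
    # sum((x - mu)^2) expands to s2 - 2*mu*s1 + n*mu^2
    sq_dev = s2 - 2 * mu * s1 + n * mu * mu
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sq_dev / (2 * sigma ** 2)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(1.0, 2.0) for _ in range(1000)]
summary = summarize(data)
print(loglik_full(data, 0.8, 1.9), loglik_from_summary(summary, 0.8, 1.9))
```

The summary stays at three numbers whether the sample has ten observations or ten million, which is the behaviour the theorem attributes to exponential families.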
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. When the likelihood belongs to an exponential family, there exists a conjugate prior, which is often also in the exponential family. A conjugate prior π for the parameter η of an exponential family is given by
$$\pi(\boldsymbol\eta) \propto \exp\bigl(\boldsymbol\eta^{\mathsf T} \boldsymbol\alpha - \beta\, A(\boldsymbol\eta)\bigr),$$
where $\boldsymbol\alpha \in \mathbb{R}^n$ and β > 0 are hyperparameters (parameters controlling parameters).
A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution, the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the success probability θ of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.
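The beta-binomial case can be sketched in a few lines. The example assumes a Beta(a, b) prior and k successes in n trials, for which the conjugate update is Beta(a + k, b + n − k), and compares that closed form with a brute-force posterior computed on a grid. All names and the numbers a = 2, b = 3, k = 7, n = 10 are illustrative.

```python
import math


def beta_logpdf(p: float, a: float, b: float) -> float:
    """Log density of a Beta(a, b) distribution at p in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_norm


def posterior_params(a: float, b: float, successes: int, trials: int) -> tuple[float, float]:
    """Conjugate update: add successes to a and failures to b."""
    return a + successes, b + trials - successes


def grid_posterior(a: float, b: float, successes: int, trials: int, grid: list[float]) -> list[float]:
    """Brute force: prior times likelihood on an evenly spaced grid, then normalised."""
    step = grid[1] - grid[0]
    unnorm = [
        math.exp(beta_logpdf(p, a, b)
                 + successes * math.log(p)
                 + (trials - successes) * math.log(1 - p))
        for p in grid
    ]
    total = sum(unnorm) * step
    return [u / total for u in unnorm]


grid = [(i + 0.5) / 2000 for i in range(2000)]  # midpoints of 2000 cells on (0, 1)
a_post, b_post = posterior_params(2.0, 3.0, 7, 10)
numeric = grid_posterior(2.0, 3.0, 7, 10, grid)
closed_form = [math.exp(beta_logpdf(p, a_post, b_post)) for p in grid]
max_diff = max(abs(u - v) for u, v in zip(numeric, closed_form))
print(a_post, b_post, max_diff)
```

The closed-form update needs only two additions, whereas the grid computation has to evaluate and normalise the product at every grid point.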
An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.
The one-parameter exponential family has a monotone non-decreasing likelihood ratio in the sufficient statistic T(x), provided that η(θ) is non-decreasing. As a consequence, there exists a uniformly most powerful test for testing the hypothesis $H_0\colon \theta \ge \theta_0$ versus $H_1\colon \theta < \theta_0$.
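The monotone likelihood ratio property can be observed numerically. In the sketch below the Poisson family is an assumed example, with T(x) = x and η(λ) = log λ increasing in λ, so the log-likelihood ratio between a larger and a smaller rate should be non-decreasing in x. The rates 2 and 5 and the function names are arbitrary illustrative choices.

```python
import math


def poisson_log_pmf(x: int, lam: float) -> float:
    """Log of the Poisson pmf: x * log(lam) - lam - log(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def log_likelihood_ratio(x: int, lam_small: float, lam_large: float) -> float:
    """log f(x | lam_large) - log f(x | lam_small)."""
    return poisson_log_pmf(x, lam_large) - poisson_log_pmf(x, lam_small)


ratios = [log_likelihood_ratio(x, 2.0, 5.0) for x in range(25)]
is_monotone = all(later >= earlier for earlier, later in zip(ratios, ratios[1:]))
print(is_monotone)
```

Each unit increase in x adds log(5/2) to the log ratio, which is the difference η(5) − η(2) multiplying T(x) = x.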
The exponential family forms the basis for the distribution function used in generalized linear models, a class of models that encompasses many of the commonly used regression models in statistics.