In statistics and mathematics, linear least squares is a computational approach to fitting a mathematical or statistical model to data. It can be applied when the idealized value provided by the model for each data point is expressed linearly in terms of the unknown parameters of the model. The resulting fitted model can be used to summarize the data, to predict unobserved values from the same system, and to understand the mechanisms that may underlie the system.
Mathematically, linear least squares is the problem of approximately solving an overdetermined system of linear equations, where the best approximation is defined as that which minimizes the sum of squared differences between the data values and their corresponding modeled values. The approach is called "linear" least squares because the model is linear in the unknown parameters; as a consequence, the solution depends linearly on the data. Linear least squares problems are convex and have a closed-form solution that is unique, except in special degenerate situations. In contrast, nonlinear least squares problems generally must be solved by an iterative procedure, and are often nonconvex with multiple local solutions.
In statistics, linear least squares is the computational basis for ordinary least squares analysis, which is one type of regression analysis.
As a result of an experiment, four (x,y) data points were obtained: (1,6), (2,5), (3,7), and (4,10) (shown in red in the picture on the right). It is desired to find a line y = β_{1} + β_{2}x that "best" fits these four points. In other words, we would like to find the numbers β_{1} and β_{2} that approximately solve the overdetermined linear system
    β_{1} + 1β_{2} = 6
    β_{1} + 2β_{2} = 5
    β_{1} + 3β_{2} = 7
    β_{1} + 4β_{2} = 10
of four equations in two unknowns in some "best" sense.
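The least squares approach to solving this problem is to make the sum of squares of the "errors" between the right- and left-hand sides of these equations as small as possible, that is, to find the minimum of the function
    S(β_{1},β_{2}) = [6 − (β_{1} + 1β_{2})]^{2} + [5 − (β_{1} + 2β_{2})]^{2} + [7 − (β_{1} + 3β_{2})]^{2} + [10 − (β_{1} + 4β_{2})]^{2}.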
The minimum is determined by calculating the partial derivatives of S(β_{1},β_{2}) with respect to β_{1} and β_{2} and setting them to zero:
    ∂S/∂β_{1} = 0 = 8β_{1} + 20β_{2} − 56
    ∂S/∂β_{2} = 0 = 20β_{1} + 60β_{2} − 154
This results in a system of two equations in two unknowns, called the normal equations, which, when solved, give the solution
    β_{1} = 3.5,  β_{2} = 1.4
and the equation y = 3.5 + 1.4x of the line of best fit. The residuals, that is, the discrepancies between the y values from the experiment and the y values calculated using the line of best fit, are then found to be 1.1, −1.3, −0.7, and 0.9 (see the picture on the right). The minimum value of the sum of squares is S(3.5,1.4) = 1.1^{2} + (−1.3)^{2} + (−0.7)^{2} + 0.9^{2} = 4.2.
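The numbers in this worked example can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`X`, `beta`, `S_min`) are illustrative choices and not part of the original text.

```python
import numpy as np

# The four (x, y) data points from the example.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Design matrix: a column of ones (for beta_1) and a column of x (for beta_2).
X = np.column_stack([np.ones_like(x), x])

# Solve the 2x2 normal equations (X^T X) beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals and the minimized sum of squares.
residuals = y - X @ beta
S_min = float(residuals @ residuals)

print(beta)       # approximately [3.5, 1.4]
print(residuals)  # approximately [1.1, -1.3, -0.7, 0.9]
print(S_min)      # approximately 4.2
```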
The common computational procedure to find a first-degree polynomial function approximation in a situation like this is as follows.
Use n for the number of data points.
Find the four sums: ∑x, ∑x^{2}, ∑xy, and ∑y.
The calculations for the slope m (β_{2} in the previous example) and the y-intercept b (β_{1}) are as follows.
    m = (n∑xy − ∑x∑y) / (n∑x^{2} − (∑x)^{2})
    b = (∑y − m∑x) / n
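The sum-based procedure can be sketched in plain Python. This is an illustrative implementation of the standard closed-form expressions for a straight-line fit, slope = (n∑xy − ∑x∑y)/(n∑x² − (∑x)²) and intercept = (∑y − slope·∑x)/n; the function name `fit_line` is a hypothetical choice.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # The four sums used by the procedure.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [6, 5, 7, 10])
print(slope, intercept)  # approximately 1.4 and 3.5
```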
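Consider an overdetermined system
    ∑_{j=1}^{n} X_{ij}β_{j} = y_{i},  (i = 1, 2, …, m),
of m linear equations in n unknown coefficients, β_{1},β_{2},…,β_{n}, with m > n, written in matrix form as
    Xβ = y,
where X is the m×n matrix with entries X_{ij}, β = (β_{1}, β_{2}, …, β_{n})^{T} is the vector of unknown coefficients, and y = (y_{1}, y_{2}, …, y_{m})^{T} is the vector of data values.
Such a system usually has no solution, and the goal is then to find the coefficients β which fit the equations "best", in the sense of solving the quadratic minimization problem
    β̂ = arg min_{β} S(β),  where  S(β) = ∑_{i=1}^{m} |y_{i} − ∑_{j=1}^{n} X_{ij}β_{j}|^{2} = ‖y − Xβ‖^{2}.
A justification for choosing this criterion is given in properties below. This minimization problem has a unique solution, provided that the n columns of the matrix X are linearly independent, given by solving the normal equations
    (X^{T}X)β̂ = X^{T}y.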
The primary application of linear least squares is in data fitting. Given a set of m data points y_{1}, y_{2}, …, y_{m}, consisting of experimentally measured values taken at m values x_{1}, x_{2}, …, x_{m} of an independent variable (x_{i} may be scalar or vector quantities), and given a model function y = f(x, β), with β = (β_{1}, β_{2}, …, β_{n}), it is desired to find the parameters β_{j} such that the model function "best" fits the data. In linear least squares, linearity is meant to be with respect to the parameters β_{j}, so
    f(x, β) = ∑_{j=1}^{n} β_{j}φ_{j}(x).
Here, the functions φ_{j} may be nonlinear with respect to the variable x.
Ideally, the model function fits the data exactly, so
    y_{i} = f(x_{i}, β)
for all i = 1, 2, …, m. This is usually not possible in practice, as there are more data points than there are parameters to be determined. The approach chosen then is to find the minimal possible value of the sum of squares of the residuals
    r_{i}(β) = y_{i} − f(x_{i}, β),  (i = 1, 2, …, m),
that is, to minimize the function
    S(β) = ∑_{i=1}^{m} r_{i}^{2}(β).
After substituting for r_{i} and then for f, this minimization problem becomes the quadratic minimization problem above with
    X_{ij} = φ_{j}(x_{i}),
and the best fit can be found by solving the normal equations.
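To make the role of the functions φ_{j} concrete, the sketch below builds a matrix whose (i, j) entry is φ_{j}(x_{i}) and fits a model that is nonlinear in x but linear in the parameters. It assumes NumPy; the quadratic basis, the synthetic noise-free data, and the helper name `design_matrix` are hypothetical choices made only for illustration.

```python
import numpy as np

def design_matrix(x, basis):
    """Build X with X[i, j] = phi_j(x_i) for a list of basis functions."""
    return np.column_stack([phi(x) for phi in basis])

# Hypothetical basis: 1, x, x^2 (nonlinear in x, linear in the parameters).
basis = [lambda t: np.ones_like(t), lambda t: t, lambda t: t ** 2]

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2  # synthetic, noise-free observations

X = design_matrix(x, basis)

# np.linalg.lstsq minimizes ||y - X beta||^2 directly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, 0.5]
```

Because the data here are generated without noise from the same basis, the fit recovers the generating coefficients up to rounding; with real measurements the residuals would generally be non-zero.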
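S is minimized when its gradient with respect to the parameters is equal to zero. The elements of the gradient vector are the partial derivatives of S with respect to the parameters:
    ∂S/∂β_{j} = 2 ∑_{i=1}^{m} r_{i} (∂r_{i}/∂β_{j}),  (j = 1, 2, …, n).
Since r_{i} = y_{i} − ∑_{k=1}^{n} X_{ik}β_{k}, the derivatives are
    ∂r_{i}/∂β_{j} = −X_{ij}.
Substitution of the expressions for the residuals and the derivatives into the gradient equations gives
    ∂S/∂β_{j} = −2 ∑_{i=1}^{m} X_{ij} (y_{i} − ∑_{k=1}^{n} X_{ik}β̂_{k}) = 0,  (j = 1, 2, …, n).
Upon rearrangement, the normal equations
    ∑_{i=1}^{m} ∑_{k=1}^{n} X_{ij}X_{ik}β̂_{k} = ∑_{i=1}^{m} X_{ij}y_{i},  (j = 1, 2, …, n),
are obtained. The normal equations are written in matrix notation as
    (X^{T}X)β̂ = X^{T}y.
The solution of the normal equations yields the vector β̂ of the optimal parameter values.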
A general approach to the least squares problem can be described as follows. Suppose that we can find an n by m matrix S (not to be confused with the objective function S(β)) such that XS is an orthogonal projection onto the image of X. Then a solution to our minimization problem is given by
    β = Sy,
simply because
    Xβ = X(Sy) = (XS)y
is exactly the sought-for orthogonal projection of y onto the image of X (see the picture below, and note that, as explained in the next section, the image of X is just the subspace generated by the column vectors of X). A few popular ways to find such a matrix S are described below.
The algebraic solution of the normal equations can be written as
    β̂ = (X^{T}X)^{−1}X^{T}y = X^{+}y,
where X^{+} is the Moore–Penrose pseudoinverse of X. Although this equation is correct and can work in many applications, it is not computationally efficient to invert the normal equations matrix. An exception occurs in numerical smoothing and differentiation, where an analytical expression is required.
If the matrix X^{T}X is well-conditioned and positive definite, that is, it has full rank, the normal equations can be solved directly by using the Cholesky decomposition X^{T}X = R^{T}R, where R is an upper triangular matrix, giving:
    R^{T}Rβ̂ = X^{T}y.
The solution is obtained in two stages: a forward substitution step, solving for z:
    R^{T}z = X^{T}y,
followed by a backward substitution, solving for β̂:
    Rβ̂ = z.
Both substitutions are facilitated by the triangular nature of R.
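The two-stage Cholesky solve can be sketched as follows, assuming NumPy. Note that `np.linalg.cholesky` returns the lower-triangular factor L with X^{T}X = LL^{T}, so the upper-triangular R of the text corresponds to L^{T}. The function name is hypothetical, and the substitution loops are written out explicitly for clarity rather than speed.

```python
import numpy as np

def solve_normal_equations_cholesky(X, y):
    """Solve (X^T X) beta = X^T y using a Cholesky factorization."""
    A = X.T @ X
    b = X.T @ y
    # Lower-triangular L with A = L L^T; in the text's notation R = L^T.
    L = np.linalg.cholesky(A)
    n = len(b)

    # Forward substitution: solve R^T z = b, i.e. L z = b.
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]

    # Backward substitution: solve R beta = z, i.e. L^T beta = z.
    R = L.T
    beta = np.zeros(n)
    for i in range(n - 1, -1, -1):
        beta[i] = (z[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
beta_chol = solve_normal_equations_cholesky(X, y)
print(beta_chol)  # approximately [3.5, 1.4]
```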
See example of linear regression for a worked-out numerical example with three parameters.
Orthogonal decomposition methods of solving the least squares problem are slower than the normal equations method but are more numerically stable, because they avoid forming the product X^{T}X.
The residuals are written in matrix notation as
    r = y − Xβ.
The matrix X is subjected to an orthogonal decomposition; the QR decomposition will serve to illustrate the process:
    X = QR,
where Q is an m×m orthogonal matrix and R is an m×n matrix which is partitioned into an n×n upper triangular block, R_{n}, and an (m − n)×n zero block 0.
The residual vector is left-multiplied by Q^{T}:
    Q^{T}r = Q^{T}y − (Q^{T}Q)Rβ = [(Q^{T}y)_{n} − R_{n}β; (Q^{T}y)_{m−n}] = [u; v],
where (Q^{T}y)_{n} denotes the first n components of Q^{T}y and (Q^{T}y)_{m−n} the remaining m − n components.
Because Q is orthogonal, the sum of squares of the residuals, s, may be written as:
    s = ‖r‖^{2} = r^{T}r = r^{T}QQ^{T}r = u^{T}u + v^{T}v.
Since v doesn't depend on β, the minimum value of s is attained when the upper block, u, is zero. Therefore the parameters are found by solving:
    R_{n}β̂ = (Q^{T}y)_{n}.
These equations are easily solved as R_{n} is upper triangular.
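A minimal sketch of the QR route, assuming NumPy. The "reduced" mode of `np.linalg.qr` returns only the first n columns of Q together with the n×n upper triangular block R_{n}, which is all that the final triangular system requires. The function name `lstsq_qr` is hypothetical, and `np.linalg.solve` is used here as a general solver; it does not exploit the triangular structure.

```python
import numpy as np

def lstsq_qr(X, y):
    """Least squares solution of X beta ~ y via a reduced QR decomposition."""
    # Q1: m x n with orthonormal columns; Rn: n x n upper triangular.
    Q1, Rn = np.linalg.qr(X, mode="reduced")
    # Solve Rn beta = Q1^T y (the first n components of Q^T y).
    return np.linalg.solve(Rn, Q1.T @ y)


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
beta_qr = lstsq_qr(X, y)
print(beta_qr)  # approximately [3.5, 1.4]
```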
An alternative decomposition of X is the singular value decomposition (SVD)^{[1]}
    X = UΣV^{T},
where U is an m by m orthogonal matrix, V is an n by n orthogonal matrix and Σ is an m by n matrix with all its elements outside of the main diagonal equal to 0. The pseudoinverse Σ^{+} of Σ is easily obtained by inverting its non-zero diagonal elements and transposing, so that X^{+} = VΣ^{+}U^{T}. Hence,
    XX^{+} = UΣV^{T}VΣ^{+}U^{T} = UPU^{T},
where P = ΣΣ^{+} is the m by m diagonal matrix with ones in the positions of the non-zero singular values and zeros elsewhere. Since X and Σ are of the same rank (one of the many advantages of singular value decomposition),
    XX^{+}
is an orthogonal projection onto the image (column space) of X and, in accordance with the general approach described in the introduction above,
    β = X^{+}y = VΣ^{+}U^{T}y
is a solution of the least squares problem. This method is the most computationally intensive, but is particularly useful if the normal equations matrix, X^{T}X, is very ill-conditioned (i.e. if its condition number multiplied by the machine's relative round-off error is appreciably large). In that case, including the smallest singular values in the inversion merely adds numerical noise to the solution. This can be cured using the truncated SVD approach, which gives a more stable answer by explicitly setting to zero all singular values below a certain threshold and so ignoring them, a process closely related to factor analysis.
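The truncated SVD approach can be sketched as below, assuming NumPy. The relative threshold `rel_tol` and the function name are hypothetical choices; a suitable cutoff in practice depends on the noise level of the data and the precision of the arithmetic.

```python
import numpy as np

def lstsq_truncated_svd(X, y, rel_tol=1e-10):
    """Least squares via SVD, ignoring singular values below rel_tol * largest."""
    # Thin SVD: U is m x n, s holds the singular values (descending), Vt is n x n.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Invert only the singular values above the threshold; zero out the rest.
    keep = s > rel_tol * s[0]
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]

    # beta = V Sigma^+ U^T y
    return Vt.T @ (s_inv * (U.T @ y))


# Full-rank example: same answer as the normal equations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
beta_svd = lstsq_truncated_svd(X, y)

# Rank-deficient example: two identical columns, so X^T X is singular.
X_dup = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y_dup = np.array([1.0, 2.0, 3.0])
beta_dup = lstsq_truncated_svd(X_dup, y_dup)

print(beta_svd)  # approximately [3.5, 1.4]
print(beta_dup)  # approximately [0.5, 0.5], the minimum-norm solution
```

In the rank-deficient case the normal equations cannot be solved directly, whereas the truncated SVD still returns a finite solution, here the one of smallest norm.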
The gradient equations at the minimum can be written as
    (y − Xβ̂)^{T}X = 0.
A geometrical interpretation of these equations is that the vector of residuals, y − Xβ̂, is orthogonal to the column space of X, since the dot product (y − Xβ̂)·Xv is equal to zero for any conformal vector, v. This means that y − Xβ̂ is the shortest of all possible vectors y − Xβ, that is, the variance of the residuals is the minimum possible. This is illustrated at the right.
If the experimental errors, ε, are uncorrelated, have a mean of zero and a constant variance, σ^{2}, the Gauss–Markov theorem states that the least-squares estimator, β̂, has the minimum variance of all unbiased estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, the distribution function of the errors need not be a normal distribution. However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.
For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss–Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.
However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator.^{[2]}
These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.
An assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares, also known as the errors-in-variables model or rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.^{[3]}^{[4]}
In some cases the (weighted) normal equations matrix X^{T}X is ill-conditioned; this occurs when the measurements have only a marginal effect on one or more of the estimated parameters.^{[5]} In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is called Tikhonov regularization. If further information about the parameters is known, for example, a range of possible values of β, then various techniques can be used to increase the stability of the solution (see constrained least squares).
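As an illustration of regularization, the sketch below implements the simplest form of Tikhonov regularization (often called ridge regression), which minimizes ‖y − Xβ‖^{2} + λ‖β‖^{2} and leads to the modified normal equations (X^{T}X + λI)β = X^{T}y. It assumes NumPy; the function name and the value of λ are hypothetical, and choosing λ well is a separate problem not addressed here.

```python
import numpy as np

def ridge_lstsq(X, y, lam):
    """Minimize ||y - X beta||^2 + lam * ||beta||^2 for lam >= 0."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    n = X.shape[1]
    # Adding lam * I improves the conditioning of the normal equations matrix.
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])

beta_plain = ridge_lstsq(X, y, lam=0.0)  # ordinary least squares
beta_ridge = ridge_lstsq(X, y, lam=1.0)  # shrunk towards zero

print(beta_plain)
print(beta_ridge)
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_plain))
```

With λ = 0 the ordinary least squares solution is recovered; increasing λ trades a larger residual for a smaller, more stable parameter vector.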
Another drawback of the least squares estimator is the fact that the norm of the residuals, ‖y − Xβ̂‖, is minimized, whereas in some cases one is truly interested in obtaining a small error in the parameter estimate β̂, e.g., a small value of ‖β − β̂‖. However, since β is unknown, this quantity cannot be directly minimized. If a prior probability on β is known, then a Bayes estimator can be used to minimize the mean squared error, E{‖β − β̂‖^{2}}. The least squares method is often applied when no prior is known. Surprisingly, however, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James–Stein estimator.
In some cases the observations may be weighted; for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:
    β̂ = arg min_{β} ∑_{i=1}^{m} w_{i} |y_{i} − ∑_{j=1}^{n} X_{ij}β_{j}|^{2} = arg min_{β} ‖W^{1/2}(y − Xβ)‖^{2},
where w_{i} > 0 is the weight of the ith observation, and W is the diagonal matrix of such weights.
The weights should, ideally, be equal to the reciprocal of the variance of the measurement.^{[6]} The normal equations are then:
    (X^{T}WX)β̂ = X^{T}Wy.
This method is used in iteratively reweighted least squares.
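A weighted fit can be sketched as follows, assuming NumPy and a diagonal weight matrix W given as a vector of positive weights. The function solves the weighted normal equations (X^{T}WX)β = X^{T}Wy; its name and the example weights are hypothetical.

```python
import numpy as np

def weighted_lstsq(X, y, w):
    """Minimize sum_i w_i * r_i^2 by solving (X^T W X) beta = X^T W y."""
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        raise ValueError("weights must be positive")
    # X.T * w scales column i of X^T by w_i, i.e. it equals X^T W
    # without forming the m x m diagonal matrix W explicitly.
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])

beta_unit = weighted_lstsq(X, y, np.ones(4))          # unit weights
beta_w = weighted_lstsq(X, y, [1.0, 1.0, 4.0, 4.0])   # trust the last two points more

print(beta_unit)  # approximately [3.5, 1.4]
print(beta_w)
```

With unit weights the ordinary least squares result is recovered; unequal weights pull the fitted line towards the observations that are trusted more.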
The estimated parameter values are linear combinations of the observed values:
    β̂ = (X^{T}WX)^{−1}X^{T}Wy.
Therefore an expression for the estimated errors in the parameters can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the parameters by M^{β}. Then,
    M^{β} = (X^{T}WX)^{−1}X^{T}W M W^{T}X(X^{T}W^{T}X)^{−1}.
When W = M^{−1} this simplifies to
    M^{β} = (X^{T}WX)^{−1}.
When unit weights are used (W = I) it is implied that the experimental errors are uncorrelated and all equal: M = σ^{2}I, where σ^{2} is the variance of an observation, and I is the identity matrix. In this case σ^{2} is approximated by S/(m − n), where S is the minimum value of the objective function, so that
    M^{β} = (S/(m − n)) (X^{T}X)^{−1}.
In all cases, the variance of the parameter β_{i} is given by M^{β}_{ii} and the covariance between parameters β_{i} and β_{j} is given by M^{β}_{ij}. The standard deviation is the square root of the variance, σ_{i} = √(M^{β}_{ii}), and the correlation coefficient is given by ρ_{ij} = M^{β}_{ij}/(σ_{i}σ_{j}). These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are generally correlated.
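For the unit-weight case, the parameter uncertainties can be computed as in the sketch below. It assumes NumPy and the usual estimate of the observation variance, S/(m − n), with the parameter covariance taken as that estimate times (X^{T}X)^{−1}; the four-point straight-line data set and the variable names are used purely for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
m, n = X.shape

# Least squares estimate and minimized sum of squares S.
beta = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ beta
S = float(r @ r)

# Estimated observation variance and parameter covariance matrix.
sigma2 = S / (m - n)
M_beta = sigma2 * np.linalg.inv(X.T @ X)

# Standard deviations and correlation coefficients of the parameters.
std = np.sqrt(np.diag(M_beta))
corr = M_beta / np.outer(std, std)

print(M_beta)
print(std)
print(corr)
```

Even in this small example the off-diagonal correlation is strongly negative (about −0.91), showing that the intercept and slope estimates are correlated although the observations are treated as uncorrelated.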
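It is often assumed, for want of any concrete evidence, that the error on an observation belongs to a normal distribution with a mean of zero and standard deviation σ. Under that assumption the following probabilities can be derived for a normally distributed parameter estimate: approximately 68%, 95% and 99.7% that the true value lies within 1, 2 and 3 standard deviations of the estimate, respectively.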
The assumption is not unreasonable when m >> n. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with m − n degrees of freedom. When m >> n Student's t-distribution approximates to a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.^{[7]}
When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.
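The quoted bounds follow from Chebyshev's inequality, which limits the probability of a deviation of at least k standard deviations to 1/k². A short plain-Python check, with a hypothetical function name:

```python
def chebyshev_upper_bound(k):
    """Upper bound on P(|Z - E[Z]| >= k standard deviations), for k > 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A probability can never exceed 1, so cap the bound for k < 1.
    return min(1.0, 1.0 / k ** 2)


for k in (1, 2, 3):
    print(k, f"{chebyshev_upper_bound(k):.0%}")
```

For k = 1, 2 and 3 this prints 100%, 25% and 11%.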
The residuals are related to the observations by
    r̂ = y − Xβ̂ = y − Hy = (I − H)y,
where H is the idempotent matrix known as the hat matrix:
    H = X(X^{T}WX)^{−1}X^{T}W,
which is symmetric when unit weights are used, and I is the identity matrix. The variance-covariance matrix of the residuals, M^{r}, is given by
    M^{r} = (I − H)M(I − H)^{T}.
Thus the residuals are correlated, even if the observations are not.
The sum of residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by X^{T} (taking unit weights):
    X^{T}r̂ = X^{T}y − X^{T}Xβ̂ = X^{T}y − (X^{T}X)(X^{T}X)^{−1}X^{T}y = 0.
Say, for example, that the first term of the model is a constant, so that X_{i1} = 1 for all i. In that case it follows that
    ∑_{i=1}^{m} X_{i1}r̂_{i} = ∑_{i=1}^{m} r̂_{i} = 0.
Thus, in the motivational example above, the fact that the sum of residual values is equal to zero is not accidental but is a consequence of the presence of the constant term, β_{1}, in the model.
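These properties can be verified numerically for the motivational example. The sketch assumes NumPy and unit weights, in which case the hat matrix is X(X^{T}X)^{−1}X^{T}; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])
X = np.column_stack([np.ones_like(x), x])  # first column is the constant term
m = X.shape[0]

# Hat matrix for unit weights: H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Residuals obtained directly from the observations.
r = (np.eye(m) - H) @ y

print(np.allclose(H @ H, H))               # idempotent
print(np.allclose(H, H.T))                 # symmetric (unit weights)
print(np.allclose(X.T @ r, 0.0))           # residuals orthogonal to columns of X
print(np.isclose(r.sum(), 0.0, atol=1e-12))  # residuals sum to zero
```

All four checks print True; the last one holds because the model contains a constant term.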
If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals,^{[8]} but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.
The objective function can be written as
    S = r̂^{T}Wr̂ = y^{T}(I − H)^{T}W(I − H)y,
which for unit weights reduces to S = y^{T}(I − H)y, since (I − H) is then also symmetric and idempotent. It can be shown from this^{[9]} that the expected value of S is m − n. Note, however, that this is true only if the weights have been assigned correctly. If unit weights are assumed, the expected value of S is (m − n)σ^{2}, where σ^{2} is the variance of an observation.
If it is assumed that the residuals belong to a normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a chi-square (χ^{2}) distribution with m − n degrees of freedom. Some illustrative percentile values of χ^{2} are given in the following table.^{[10]}
m − n  χ^{2}_{0.50}  χ^{2}_{0.95}  χ^{2}_{0.99} 

10  9.34  18.3  23.2 
25  24.3  37.7  44.3 
100  99.3  124  136 
These values can be used for a statistical criterion as to the goodness of fit. When unit weights are used, the value of S should be divided by the variance of an observation before it is compared with these numbers.
Often it is of interest to solve a linear least squares problem with an additional constraint on the solution. With constrained linear least squares, the original equation
    Xβ = y
must be satisfied (in the least squares sense) while also ensuring that some other property of β is maintained. There are often special-purpose algorithms for solving such problems efficiently. Some examples of useful constraints are given below:
1. Free and open-source, with OSI-approved licenses
Name  License  Brief info 

bvls  BSD  Fortran code by Robert L. Parker & Philip B. Stark 
LAPACK (dgelss, dgelsd)  BSD  Made by Univ. of Tennessee, Univ. of California Berkeley, NAG Ltd., Courant Institute, Argonne National Lab, and Rice University 
OpenOpt  BSD  Universal cross-platform numerical optimization framework written in Python; see its linear least squares problem page and full list of problems 
GNU Octave  GPL  High-level language front end to numerical libraries, developed by Univ. Wisconsin–Madison 
2. Commercial