Linear model
From Wikipedia, the free encyclopedia
In statistics, given a (random) sample $(Y_i, X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, the most general form of linear model is formulated as
$$Y_i = \beta_0 + \beta_1 \phi_1(X_{i1}) + \cdots + \beta_p \phi_p(X_{ip}) + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $\phi_1, \ldots, \phi_p$ may be nonlinear functions.
In matrix notation this model can be written as
$$Y = X\beta + \varepsilon,$$
where Y is an n × 1 column vector, X is an n × (p + 1) matrix, β is a (p + 1) × 1 vector of (unobservable) parameters, and ε is an n × 1 vector of errors, which are uncorrelated random variables each with expected value 0 and variance σ². Note that, depending on the context, the sample can be seen as fixed (observable) or random.
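As a minimal sketch of the matrix form, the following builds a design matrix whose columns are a constant and transformed regressors, then simulates Y = Xβ + ε. The helper name `design_matrix`, the transforms (log and square), the coefficient values and the noise level are all assumptions chosen for illustration; the point is only that the model remains linear in β even when the transforms are nonlinear.

```python
import numpy as np

def design_matrix(x_cols, transforms):
    """Stack a column of ones and phi_j(X_j) for each regressor j."""
    n = len(x_cols[0])
    cols = [np.ones(n)]  # intercept column, multiplies beta_0
    cols += [phi(np.asarray(x, dtype=float)) for x, phi in zip(x_cols, transforms)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(1.0, 3.0, size=n)
x2 = rng.uniform(0.0, 1.0, size=n)

# Hypothetical choice: phi_1 = log, phi_2 = square.
X = design_matrix([x1, x2], [np.log, np.square])   # shape n x (p + 1), here p = 2

beta = np.array([1.0, 2.0, -0.5])        # illustrative parameter vector
eps = rng.normal(0.0, 0.3, size=n)       # uncorrelated errors, mean 0
Y = X @ beta + eps                       # the model Y = X beta + eps
```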
Much of the theory of linear models is associated with linear regression, especially with estimating the values of the parameters β and σ². Typically, this is done using the least-squares method, which, according to the Gauss-Markov theorem, has minimum variance among linear unbiased estimators.
If the error is known to follow a normal distribution, then the method of maximum likelihood can be used, which agrees with the linear least squares method for estimating the mean parameter. The methods disagree for estimating the variance, for which the maximum-likelihood method is biased.
Methods of inference
Maximum likelihood estimation of coefficients
Multivariate normal errors
The method of maximum likelihood requires that the statistician know a parametric family of probability distributions. The most common assumption takes the components of the vector of errors to be independent and normally distributed, giving Y a multivariate normal distribution with mean Xβ and covariance matrix σ²I, where I is the identity matrix.
Having observed the values of X and Y, the statistician must estimate β and σ².
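β
The log-likelihood function (for εi independent and normally distributed) is
$$\ell(\beta, \sigma^2; Y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(Y_i - x_i^T\beta\right)^2,$$
where $x_i^T$ is the ith row of X. Differentiating with respect to βj, we get
$$\frac{\partial \ell}{\partial \beta_j} = \frac{1}{\sigma^2}\sum_{i=1}^n x_{ij}\left(Y_i - x_i^T\beta\right),$$
so setting this set of p equations to zero and solving for β gives
$$X^T X \hat{\beta} = X^T Y.$$
Now, using the assumption that X has rank p (X is taken here to have p columns, so this means full column rank), we can invert the matrix on the left hand side to give the maximum likelihood estimate for β:
$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$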
We can check that this is a maximum by looking at the Hessian matrix of the log-likelihood function.
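The following sketch illustrates that check numerically on simulated data (sizes, seed and coefficients are assumptions). It solves the normal equations XᵀXβ = XᵀY, confirms that the gradient of the log-likelihood in β vanishes at the solution, and confirms that the Hessian in β, which is −XᵀX/σ², has only negative eigenvalues when X has full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # hypothetical full-rank design
beta_true = np.array([0.5, -1.0, 2.0])      # illustrative coefficients
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X^T X beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient of the log-likelihood in beta, up to the positive factor 1/sigma^2.
score = X.T @ (Y - X @ beta_hat)

# Hessian in beta is -X^T X / sigma^2; its sign pattern does not depend on sigma^2.
hessian_eigs = np.linalg.eigvalsh(-(X.T @ X))
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is usually preferable numerically.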
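σ²
By setting the right hand side of
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n \left(Y_i - x_i^T\hat{\beta}\right)^2$$
to zero and solving for σ² we find that
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left(Y_i - x_i^T\hat{\beta}\right)^2 = \frac{1}{n}\left\|Y - X\hat{\beta}\right\|^2.$$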
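A small numerical sketch of this step, on simulated data with assumed sizes and coefficients: the maximum-likelihood estimate of σ² is the residual sum of squares divided by n, and the derivative of the normal log-likelihood with respect to σ², namely −n/(2σ²) + RSS/(2σ⁴), is zero there.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # hypothetical design
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
rss = np.sum((Y - X @ beta_hat) ** 2)    # residual sum of squares
sigma2_mle = rss / n                     # MLE of sigma^2 divides by n

def loglik(s2):
    """Normal log-likelihood at beta_hat as a function of sigma^2."""
    return -0.5 * n * np.log(2 * np.pi * s2) - rss / (2 * s2)

def dloglik(s2):
    """Derivative of loglik with respect to sigma^2."""
    return -n / (2 * s2) + rss / (2 * s2 ** 2)
```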
Maximum likelihood estimation is biased
Since Y follows a multivariate normal distribution with mean Xβ and covariance matrix σ²I, we can deduce the distribution of the MLE of β:
$$\hat{\beta} = (X^T X)^{-1} X^T Y \sim N_p\left(\beta, \sigma^2 (X^T X)^{-1}\right).$$
So this estimate is unbiased for β, and we can show that this variance achieves the Cramér-Rao bound.
A more complicated argument[1] shows that
$$\hat{\sigma}^2 \sim \frac{\sigma^2}{n}\chi^2_{n-p}.$$
Since a chi-squared distribution with n − p degrees of freedom has mean n − p, the expected value of $\hat{\sigma}^2$ is $\frac{n-p}{n}\sigma^2$, so the estimator is biased. However, the bias is small for large samples and approaches zero as the sample size grows.
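The bias can be illustrated with a small Monte Carlo sketch. All values here (n = 10, p = 3, σ² = 4, the number of replications, the seed) are assumptions chosen so that the bias factor (n − p)/n is clearly visible; dividing the residual sum of squares by n − p instead of n removes the bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 10, 3, 4.0
X = rng.normal(size=(n, p))              # fixed hypothetical design with p columns
beta = np.array([1.0, -2.0, 0.5])
reps = 20000

# Each row of Ys is one simulated response vector Y = X beta + eps.
E = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
Ys = X @ beta + E

# Projection onto the column space of X; it is symmetric, so it can act on rows.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = Ys - Ys @ H
rss = np.sum(resid ** 2, axis=1)

mle_mean = np.mean(rss / n)              # close to (n - p) / n * sigma2
unbiased_mean = np.mean(rss / (n - p))   # close to sigma2
```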
Best linear unbiased estimation (BLUE)
The least-squares estimator is often used to estimate the coefficients of a linear regression. The least-squares method minimizes the sum of the squares of the residuals.
This section closes with some properties of this estimator and a geometrical interpretation.
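Assumptions
For $p \in \mathbb{N}^+$, let Y be a random variable taking values in $\mathbb{R}$, which we call the observation.
We next define the function η, linear in θ:
$$\eta(X;\theta) = \sum_{j=1}^p \theta_j X_j,$$
where
- for $j \in \{1, \ldots, p\}$, Xj is a random variable taking values in $\mathbb{R}$ and is called a factor, and
- θj is a scalar, for $j \in \{1, \ldots, p\}$, and $\theta^t = (\theta_1, \ldots, \theta_p)$, where θt denotes the transpose of vector θ.
Let $X^t = (X_1, \ldots, X_p)$. We can write η(X;θ) = Xtθ. Define the error to be:
$$\varepsilon(\theta) = Y - X^t\theta.$$
We suppose that there exists a true parameter $\overline{\theta} \in \mathbb{R}^p$ such that $\mathbb{E}[\varepsilon(\overline{\theta}) \mid X] = 0$. This means that, given the random variables $(X_1, \ldots, X_p)$, the best prediction we can make of Y is $\eta(X;\overline{\theta}) = X^t\overline{\theta}$. Henceforth, $\varepsilon$ will denote $\varepsilon(\overline{\theta})$ and η will represent $\eta(X;\overline{\theta})$.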
Least-squares estimator
The idea behind the least-squares estimator is to see linear regression as an orthogonal projection. Let F be the L2-space of all random variables whose square has a finite Lebesgue integral. Let G be the linear subspace of F generated by $X_1, \ldots, X_p$ (supposing that $Y \in F$ and $(X_1, \ldots, X_p) \in F^p$). We show in this paragraph that the function η is an orthogonal projection of Y on G, and we construct the least-squares estimator.
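Seeing linear regression as an orthogonal projection
We have $\eta(X;\overline{\theta}) = \mathbb{E}[Y \mid X]$, but $Z \mapsto \mathbb{E}[Z \mid X]$ is a projection, which means that η is a projection of Y on G. What is more, this projection is an orthogonal one.
To see this, we can build a scalar product in F: for all couples of random variables $U, V \in F$, we define $\langle U, V \rangle_2 := \mathbb{E}[UV]$. It is indeed a scalar product because if $\|U\|_2^2 = 0$, then U = 0 almost everywhere (where $\|U\|_2 := \sqrt{\langle U, U \rangle_2}$ is the norm corresponding to this scalar product).
For all $1 \le j \le p$,
$$\begin{aligned}
\langle X_j,\varepsilon \rangle_2 & = \langle X_j,Y-X^t \overline{\theta}\rangle_2 \\
& =\langle X_j,Y\rangle_2-\langle X_j,\mathbb{E}[Y|X]\rangle_2 \\
& =\mathbb{E}[X_j Y] - \mathbb{E}[X_j \mathbb{E}[Y|X]] \\
& =\mathbb{E}[X_j Y] - \mathbb{E}[\mathbb{E}[X_j Y|X]] \\
& =\mathbb{E}[X_j Y] - \mathbb{E}[X_j Y] \\
& =0,
\end{aligned}$$
where the fourth line uses the fact that Xj is determined by X and can therefore be moved inside the conditional expectation.
Therefore, $\varepsilon$ is orthogonal to any Xj and hence to the whole of the subspace G, which means that η is a projection of Y on G, orthogonal with respect to the scalar product we have just defined. We have therefore shown:
$$\eta(X;\overline{\theta}) = \arg\min_{Z \in G} \|Y - Z\|_2.$$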
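The orthogonality of the error to each factor can be illustrated by simulation. In this sketch the scalar product E[UV] is approximated by a sample mean. The two factors, the "true" parameter and the error distribution are assumptions, chosen so that the error has conditional mean zero given X while its spread depends on X; orthogonality does not require the error to be independent of the factors.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
x1 = rng.normal(size=n)
x2 = x1 ** 2                         # a second factor, dependent on x1
theta_bar = np.array([1.5, -0.7])    # hypothetical "true" parameter

# Error with E[eps | X] = 0 but a spread that depends on X.
eps = (1.0 + np.abs(x1)) * rng.normal(size=n)
y = theta_bar[0] * x1 + theta_bar[1] * x2 + eps

def inner(u, v):
    """Sample approximation of the scalar product E[UV]."""
    return np.mean(u * v)

# Both values should be close to zero, up to Monte Carlo noise.
ortho = np.array([inner(x1, eps), inner(x2, eps)])
```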
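Estimating the coefficients
If, for each $j \in \{1, \ldots, p\}$, we have a sample of size $n$ of Xj, written $(X_j^1, \ldots, X_j^n)$, along with a vector $\mathbf{Y}$ of n observations of Y, we can build an estimation of the coefficients of this orthogonal projection. To do this, we can use an estimation of the scalar product defined earlier.
For all couples of samples of size n, $\mathbf{U}$ and $\mathbf{V}$, of random variables U and V, we define $\langle \mathbf{U}, \mathbf{V} \rangle := \frac{\mathbf{U}^t \mathbf{V}}{n}$, where $\mathbf{U}^t$ is the transpose of vector $\mathbf{U}$, and $\|\cdot\| := \sqrt{\langle \cdot, \cdot \rangle}$. Note that the scalar product $\langle \cdot, \cdot \rangle$ is defined in Fn and no longer in F.
Let us define the design matrix (or random design), an $n \times p$ random matrix:
$$\mathbf{X}=\left[\begin{matrix}X_1^1&\cdots&X^1_p\\\vdots&&\vdots\\X^n_1&\cdots&X^n_p\end{matrix}\right]$$
We can now adapt the minimization of the sum of the squared residuals: the least-squares estimator $\widehat{\theta}$ will be the value, if it exists, of θ which minimizes $\|\mathbf{X}\theta - \mathbf{Y}\|^2$. Therefore, $\widehat{\eta} = \eta(\mathbf{X};\widehat{\theta}) = \mathbf{X}\widehat{\theta}$.
This yields $\mathbf{X}^t\mathbf{Y} = (\mathbf{X}^t\mathbf{X})\widehat{\theta}$. If $\mathbf{X}$ is of full rank, then so is $\mathbf{X}^t\mathbf{X}$. In that case we can compute the least-squares estimator explicitly by inverting the $p \times p$ matrix $\mathbf{X}^t\mathbf{X}$:
$$\widehat{\theta} = (\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{Y}.$$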
Qualities and geometrical interpretation
Qualities of this estimator
The Gauss-Markov theorem states that the least-squares estimator is the best linear unbiased estimator (BLUE) of $\overline{\theta}$.
The vector of errors $\varepsilon = \mathbf{Y} - \mathbf{X}\overline{\theta}$ is said to fulfil the Gauss-Markov assumptions if:
- $\mathbb{E}\,\varepsilon = 0$
- $\mathbb{V}\,\varepsilon = \sigma^2 \mathbf{I}_n$ (uncorrelated but not necessarily independent; homoscedastic but not necessarily identically distributed)
where $\sigma^2 < +\infty$ and $\mathbf{I}_n$ is the $n \times n$ identity matrix.
This decisive advantage has led to a sometimes abusive use of least-squares. Least-squares depends on the fulfilment of the Gauss-Markov assumptions, and applying the method where these conditions are not met can lead to inaccurate results. For example, in the study of time series, it is often difficult to assume that the errors are uncorrelated.
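As a hedged illustration of the "best" in BLUE, the sketch below compares least squares with one other linear unbiased estimator, a weighted least-squares estimator with arbitrary weights (a hypothetical competitor chosen only for this example). Any linear estimator AY with AX = I is unbiased, and under the Gauss-Markov assumptions its covariance is σ²AAᵗ. The theorem implies that the covariance of the competitor minus that of least squares is positive semi-definite, which can be checked through its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = rng.normal(size=(n, p))                  # hypothetical full-rank design

# Least squares: theta_hat = A_ols @ Y
A_ols = np.linalg.solve(X.T @ X, X.T)

# Competitor: weighted least squares with arbitrary positive weights.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
A_alt = np.linalg.solve(X.T @ W @ X, X.T @ W)

# Covariances are sigma^2 * A A^t when V(eps) = sigma^2 I; sigma^2 cancels
# in the comparison, so it is omitted here.
diff = A_alt @ A_alt.T - A_ols @ A_ols.T
diff_eigs = np.linalg.eigvalsh(diff)         # all >= 0 up to rounding
```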
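Geometrical interpretation
The situation described by the linear regression problem can be seen geometrically: the vector of fitted values $\mathbf{X}\widehat{\theta}$ is the orthogonal projection of the observation vector $\mathbf{Y}$ onto the subspace spanned by the columns of $\mathbf{X}$, and the residual vector is orthogonal to that subspace.
The least-squares estimator is also an M-estimator of ρ-type for $\rho(r) := \frac{r^2}{2}$.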
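The geometric picture can be checked numerically. In this sketch (random data, sizes and seed are assumptions), P = X(XᵗX)⁻¹Xᵗ is the matrix of the orthogonal projection onto the column space of X: it is symmetric and idempotent, it maps Y to the fitted values, and the residual is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = rng.normal(size=(n, p))      # hypothetical full-rank design matrix
Y = rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least-squares estimator

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the column space of X
fitted = P @ Y                          # equals X @ theta_hat
residual = Y - fitted                   # orthogonal to each column of X
```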
Generalizations
Generalized least squares
If, rather than taking the variance of ε to be σ²I, where I is the n×n identity matrix, one assumes the variance is Ω, where Ω is a known matrix other than the identity matrix, then one estimates β by the method of "generalized least squares", in which, instead of minimizing the sum of squares of the residuals (the squared Euclidean length of the residual vector), one minimizes the squared Mahalanobis length of the residual vector:
$$(Y - X\beta)^T \Omega^{-1} (Y - X\beta).$$
This has the effect of "de-correlating" the errors, and leads to the estimator
$$\hat{\beta} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} Y,$$
which is the best linear unbiased estimator for β. If all of the off-diagonal entries in the matrix Ω are 0, then one normally estimates β by the method of weighted least squares, with weights proportional to the reciprocals of the diagonal entries. The GLS estimator is also known as the Aitken estimator, after Alexander Aitken, who pioneered it.[2]
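The "de-correlating" remark can be made concrete with a short sketch. If Ω = LLᵀ is a Cholesky factorization, multiplying the model by L⁻¹ gives errors with identity covariance, so ordinary least squares on the transformed data coincides with the closed-form estimator (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹Y. The particular Ω used here (entries 0.7^|i−j|), the sizes and the coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # hypothetical design

# A known, positive-definite covariance matrix (illustrative choice).
idx = np.arange(n)
Omega = 0.7 ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Omega)                 # Omega = L @ L.T

# Correlated errors with covariance Omega.
Y = X @ np.array([1.0, 3.0]) + L @ rng.normal(size=n)

# Closed form: (X^T Omega^-1 X)^-1 X^T Omega^-1 Y, using solves, not inverses.
Oi_X = np.linalg.solve(Omega, X)
Oi_Y = np.linalg.solve(Omega, Y)
beta_gls = np.linalg.solve(X.T @ Oi_X, X.T @ Oi_Y)

# Whitening: L^-1 Y = L^-1 X beta + e, where e has identity covariance.
Xw = np.linalg.solve(L, X)
Yw = np.linalg.solve(L, Y)
beta_whitened = np.linalg.lstsq(Xw, Yw, rcond=None)[0]
```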
Generalized linear models
Generalized linear models are models for which, rather than
- E(Y) = Xβ,
one has
- g(E(Y)) = Xβ,
where g is the "link function". The distribution of Y is also not restricted to being normal.
An example is the Poisson regression model, which states that Yi has a Poisson distribution with expected value $e^{\beta_0 + \beta_1 x_i}$. The link function is the natural logarithm function. Having observed xi and Yi for i = 1, ..., n, one can estimate β0 and β1 by the method of maximum likelihood.
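A minimal sketch of that maximum-likelihood fit, using Newton's method on simulated data. The function name `fit_poisson`, the starting point, the fixed iteration count and the simulated coefficients are assumptions; this is one common way to maximize the Poisson log-likelihood, not a method prescribed by the article. With the log link, the gradient of the log-likelihood is Xᵀ(Y − μ) and the negative Hessian is Xᵀ diag(μ) X, where μ = exp(Xβ).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])         # columns for beta_0 and beta_1
beta_true = np.array([0.5, 1.0])             # illustrative coefficients
Y = rng.poisson(np.exp(X @ beta_true))

def fit_poisson(X, Y, iters=25):
    """Newton iterations for Poisson regression with a log link."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)                # fitted means
        grad = X.T @ (Y - mu)                # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])       # negative Hessian
        beta = beta + np.linalg.solve(info, grad)
    return beta

beta_hat = fit_poisson(X, Y)
```

A production implementation would normally add a convergence test and step control rather than a fixed number of iterations.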
References
- [1] A. C. Davison, Statistical Models. Cambridge University Press (2003).
- [2] Alexander Craig Aitken
See also
- ANOVA, or analysis of variance, is historically a precursor to the development of linear models. Here the model parameters themselves are not computed, but X column contributions and their significance are identified using the ratios of within-group variances to the error variance and applying the F test.
- Linear regression
- Robust regression