
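For a given dataset $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where X is the independent variable and Y is the dependent variable, linear regression fits the data to a model of the following form:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{1}$$

where $\beta_0$ (the intercept) and $\beta_1$ (the slope) are the parameters and $\varepsilon_i$ is the random error term.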
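To fit the model, assume that the residuals

$$\mathrm{res}_i = y_i - (\beta_0 + \beta_1 x_i) \tag{2}$$

conform to a normal (Gaussian) distribution with mean 0 and variance $\sigma_i^2$. Then the maximum likelihood estimates of the parameters $\beta_0$ and $\beta_1$ can be obtained by minimizing the chi-square value, defined as:

$$\chi^2 = \sum_{i=1}^{n} \frac{\left[y_i - (\beta_0 + \beta_1 x_i)\right]^2}{\sigma_i^2} \tag{3}$$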
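If the errors are treated as weights, the chi-square value to be minimized can be written as:

$$\chi^2 = \sum_{i=1}^{n} w_i \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2 \tag{4}$$

and

$$w_i = \frac{1}{\sigma_i^2} \tag{5}$$

where $\sigma_i$ are the measurement errors. If they are unknown, they should all be set to 1.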
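The fitting formulas are summarized here.
When $\chi^2$ is minimized, the estimated parameters of the linear model can be computed as:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{6}$$

$$\hat{\beta}_1 = \frac{SXY}{SXX} \tag{7}$$

where, with weights $w_i = 1/\sigma_i^2$ (all equal to 1 when the measurement errors are unknown):

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \tag{8}$$

and

$$SXY = \sum_{i=1}^{n} w_i (x_i - \bar{x})(y_i - \bar{y}) \ \text{(corrected)}, \qquad SXY = \sum_{i=1}^{n} w_i x_i y_i \ \text{(uncorrected)} \tag{9}$$

$$SXX = \sum_{i=1}^{n} w_i (x_i - \bar{x})^2 \ \text{(corrected)}, \qquad SXX = \sum_{i=1}^{n} w_i x_i^2 \ \text{(uncorrected)} \tag{10}$$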
| Note: When the intercept is excluded from the model, the coefficients are calculated using the uncorrected formula. |
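As a minimal sketch of the parameter estimates (6) to (10), the following Python function computes the slope as the ratio of the weighted cross-product sum to the weighted sum of squares of x, using centered (corrected) sums when the intercept is included and raw (uncorrected) sums when it is excluded. The function name `fit_line`, its arguments, and the sample data are illustrative assumptions, not part of the original text.

```python
def fit_line(x, y, sigma=None, intercept=True):
    """Weighted least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    # Unknown measurement errors are all set to 1, which gives unit weights.
    if sigma is None:
        sigma = [1.0] * len(x)
    w = [1.0 / s ** 2 for s in sigma]

    if intercept:
        sw = sum(w)
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        # Corrected (centered) sums.
        sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
        sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
        beta1 = sxy / sxx
        beta0 = y_bar - beta1 * x_bar
    else:
        # Uncorrected sums: the line is forced through the origin.
        sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        sxx = sum(wi * xi ** 2 for wi, xi in zip(w, x))
        beta1 = sxy / sxx
        beta0 = 0.0
    return beta0, beta1


# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
beta0, beta1 = fit_line(x, y)
print(beta0, beta1)
```

For this sample the centered sums are 19.6 and 10, so the slope is 1.96 and the intercept is 6.1 - 1.96 × 3 = 0.22.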
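For each parameter, the standard error can be obtained by:

$$s_{\hat{\beta}_0} = s_\varepsilon \sqrt{\frac{\sum_{i=1}^{n} w_i x_i^2}{SXX \sum_{i=1}^{n} w_i}} \tag{11}$$

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{SXX}} \tag{12}$$

where the sample variance $s_\varepsilon^2$ can be estimated as follows:

$$s_\varepsilon^2 = \frac{RSS}{df_{Error}} = \frac{RSS}{n^* - 1} \tag{13}$$

RSS is the residual sum of squares (or error sum of squares, SSE), that is, the weighted sum of the squares of the vertical deviations from each data point to the fitted line. It can be computed as:

$$RSS = \sum_{i=1}^{n} w_i \left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^2 \tag{14}$$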
| Note: Regarding $n^*$: if the intercept is included in the model, $n^* = n - 1$. Otherwise, $n^* = n$. |
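The standard errors above can be sketched in Python as follows. This example assumes unit weights (so the sum of the weights equals n) and a model that includes the intercept (so the error degrees of freedom are n* - 1 = n - 2). The function name and sample data are hypothetical, and the parameter values passed in are the least-squares estimates for that sample.

```python
import math


def standard_errors(x, y, beta0, beta1):
    """Return (RSS, s2, se_beta0, se_beta1) for an unweighted fit with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Residual sum of squares: squared vertical deviations from the fitted line.
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    n_star = n - 1                 # intercept included
    s2 = rss / (n_star - 1)        # sample variance, error df = n* - 1
    se_beta0 = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))
    se_beta1 = math.sqrt(s2 / sxx)
    return rss, s2, se_beta0, se_beta1


# Illustrative data and its least-squares estimates (assumed for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
rss, s2, se0, se1 = standard_errors(x, y, beta0=0.22, beta1=1.96)
print(rss, s2, se0, se1)
```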
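If the regression assumptions hold, we have:

$$\frac{\hat{\beta}_0 - \beta_0}{s_{\hat{\beta}_0}} \sim t_{n^*-1}, \qquad \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} \sim t_{n^*-1} \tag{15}$$

where $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$ are the standard errors of the estimates and $t_{n^*-1}$ is the Student's t distribution with $n^* - 1$ degrees of freedom.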
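The t-test can be used to examine whether the fitting parameters are significantly different from zero, which means that we can test whether $\beta_0 = 0$ (if true, this means that the fitted line passes through the origin) or $\beta_1 = 0$. The hypotheses of the t-tests are:

$$H_0: \beta_0 = 0 \quad \text{versus} \quad H_1: \beta_0 \neq 0$$

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0$$

The t-values can be computed by:

$$t_{\hat{\beta}_0} = \frac{\hat{\beta}_0 - 0}{s_{\hat{\beta}_0}}, \qquad t_{\hat{\beta}_1} = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} \tag{16}$$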
With the computed t-value, we can decide whether or not to reject the corresponding null hypothesis. Usually, for a given significance level $\alpha$, we can reject $H_0$ when $|t| > t_{\alpha/2}$, where $t_{\alpha/2}$ is the upper $\alpha/2$ critical value of the t distribution with $n^* - 1$ degrees of freedom. Additionally, the p-value is reported with a t-test. We also reject the null hypothesis $H_0$ if the p-value is less than $\alpha$.
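The p-value is the probability, assuming that $H_0$ in the t-test above is true, of obtaining a t-value at least as extreme as the one computed:

$$prob = 2\left(1 - tcdf(|t|, df_{Error})\right) \tag{17}$$

where tcdf(t, df) computes the lower-tail probability for the Student's t distribution with df degrees of freedom.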
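The t-value and its two-sided p-value can be sketched in Python as below. The `tcdf` shown here is only a stand-in written for this example: it integrates the Student's t density numerically with Simpson's rule, which is an assumption of the sketch and not necessarily how any particular package evaluates the distribution. The helper names are hypothetical.

```python
import math


def t_pdf(t, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def tcdf(t, df, steps=2000):
    """Lower-tail probability, by composite Simpson integration from 0 to t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    # The density is symmetric about 0, so the lower tail up to 0 is 0.5.
    return 0.5 + total * h / 3


def t_test(estimate, std_err, df):
    """t-value for H0: parameter = 0, with its two-sided p-value."""
    t = (estimate - 0.0) / std_err
    p = 2 * (1 - tcdf(abs(t), df))
    return t, p


# Illustrative call: an estimate of 2.0 with standard error 1.0 and 1 degree of freedom.
t_value, p_value = t_test(2.0, 1.0, df=1)
print(t_value, p_value)
```

With 1 degree of freedom the t distribution is the Cauchy distribution, which gives a closed form to check the numerical integral against.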
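From the t-value, we can calculate the $(1-\alpha) \times 100\%$ confidence interval for the intercept:

$$\hat{\beta}_0 - t_{(\alpha/2,\, n^*-1)}\, s_{\hat{\beta}_0} \le \beta_0 \le \hat{\beta}_0 + t_{(\alpha/2,\, n^*-1)}\, s_{\hat{\beta}_0} \tag{18}$$

And the $(1-\alpha) \times 100\%$ confidence interval for the slope is:

$$\hat{\beta}_1 - t_{(\alpha/2,\, n^*-1)}\, s_{\hat{\beta}_1} \le \beta_1 \le \hat{\beta}_1 + t_{(\alpha/2,\, n^*-1)}\, s_{\hat{\beta}_1} \tag{19}$$

where $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$ are the standard errors of the estimates. The confidence interval half width is:

$$CI = \frac{UCL - LCL}{2} \tag{20}$$

where UCL and LCL are the upper and lower confidence limits, respectively.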
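A short Python sketch of the confidence limits and half width follows. It assumes the critical t-value is supplied by the caller (for example, read from a t table), so `t_crit` and the function name are hypothetical. The numbers in the call are illustrative: 3.182 is the two-sided 95% critical value for 3 degrees of freedom.

```python
def confidence_interval(estimate, std_err, t_crit):
    """Return (LCL, UCL, half_width) for one fitted parameter."""
    lcl = estimate - t_crit * std_err   # lower confidence limit
    ucl = estimate + t_crit * std_err   # upper confidence limit
    half_width = (ucl - lcl) / 2
    return lcl, ucl, half_width


# Illustrative slope estimate and standard error (assumed values).
lcl, ucl, half_width = confidence_interval(1.96, 0.0383, t_crit=3.182)
print(lcl, ucl, half_width)
```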
Some fit statistics formulas are summarized here.
Degrees of freedom: the error degrees of freedom, $df_{Error}$. Please refer to the ANOVA table for more details.
Residual sum of squares: see formula (14).
The quality of linear regression can be measured by the coefficient of determination (COD), or $R^2$, which can be computed as:

$$R^2 = \frac{SS_{reg}}{TSS} = 1 - \frac{RSS}{TSS} \tag{21}$$

where TSS is the total sum of squares (see formula (27)), $SS_{reg} = TSS - RSS$, and RSS is the residual sum of squares. $R^2$ is a value between 0 and 1. Generally speaking, if it is close to 1, the linear relationship between X and Y is regarded as strong, because the fitted line then accounts for most of the variation in Y.
We can further calculate the adjusted $R^2$ as:

$$\bar{R}^2 = 1 - \frac{RSS / df_{Error}}{TSS / df_{Total}} \tag{22}$$
The R value is the square root of $R^2$:

$$R = \sqrt{R^2} \tag{23}$$
In simple linear regression, the correlation coefficient between x and y, denoted by r, equals:

$$r = \frac{SXY}{\sqrt{SXX \cdot TSS}} \tag{24}$$

where SXY, SXX, and TSS are the corrected sums.
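The root mean square of the error (Root-MSE) equals:

$$\text{Root-MSE} = \sqrt{\frac{RSS}{df_{Error}}} \tag{25}$$

The norm of the residuals equals the square root of RSS:

$$\text{Norm of residuals} = \sqrt{RSS} \tag{26}$$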
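The fit statistics in this part can be gathered into one small Python function. This is a sketch under stated assumptions: unit weights, intercept included (error degrees of freedom n - 2, total degrees of freedom n - 1), and corrected sums throughout. The function name, dictionary keys, and sample data are made up for illustration.

```python
import math


def fit_statistics(x, y, beta0, beta1):
    """R^2, adjusted R^2, R, Pearson's r, Root-MSE and norm of residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)   # corrected total sum of squares
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    df_error, df_total = n - 2, n - 1          # n* - 1 and n*, with n* = n - 1
    r2 = 1 - rss / tss
    return {
        "r2": r2,
        "adj_r2": 1 - (rss / df_error) / (tss / df_total),
        "r_value": math.sqrt(r2),
        "pearson_r": sxy / math.sqrt(sxx * tss),
        "root_mse": math.sqrt(rss / df_error),
        "norm_residuals": math.sqrt(rss),
    }


# Illustrative data and its least-squares estimates (assumed for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
stats = fit_statistics(x, y, beta0=0.22, beta1=1.96)
print(stats)
```

In simple linear regression with an intercept, the square of Pearson's r equals the coefficient of determination, which makes a convenient consistency check on the output.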
The ANOVA table of linear fitting is:
| | df | Sum of Squares | Mean Square | F Value | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | SSreg = TSS - RSS | MSreg = SSreg / 1 | MSreg / MSE | p-value |
| Error | n* - 1 | RSS | MSE = RSS / (n* - 1) | | |
| Total | n* | TSS | | | |
| Note: If intercept is included in the model, n*=n-1. Otherwise, n*=n and the total sum of squares is uncorrected. If the slope is fixed, dfModel = 0. |
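Where the total sum of squares, TSS, is:

$$TSS = \sum_{i=1}^{n} w_i (y_i - \bar{y})^2 \ \text{(corrected)}, \qquad TSS = \sum_{i=1}^{n} w_i y_i^2 \ \text{(uncorrected)} \tag{27}$$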
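The entries of the ANOVA table can be sketched in Python as follows, assuming unit weights and a model that includes the intercept (so n* = n - 1 and the corrected TSS applies). The function name and sample data are hypothetical. The Prob > F column is left out because it needs the F distribution, which this sketch does not implement.

```python
def anova_table(x, y, beta0, beta1):
    """Sums of squares, mean squares and F value for a simple linear fit."""
    n = len(x)
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)   # corrected total sum of squares
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = tss - rss
    df_model, df_error, df_total = 1, n - 2, n - 1
    ms_reg = ss_reg / df_model
    mse = rss / df_error
    return {
        "df": (df_model, df_error, df_total),
        "ss_reg": ss_reg,
        "rss": rss,
        "tss": tss,
        "ms_reg": ms_reg,
        "mse": mse,
        "f_value": ms_reg / mse,
    }


# Illustrative data and its least-squares estimates (assumed for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
table = anova_table(x, y, beta0=0.22, beta1=1.96)
print(table)
```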
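The covariance matrix of linear regression is calculated by:

$$\begin{pmatrix} Cov(\hat{\beta}_0, \hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) \\ Cov(\hat{\beta}_1, \hat{\beta}_0) & Cov(\hat{\beta}_1, \hat{\beta}_1) \end{pmatrix} = s_\varepsilon^2 \begin{pmatrix} \dfrac{\sum w_i x_i^2}{SXX \sum w_i} & -\dfrac{\bar{x}}{SXX} \\ -\dfrac{\bar{x}}{SXX} & \dfrac{1}{SXX} \end{pmatrix} \tag{28}$$

The correlation between any two parameters is:

$$\rho(\hat{\beta}_i, \hat{\beta}_j) = \frac{Cov(\hat{\beta}_i, \hat{\beta}_j)}{\sqrt{Cov(\hat{\beta}_i, \hat{\beta}_i)}\, \sqrt{Cov(\hat{\beta}_j, \hat{\beta}_j)}} \tag{29}$$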
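A minimal Python sketch of the parameter covariance matrix and the correlation between the two parameters is given below. It assumes unit weights and an included intercept, and uses the standard result that the covariance matrix of the estimates is the sample variance times the inverse of the normal-equations matrix. The function name and sample data are illustrative.

```python
import math


def parameter_covariance(x, y, beta0, beta1):
    """Return (2x2 covariance matrix, correlation between intercept and slope)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                                  # sample variance
    var_b0 = s2 * sum(xi ** 2 for xi in x) / (n * sxx)  # squared SE of intercept
    var_b1 = s2 / sxx                                   # squared SE of slope
    cov_b0_b1 = -s2 * x_bar / sxx
    cov = [[var_b0, cov_b0_b1], [cov_b0_b1, var_b1]]
    corr = cov_b0_b1 / math.sqrt(var_b0 * var_b1)
    return cov, corr


# Illustrative data and its least-squares estimates (assumed for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
cov, corr = parameter_covariance(x, y, beta0=0.22, beta1=1.96)
print(cov, corr)
```

The diagonal entries are the squares of the standard errors of the intercept and slope, and the negative off-diagonal entry reflects that, for data with a positive mean of x, a steeper fitted line goes with a lower intercept.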
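For a particular value $x_p$, the $(1-\alpha) \times 100\%$ confidence interval for the mean value of $y$ at $x = x_p$ is:

$$\hat{y} \pm t_{(\alpha/2,\, n^*-1)}\, s_\varepsilon \sqrt{\frac{1}{\sum w_i} + \frac{(x_p - \bar{x})^2}{SXX}} \tag{30}$$

And the $(1-\alpha) \times 100\%$ prediction interval for a new observation of $y$ at $x = x_p$ is:

$$\hat{y} \pm t_{(\alpha/2,\, n^*-1)}\, s_\varepsilon \sqrt{1 + \frac{1}{\sum w_i} + \frac{(x_p - \bar{x})^2}{SXX}} \tag{31}$$

where $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_p$ is the fitted value at $x_p$, and $\sum w_i = n$ when all weights are 1.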
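The two intervals at a point can be sketched in Python as below, assuming unit weights, an included intercept, and a critical t-value supplied by the caller. The function name, `t_crit`, and the sample data are hypothetical. The prediction half width adds 1 under the square root for the variance of a single new observation.

```python
import math


def intervals_at(x, y, beta0, beta1, x_p, t_crit):
    """Return (y_hat, confidence half width, prediction half width) at x_p."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))                 # residual standard deviation
    y_hat = beta0 + beta1 * x_p                  # fitted value at x_p
    spread = 1 / n + (x_p - x_bar) ** 2 / sxx
    conf_hw = t_crit * s * math.sqrt(spread)         # interval for the mean of y
    pred_hw = t_crit * s * math.sqrt(1 + spread)     # interval for a new observation
    return y_hat, conf_hw, pred_hw


# Illustrative data; 3.182 is the two-sided 95% critical t-value for 3 degrees of freedom.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
y_hat, conf_hw, pred_hw = intervals_at(x, y, 0.22, 1.96, x_p=3.0, t_crit=3.182)
print(y_hat - conf_hw, y_hat + conf_hw)
print(y_hat - pred_hw, y_hat + pred_hw)
```

Both intervals are narrowest at the mean of x and widen as $x_p$ moves away from it, and the prediction interval is always the wider of the two.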
Assuming the pair of variables (X, Y) conforms to a bivariate normal distribution, we can examine the correlation between the two variables using a confidence ellipse. The confidence ellipse is centered at ($\bar{x}$, $\bar{y}$), and the major semiaxis a and minor semiaxis b equal:
|
|
(32) |
For a given confidence level of $(1-\alpha)$:
|
|
(33) |
|
|
(34) |
|
|
(35) |