Chapter 6 Linear multiple regression models
6.1 Introduction
In multiple regression analysis we consider situations where several independent (\(X\)) variables simultaneously affect a single dependent variable (\(Y\)). With this section and its extensions, we will learn about an extremely powerful tool for modeling, analysing and understanding how variables (for example in economics) act on each other. The range of applications goes far beyond economics; it covers any field of interest where random variables are used to represent quantities. We will start by looking at linear models. In the following chapters we will see that much more general models can be obtained by extending the basic linear framework in simple ways.
A typical multiple regression model looks like \[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \mathcal{E}\;. \tag{6.1} \end{equation}\] This model represents \(Y\) as a linear function of three \(X\) variables plus an error term. The interpretation of the error term is the same as before, it represents all unknown factors that affect \(Y\), in addition to the assumed effects captured by the dependency on \(X_1, X_2, X_3\).
Even though our models are linear at first, we will see that nonlinear effects can easily be included as well. The models can in principle have any number of \(X\) variables.
A few examples of variables that are functions of several other variables/factors:
Example 6.1
The demand for airline tickets is a function of price, prices of competing services, advertising efforts etc.
The price of a share of the Yara stock is a function of gas prices, agriculture product prices and raw material prices. (Yara is a large producer of fertilizers - http://www.yara.com/)
The duration of an intracity delivery route is a function of the number of stops, the traffic intensity and the length of the route.
With a representative sample of data on all the relevant variables, coefficient estimates can be obtained for the model in equation (6.1).
The basic purposes of using multiple regression models are
- To understand relations between variables. For example: In a passenger transport market, is there a significant relation between train ticket prices and airline travel demand? If so, how much will air travel demand increase if train ticket prices increase by 5%?
- To predict values or changes in a dependent variable, based on changes or values of independent variables. For example, given airline ticket prices, prices of competing services and other relevant factors, what is our best prediction of airline ticket sales?
Clearly, when the estimated model is available, we have specific numerical estimates \(b_i\) for each coefficient, and with given values for the independent variables \(x_i\), we can produce the predicted value for the dependent variable as \[\begin{equation} \hat{y} = b_0 + b_1 x_1 +b_2 x_2 + \cdots + b_k x_k \;, \tag{6.2} \end{equation}\] for a model with \(k\) independent variables.
6.2 Parameter estimation and model assumptions for multiple regression.
We will now discuss briefly the idea behind parameter estimation in multiple regression, then show a basic example using R. We then proceed to discuss model assumptions.
6.2.1 Parameter estimation
To maximize the readability, let us stick to a case where we have three independent variables, \(X_1, X_2, X_3\). We want to estimate parameters for the model as written in equation (6.1). For that purpose we need data, which will be a set of observations of all variables together. In tabulated form, data would appear as
\(Y\) | \(X_1\) | \(X_2\) | \(X_3\)
---|---|---|---
\(y_1\) | \(x_{1,1}\) | \(x_{2,1}\) | \(x_{3,1}\)
\(y_2\) | \(x_{1,2}\) | \(x_{2,2}\) | \(x_{3,2}\)
\(y_3\) | \(x_{1,3}\) | \(x_{2,3}\) | \(x_{3,3}\)
⋮ | ⋮ | ⋮ | ⋮
\(y_n\) | \(x_{1,n}\) | \(x_{2,n}\) | \(x_{3,n}\)
Here, \(y_j\) is observation number \(j\) of the dependent variable \(Y\), while \(x_{i,j}\) represents observation number \(j\) for variable \(X_i\). In this case, \[i= 1, 2, 3 \quad \mbox{and} \quad j = 1, 2, 3, \ldots, n\;. \]
The estimated model would in this case look like \[ y = b_0 + b_1 x_1 +b_2 x_2 + b_3 x_3 \;. \]
The values for the estimates \(b_0, b_1, b_2, b_3\) are determined in the same way as in the single-variable regression, by minimizing the \(SSE\), i.e. the sum of squared errors for a given set of parameter values.
\[ SSE = \sum_j (y_j - (b_0 + b_1 x_{1,j} + b_2 x_{2,j} + b_3 x_{3,j}))^2 = \sum_j (y_j - \hat{y}_j)^2 \]
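Just to make the quantity concrete, here is a small sketch (with made-up data vectors and one particular set of candidate coefficient values, purely for illustration) of how the SSE could be computed; the least squares estimates are the coefficient values that make this sum as small as possible.
#made-up data vectors (purely illustrative)
y  <- c(10, 12, 15, 9)
x1 <- c(1, 2, 3, 1)
x2 <- c(5, 4, 6, 5)
x3 <- c(0, 1, 1, 0)
#one particular set of candidate coefficient values
b0 <- 8; b1 <- 1; b2 <- 0.5; b3 <- -1
#fitted values and the resulting sum of squared errors
yhat <- b0 + b1 * x1 + b2 * x2 + b3 * x3
SSE <- sum((y - yhat)^2)
SSE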
Obviously, such estimation will be done using software, e.g. R. Let us take a first look at an example of how we get the estimated model and
some interpretations from R. We will look at data for used cars, and see how the selling prices depend on certain characteristics like age, price as new, mileage. The data set is called used_cars.csv
and we need to read it into R and have a first look.
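A minimal way to do this (assuming the file used_cars.csv lies in the current working directory; adapt the path to your own file structure) could be:
#read the used cars data (adapt the file path to your own folder structure)
cardata <- read.csv("used_cars.csv")
#have a first look at the data
head(cardata)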
## price age agecat mileage statwag newprice region eu_time corros report
## 1 69.11825 17 3 226 0 310 0 3 2 0
## 2 56.30901 16 3 186 0 270 0 5 2 0
## 3 141.90664 7 1 43 1 230 0 22 1 0
## 4 262.93979 2 1 45 0 320 0 0 0 1
## 5 9.00000 17 3 183 0 200 0 10 2 0
## 6 110.20784 10 2 117 0 240 1 15 2 1
The prices are x 1000 NOK; we will also be using the age, mileage (driven distance x 1000 km) and newprice (x 1000 NOK) variables here. We leave the others for now. So we want to run a regression with price as the dependent variable and those three variables as independents (\(X\)). We will again be using the lm function and an R formula. This is an expression of the form y ~ x1 + x2 + x3, indicating the dependent and independent variables. In addition we need to specify the dataframe to be used. We like to save the result in a regression object, and then inspect this object with different functions. Goes like this:
#run regression, save result.
pricereg <- lm(price ~ age + mileage + newprice, data = cardata)
#look at estimated model
coef(pricereg)
## (Intercept) age mileage newprice
## 5.1487847 -10.2882781 -0.1285636 0.8956327
This means we estimated the following model.
\[ price = 5.15 - 10.29\cdot age - 0.129\cdot mileage + 0.896\cdot newprice \;.\]
So what do these estimates tell us? For one thing, we can interpret the estimates as marginal effects. For example, if two cars have the same mileage and newprice, one extra year of age will (on average) reduce the price by 10.29, i.e. a little more than 10 000 NOK. In other words, -10.29 is the estimated marginal effect of age, assuming all other factors in the model are fixed. So what is the effect of an added 1000 km of mileage? Or 100 000 km?
Another use of such an estimated model is to predict prices for cars not yet sold. So two cars with the following characteristics
## age mileage newprice
## 1 5 80 400
## 2 10 120 380
would have expected selling prices at approximately 301 000 and 227 000, as can be seen by inserting the \(x\) values in the estimated model, e.g.
\[\begin{align*} \hat{y} & = 5.15 - 10.29\cdot 5 - 0.129\cdot 80 + 0.896\cdot 400 \approx 301 \\ \hat{y} & = 5.15 - 10.29\cdot 10 - 0.129\cdot 120 + 0.896\cdot 380 \approx 227 \end{align*}\]
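These hand calculations can also be done quickly in R with the predict function (covered in more detail in section 6.3.2). A small sketch, where the data frame name newcars is just chosen for this illustration:
#characteristics of the two cars
newcars <- data.frame(age = c(5, 10), mileage = c(80, 120), newprice = c(400, 380))
#predicted selling prices from the estimated model pricereg
predict(pricereg, newcars)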
So already, we can see two distinct purposes of such a model, (i) to understand and quantify the relationship between variables, (ii) to exploit the relationship in making predictions.
Before going deeper into more details about R in regression, we need to discuss a couple of topics:
6.2.2 Model assumptions
The model with three independent variables would read in “data form” \[ Y_j = \beta_0 + \beta_1 X_{1,j} + \beta_2 X_{2,j} + \beta_3 X_{3,j} + \mathcal{E}_j\;. \]
The basic assumptions A - D are equivalent to the single variable case, while an additional assumption E needs to be added for the multiple regression model.
A. The data \((X_{1, j}, X_{2,j}, X_{3,j}, Y_j)\) represent a random sample from the population we study.
B. The error terms all have the same normal distribution with \(\mbox{E}{[\mathcal{E}_j]} = 0.\)
C. The error term is statistically independent of all \(X\) variables in the model.
D. For all \(j\), the standard deviation is the same, \(\mbox{Sd}{[\mathcal{E}_j]} = \sigma_e\).
E. No \(X\) variable can be a linear function of the other \(X\) variables.
Assumptions A - D are essentially the same as for the two-variable case. Assumption E is referred to as the “no multicollinearity” assumption. It states that none of the \(x\) variables can be completely expressed as a linear function of other \(x\) variables.
Consider the following example to see what happens if assumption E is violated.
Example 6.2 (Breakdown of assumption E)
Suppose we want to have a model with two \(x\) variables,
\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \mathcal{E}\;. \tag{6.3} \end{equation}\]
Suppose that a perfect linear dependency exists between the \(X\) variables. Then for
some constants \(a, b\), we have
\[ X_2 = a + b X_1 \;. \]
Suppose for simplicity that \(a = 2, b = 4\).
Inserting this into equation (6.3), we get
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 (2 + 4 X_1) + \mathcal{E}. \]
Rearrange this to get
\[ Y = (\beta_0 + 2\beta_2) + (\beta_1 + 4 \beta_2)X_1 + \mathcal{E}\;. \]
What is the problem with this? The problem is that there are infinitely many
choices of parameter values \(\beta_1, \beta_2\) that lead to the same coefficient
for \(X_1\) in the resulting model.
E.g. \(\beta_1 = 400, \beta_2 = 25\) gives the same coefficient as \(\beta_1 = 0, \beta_2 = 125\).
In other words, in the original formulation (6.3), we try to put weights
on two factors (\(X_1, X_2\)) for \(Y\) variation. The linear dependency means that \(X_1\) and \(X_2\)
are in fact identical in terms of variation, so it becomes impossible to say how much weight to
put on either. A meaningful estimation of parameters becomes impossible. We may say that the
information given by \(X_2\) is completely redundant, because all of the variation pattern is
already included via \(X_1\).
You will find that R will ignore such redundant variables if you try to include them in a
regression analysis. There will be NA
values for some coefficient estimates, as a sign that
you should check the correlation between the independent variables.
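As a small illustration of this behaviour, here is a sketch using the used cars data from above and a constructed variable age2 = 2 + 4·age (the names cars2 and age2 are just chosen for this illustration):
#construct a variable that is an exact linear function of age (for illustration only)
cars2 <- cardata
cars2$age2 <- 2 + 4 * cars2$age
#R drops the redundant variable; its coefficient is reported as NA
coef(lm(price ~ age + age2 + mileage + newprice, data = cars2))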
As the example shows, full correlation between \(x\) variables makes the model break down, and parameter estimation becomes impossible. In real life, in particular with economics-related variables, it is very common that some correlation exists between \(x\) variables. So what happens if, for instance, the correlation between \(X_1\) and \(X_2\) is say 0.70 or 70%? Thinking as in the example above, we see that \(X_1\) already gives most of the information included in \(X_2\), but there is also some unique variation in \(X_2\) not seen in \(X_1\). When, in the original model (6.3), we try to explain the variation of \(Y\) related to the variation of \(X_1, X_2\), we run into trouble because much of the variation pattern in these variables is common. We do not know “who to blame” for making \(Y\) vary. The practical consequence is that our parameter estimates become more uncertain the more correlation there is between the \(x\) variables. It is often wise to combine highly correlated \(x\) variables into common representative variables. For example, if a model involves interest rates in three different Norwegian banks as three \(x\) variables, it will typically be better to combine them into a single \(x\) variable, say an average rate among the three banks.
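Before estimating a model it can therefore be useful to inspect the correlations among candidate \(x\) variables. A small sketch with the used cars data (output not shown):
#pairwise correlations between the candidate independent variables
cor(cardata[, c("age", "mileage", "newprice")])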
6.2.3 Omitted variable bias
A further consequence of correlated \(x\) variables - a consequence that we will see repeatedly in our work - is that coefficient estimates for a given \(X_i\) variable may change radically depending on whether other correlated variables are included or not. As an example, consider again the model for prices of used cars that we started with in section 6.2.1.
In our first modeling approach we assume that the price \(Y\) is a function of \(X_1, X_2, X_3\), which are, in that order: 1. the age when sold, 2. the mileage (driven distance) when sold, and 3. the price of the car as new.
Clearly, the \(X_1, X_2\) variables would be highly correlated. If we include only \(X_2, X_3\), we would get a model A \[\begin{equation} Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \mathcal{E}\;. \tag{6.4} \end{equation}\] In this model, the \(X_2\) variable in a way plays two roles: firstly, it directly carries the information about mileage; secondly, it indirectly carries information about age. So the \(\beta_2\) coefficient here represents the combined effect of two price-driving factors. On the other hand, consider the full model B \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \mathcal{E}\;. \] Here, age information is directly included. What does \(\beta_2\) measure now? It measures the marginal effect of added mileage on cars of the same age. This is because age is explicitly included in the model, and the extra information in \(X_2\) is used to adjust for different mileages. One should note from this the fundamental fact:
The estimated effect of a variable in a multiple regression model may depend heavily on what other variables are included/excluded.
Therefore
The effect of a variable in a multiple regression model must always be understood and interpreted in light of the whole model, not in isolation.
E.g. when we speak of the estimated effect of mileage on prices of used cars, we must be clear about what we mean: (i): mileage as an indicator also of age (model A), or (ii): mileage as a corrective price factor to age (model B).
In fact, the coefficient \(\beta_2\) appearing in models A and B is really two different coefficients, representing two quite different effects. Therefore, one should ideally mark them with different notation, e.g. \(\beta^A_2, \beta^B_2\). The same goes for the other coefficients and the error term \(\mathcal{E}\) as well. It is, however, not customary in statistics texts to use such notation, so we must always be careful about what we mean when referring to coefficients.
When we leave age out of the model and estimate \(\beta_2\) from model A, we are likely to get a wrong estimate of the “true” marginal effect of mileage. This kind of estimation error is usually called omitted variable bias. It occurs when an omitted variable is (i) correlated with an included independent variable and (ii) correlated with the dependent variable. We can look specifically at the difference in the estimates for mileage, since we already have the data. We can compare the two models as follows, using the stargazer package to display results with confidence intervals for the parameters. (See section 5.8.)
library(stargazer)
#model A: without age
car_regA <- lm(price ~ mileage + newprice, data = cardata)
#model B: full model, including age
car_regB <- lm(price ~ age + mileage + newprice, data = cardata)
stargazer(car_regA, car_regB,
type = "text",
ci = TRUE,
column.labels = c("model A", "model B"))
##
## =======================================================================
## Dependent variable:
## ---------------------------------------------------
## price
## model A model B
## (1) (2)
## -----------------------------------------------------------------------
## age -10.288***
## (-10.950, -9.627)
##
## mileage -0.782*** -0.129***
## (-0.845, -0.719) (-0.176, -0.081)
##
## newprice 0.928*** 0.896***
## (0.859, 0.998) (0.870, 0.921)
##
## Constant -21.015* 5.149
## (-43.174, 1.145) (-3.051, 13.348)
##
## -----------------------------------------------------------------------
## Observations 143 143
## R2 0.905 0.988
## Adjusted R2 0.904 0.987
## Residual Std. Error 27.039 (df = 140) 9.792 (df = 139)
## F Statistic 668.698*** (df = 2; 140) 3,708.623*** (df = 3; 139)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The notable difference here is of course the estimated mileage coefficient in models A and B. One would say that the model A estimate is strongly affected by the omission of the variable age, i.e. there is strong omitted variable bias here.
6.3 Some more on multiple regression using R - update and predict
We have seen how to get going with multiple regression in R. We now turn to another example.
Let us consider the data with prices of sold flats. We want to look at models that can assist in forecasting prices of flats, given information on other variables, like size, distance to the town center etc. That means we want regression models where \(Y\) is the selling price, and the \(x\) variables are other variables recorded in the data file. We should note that these data were recorded more than 10 years ago, so the prices here do not in any way reflect today's price level. This is however unimportant for our purposes.
At this point we should only include \(x\) variables that come with a natural scale and order to them. We will not be able to model e.g. differences in price levels between the three towns at this moment. We will learn such methods in the next chapter.
Let’s read the data, and have a look. (As always the code below is adapted to a special file structure, so the file path “Data/flat_prices.csv” should be adapted to your own structure.)
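A minimal version of that code could look as follows (a sketch assuming the data are read with read.csv, and using the dataframe name flatprices that appears in the code further below):
#read the flat prices data (adapt the file path to your own structure)
flatprices <- read.csv("Data/flat_prices.csv")
#have a first look at the data
head(flatprices)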
## price area rooms standard situated town distcen age rent
## 1 1031 100 3 2 6 1 5 15 2051
## 2 1129 116 3 1 5 1 4 42 2834
## 3 1123 110 3 2 5 1 3 25 2468
## 4 607 59 2 3 5 1 6 25 1940
## 5 858 72 2 3 4 1 1 17 1611
## 6 679 64 2 2 3 1 3 17 2039
So, we see the price (x 1000 NOK), the area in square meters, the number of rooms, and measures of standard and location (situated). There is a categorical variable town, the distance to the city centre (distcen), the age, and the monthly fixed cost called rent.
6.3.1 The update function
We can try to play with a few models here. When working with data like this we often want to compare models with and without certain variables included. A useful tool in such cases is the update function. As we know, the lm regression function creates a regression object. If we want to compare models that differ by only, say, one added variable, we can just update an existing regression. We show two equivalent R codes below.
#run a model A
flatregA <- lm(price ~ area + standard + situated + distcen, data = flatprices)
#run a modelB where also age is included,
flatregB <- lm(price ~ area + standard + situated + distcen + age, data = flatprices)
That code is completely equivalent to the following
#run a model A
flatregA <- lm(price ~ area + standard + situated + distcen, data = flatprices)
#update model A by adding age, assuming the same data set.
flatregB <- update(flatregA, . ~ . + age)
The idea is that we take the object flatregA and modify it by adding the variable age. So we update flatregA with the new formula . ~ . + age. In this formula the . means “copy what was originally there when estimating model A”. You are of course completely free to use either the first form, where the whole formula is repeated, or the update form.
We can look at the results of the two models:
#print the regression results, showing only coefficient estimates
stargazer(flatregA, flatregB,
type = "text",
column.labels = c("modelA", "modelB"),
keep.stat=c("rsq", "n"),
single.row = TRUE,
report = "vc*")
##
## =========================================
## Dependent variable:
## ----------------------------
## price
## modelA modelB
## (1) (2)
## -----------------------------------------
## area 9.027*** 9.026***
## standard -3.544 -3.507
## situated 19.354*** 19.356***
## distcen -19.922*** -19.918***
## age -0.108
## Constant 80.101** 82.758**
## -----------------------------------------
## Observations 150 150
## R2 0.932 0.932
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What this means is essentially that age (and standard) are not significant predictors for price. (We will be more precise about this in the next section, but basically the “***” marking indicates strongly significant factors.)
6.3.2 The predict function
We now turn briefly to the suggested purpose of these models, to predict prices given characteristics of flats in the market. We can obtain such predictions in different ways. Let us now focus on the model A above. Suppose we have the following data on three flats for sale.
## item area standard situated distcen
## 1 flat1 100 3 5 3.4
## 2 flat2 80 4 1 4.3
## 3 flat3 90 1 5 1.3
We want model A’s predicted (forecasted) prices for these three items. Of course, we could manually put each of the flats into the estimated equation and get the answers. R has a quicker solution. Put the new data into a data frame called e.g. selling, then use the predict function with the regression object for model A and the selling dataframe. Like this.
#set up the data frame
selling <- data.frame(area = c(100, 80, 90), standard = c(3, 4, 1),
situated = c(5, 1, 5), distcen = c(3.4, 4.3, 1.3))
#compute predicted values:
predict(flatregA, selling)
## 1 2 3
## 1001.2044 721.7749 959.8579
So, we get predicted prices at about 1000, 720, and 960 (x 1000) NOK, based on model A. We will later see how to also get (say) 95% error margins for such forecasts.
Next we turn to an all-important topic in regression:
6.4 Inference in multiple regression models.
Following the notation from chapter 4, we would write the estimate for parameter \(\beta_i\) as \(b_i\), and its standard error as \(S_{b_i}\).
Inference about the regression coefficients \(\beta_i\) follow the same reasoning as for the simple regression model. The key is the sampling distribution of coefficient estimators \(b_i\), and the basic result is that the ratio \[ T = \frac{b_i - \beta_i}{S_{b_i}}\;, \] is again \(t\) distributed, now with degrees of freedom \(DF = n-k-1\), where \(n\) is the sample size, \(k\) is the number of independent (\(x\)) variables in the model.
The standard test of \[ H_0: \beta_i = 0 \ \mbox{vs.} \ H_1: \beta_i \ne 0 \] is always performed by R, and the resulting \(t\) statistic and corresponding P-values are shown in output.
The test statistic is as before \[ T = \frac{b_i}{S_{b_i}}\;, \] and it now has a t distribution with \(DF = n-k-1\).
We can link this to, for example, model B that we used above for the flat prices. According to the stargazer output, where the P-values in the standard test are indicated by 0 - 3 stars, the variable age is not considered significant as a price predictor. This actually means that we cannot reject the null hypothesis that the true coefficient is 0. What we would normally do in such a case is to rerun the regression without this variable. We have already done that, resulting in model A. In model A there is still a variable which appears to be non-significant, namely standard. We can also remove that and end up with a model C. Now we can use summary to really see the P-values and so on for this regression.
#update model A by removing variable "standard".
flatregC <- update(flatregA, . ~ . - standard)
summary(flatregC)
##
## Call:
## lm(formula = price ~ area + situated + distcen, data = flatprices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.97 -54.41 3.64 53.37 157.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.8421 29.2524 2.524 0.0127 *
## area 9.0257 0.2078 43.435 < 2e-16 ***
## situated 19.0741 3.7464 5.091 1.08e-06 ***
## distcen -19.8060 3.3852 -5.851 3.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.59 on 146 degrees of freedom
## Multiple R-squared: 0.9321, Adjusted R-squared: 0.9307
## F-statistic: 668.4 on 3 and 146 DF, p-value: < 2.2e-16
We now see that each of the three predictors involved is significant.
If there is more than one insignificant variable in an estimated model, caution should be exercised when “throwing out” \(x\) variables. If the variables are correlated, there may be several apparently insignificant variables, and removing one of them may change the picture for the others completely. Thus: when removing variables, take one at a time and rerun the regression. (Or use built-in procedures to assist.)
With no additional complications, we can generalize the standard coefficient test setup to allow testing any hypothesis of the form \[ H_0: \beta_i = \beta^* \ \mbox{vs.} \ H_1: \beta_i \ne \beta^*\;, \] where again \(\beta^*\) is some constant value. These tests may also be one-sided. To perform this test, we use the test statistic \[ T = \frac{b_i - \beta^*}{S_{b_i}}\;. \] The null distribution is a \(t\) distribution with \(DF = n-k-1\). We note that the observed \(T\) value can be computed from R output figures, and that we can often approximate the \(t\) distribution with the \(N(0,1)\) distribution. So, as an example: if we denote the distance coefficient in model C above by \(\beta_3\), we find \(b_3 = -19.81, S_{b_3} = 3.385\). If we wanted to test \[ H_0: \beta_3 = -30 \ \mbox{vs.} \ H_1: \beta_3 > -30\;, \] we would find the observed \(T\) value as \[ T_{OBS} = \frac{-19.81 - (- 30)}{3.385} = 3.01 \;. \] The corresponding P-value is about 0.0017, meaning we should reject the null hypothesis in this case.
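If we want R to do the numeric work here, the P-value can be computed directly from the t distribution; a small sketch using the values for model C above, where \(n - k - 1 = 146\):
#observed test statistic for H0: beta_3 = -30 vs. H1: beta_3 > -30
t_obs <- (-19.81 - (-30)) / 3.385
#one-sided P-value from the t distribution with n - k - 1 = 146 degrees of freedom
1 - pt(t_obs, df = 146)
#normal approximation, for comparison
1 - pnorm(t_obs)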
In addition to testing hypotheses, we are often interested in confidence intervals for regression coefficients. In case of multiple regression, we have the following expression for the limits in a \((1 - \alpha)\cdot 100 \%\) confidence interval for \(\beta_i\).
\[\begin{equation} b_i \pm t^{^{n-k-1}}_{_{\alpha/2}} S_{b_i}\; . \tag{6.5} \end{equation}\] This expression is almost identical to equation (5.6), where the only difference is regarding the degrees of freedom for the t distribution.
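In R, such confidence intervals can be obtained directly from a fitted regression object with the confint function. A small sketch for the model C object estimated above, together with a manual check of equation (6.5) for the distcen coefficient (values taken from the summary output):
#95% confidence intervals for all coefficients in model C
confint(flatregC, level = 0.95)
#manual version of equation (6.5) for the distcen coefficient
-19.8060 + c(-1, 1) * qt(0.975, df = 146) * 3.3852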
6.4.1 Some model building considerations
In the search for the structure of a regression model explaining the relation between a dependent \(Y\) variable and a set of independent \(x\) variables, there are two main approaches.
- Use theory. For example, economic theory will often suggest the structure of dependency between certain variables, e.g. demand and prices.
- Exploration. In lack of appropriate suggestions from theory, one must try out reasonable model structures. Simplicity should always be a guideline when we search for a model in a more exploratory fashion.
6.4.2 Splitting variation for \(Y\)
Suppose we have estimated a model for a dependent variable, say \[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \;. \] As in the two-variable case, the total variation \(SST\) of the \(y\) data can be decomposed into two parts: (i) \(SSR\), measuring the variation related to \(x\) variable dependency, and (ii) \(SSE\), measuring the variation attributed to random effects. The splitting of variation then reads \[ SST = SSR + SSE \;.\] Now, a fundamental fact is that the split itself, i.e. the sums \(SSR, SSE\), depends entirely on the choice of \(x\) variables, while the total variation \(SST\) is a fixed value, depending only on the \(y\) data. We define the \(R^2\) measure as before, as the fraction of total variation that our chosen \(x\) variables can explain, i.e. \[ R^2 = \frac{SSR}{SST}\;. \] In principle, we would like to explain as much as possible of the \(Y\) variation with our model. That means we basically want \(R^2\) as large as possible. The problem with \(R^2\) in multiple regression is that every additional \(x\) variable we use in the model will increase \(SSR\) and therefore also \(R^2\). The change in \(R^2\) is usually substantial when adding a significant \(x\) variable, and very small when adding a non-significant \(x\) variable. Our goal in setting up a model (by choosing the structure and the participating \(x\) variables) can then be formulated as
- Explain as much \(Y\) variation as possible by using only significant predictor variables.
6.4.2.1 Adjusted \(R^2\)
A variant of the \(R^2\) is the so-called “adjusted \(R^2\).” It takes into account the fact that adding \(x\) variables always increase \(SSR\). By compensating for this we get a measure of explained variation that only increases when significant \(x\) variables are added. The expression for this is \[ R^2_{adj} = 1 - \frac{SSE}{SST}\cdot\frac{n-1}{n-k-1}\;, \] where \(n\) is the sample size, \(k\) is the number of \(x\) variables in the model. Note that the original \(R^2\) can be expressed as \[ R^2 = 1 - \frac{SSE}{SST}\;. \] This reveals how the adjusted \(R^2\) is related to the original, by a correction factor. The two values are almost always quite close. It is commonly understood that the adjusted \(R^2\) is better suited for evaluating the real explanatory power of multiple regression models.
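As a quick check of this relation, we can reproduce the adjusted \(R^2\) reported for model C from the ordinary \(R^2\) (using \(R^2 = 0.9321\), \(n = 150\), \(k = 3\) from the summary output above):
#adjusted R-squared computed from the ordinary R-squared, model C
R2 <- 0.9321; n <- 150; k <- 3
1 - (1 - R2) * (n - 1) / (n - k - 1)
## about 0.9307, matching the "Adjusted R-squared" in the summary output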
6.4.2.2 Prediction error
As in the case of simple regression, we are very interested in estimating the \(\sigma_e\) parameter in multiple regression models. This is the standard deviation of the error term \(\mathcal{E}\), and it is estimated by \[\begin{equation} S_e = \sqrt{\frac{SSE}{n-k-1}} \;, \tag{6.6} \end{equation}\] where now \(n\) is the number of observations and \(k\) is the number of independent variables. An example of how one can use this estimate, and where we get it via R, is given below.
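For reference, the estimate can also be extracted directly from a fitted model object in R; a small sketch for the model C object estimated above:
#the residual standard error, i.e. the estimate S_e, for model C
summary(flatregC)$sigma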
6.5 More on \(R^2\) and prediction in multiple regression.
The essential R output from a linear regression estimation is obtained with the summary
function as above for model C. One fundamental number is the R-square (or the adjusted R-square).
In this case, the adjusted and regular \(R^2\) are almost identical, at about 0.93.
Another important part of this output is the number called “Residual standard error”. This is precisely the estimate given in equation (6.6). We find \(S_e = 74.59\).
We can utilize this estimate as follows.
If we look at a flat with area 100 \(m^2\), a situated rating of 4, and a distance to the city centre of 5 km, we would guess the price at
\[ y^* = 73.8 + 9.03\cdot 100 + 19.1\cdot 4 - 19.8 \cdot 5 = 954 \;, \]
i.e 954 000 NOK. A very relevant question is then: How precise is this guess (or prediction)? Then recall that the error term is exactly supposed to describe the variation in prices for any flat with these characteristics, so for a given flat we would get the price at something like
\[ 954 + \mathcal{E}\;.\]
By the assumptions on \(\mathcal{E}\) this means the price should be normally distributed, with mean 954 and standard deviation \(\sigma_e\).
That means the real price should, with 95% probability, be within 2 standard deviations of 954, and since \(S_e\) is our best guess for the standard deviation, we take \(2\cdot S_e\) as our (approximate) 95% margin of error. Looking at the R summary output, we find
\[ S_e = 74.6 \;,\]
so to get what we call a 95% prediction interval for the price, we take
\[ 954 \pm 2 \cdot 74.6 = 954 \pm 149.2\;.\]
This shows that when we only consider the three independent variables in this example, we need to accept relatively uncertain predictions. We will see in exercises that adding more independent variables can often improve on this margin of error, i.e. \(S_e\) becomes smaller.
So, while it is good to know a little about the theory behind this, we can also benefit from the quick calculations that R can provide. Again we will use the predict function, but now also specifying the parameter interval = "p", which means we ask R to calculate the prediction interval, by default a 95% interval. Goes like this.
#make dataframe with new flat data:
newflat <- data.frame(area = 100, situated = 4, distcen = 5)
#make prediction with 95% interval:
predict(flatregC, newflat, interval = "p")
## fit lwr upr
## 1 953.6793 805.1334 1102.225
The lwr, upr columns give the lower and upper limits for the prediction, while fit is the actual predicted price. This gives the exact prediction interval; our approximate interval above was [804.8, 1103.2], so we are practically at the same interval.