Chapter 6 Linear multiple regression models

Exercise 6.1 The data file for this exercise is Wages.csv. You’ll find it in the usual place. The file contains data on hourly wages and possibly related variables for \(n = 1289\) US workers. The data are from 1995, and worker ages range from 18 to 65. In this exercise we use only a few of the variables and study their possible effect on wage levels. We restrict attention to the continuous variables:

  • \(Y =\) wage = hourly wage in US$.
  • \(X_1 =\) education = total number of years of schooling.
  • \(X_2 =\) exper = estimated years of work experience.
  • \(X_3 =\) age.

We should note that the \(X_2\) variable was not directly observed as exact years of work experience, but rather computed by the formula \[ X_2 = X_3 - X_1 - 6\;. \]

  1. Try to explain why the formula for \(X_2\) makes (some) sense. For which workers will the formula give the correct value of exper? For which workers will it give a wrong value? Do you think the formula gives correct values more often for females or for males, or equally often for both genders?

  2. One might be tempted to estimate a linear regression model using \(Y\) as the dependent variable and \(X_1, X_2, X_3\) as independent variables, so as to capture the simultaneous effect of age, education and work experience on wage levels. Why is this not possible with these data? Try with R and see what you get…

  3. Let’s define model A as the simple model using only education as a wage driver, formally: \[ Y = \beta_0 + \beta_1 X_1 + \mathcal{E}\;. \] Estimate the parameters of this model and find \(R^2\). Also find a 95% confidence interval for the \(\beta_1\) parameter.

  4. In model B, we add the variable \(X_2\) to the equation, to get \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \mathcal{E}\;. \] Does it seem that both \(X\) variables significantly affect \(Y\)? Are the signs of the effects as expected? What happens to \(R^2\) when adding \(X_2\)? Does the inclusion of \(X_2\) appear to change the effect of \(X_1\) substantially? Why or why not?

  5. What is the estimated value of an extra year of education/work experience according to model B? (In terms of US$ per hour for the worker).

  6. If Maria and Jane have work experience of 10 and 15 years respectively, and both have 12 years of education, find forecasts for their wages, with 95% approximate error margins.

  7. Compared to Maria, will model B predict a different wage for Jim, who has the same education and experience? Why or why not? (You are allowed to assume Maria is female and Jim is male.)
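
The R calls needed for questions 2 to 6 can be sketched as follows. This is only a minimal sketch on synthetic data: the column names (wage, education, exper, age) are assumed to match Wages.csv, and all simulated numbers are made up, so the output says nothing about the real data. With the real file, one would presumably start from `wages <- read.csv("Wages.csv")` instead of the simulation step.

```r
# Synthetic stand-in for Wages.csv. Every number here is hypothetical.
set.seed(1)
n <- 200
education <- sample(8:18, n, replace = TRUE)
# Draw ages in 18..65, but never below education + 6, so that exper >= 0.
age <- pmax(sample(18:65, n, replace = TRUE), education + 6)
exper <- age - education - 6          # same formula as in the exercise
wage <- 2 + 0.9 * education + 0.15 * exper + rnorm(n, sd = 3)
wages <- data.frame(wage, education, exper, age)

# Question 2: try all three regressors at once and inspect the coefficients.
fit_all <- lm(wage ~ education + exper + age, data = wages)
print(coef(fit_all))

# Questions 3 and 4: model A and model B.
fit_A <- lm(wage ~ education, data = wages)
fit_B <- lm(wage ~ education + exper, data = wages)
print(summary(fit_A)$r.squared)
print(summary(fit_B)$r.squared)
print(confint(fit_A, "education", level = 0.95))   # interval for beta_1
print(summary(fit_B)$coefficients)                 # estimates, t and p values

# Question 6: forecasts with 95% prediction intervals.
new_workers <- data.frame(education = c(12, 12), exper = c(10, 15))
pred <- predict(fit_B, newdata = new_workers,
                interval = "prediction", level = 0.95)
print(pred)
```

Here `predict(..., interval = "prediction")` is one standard way to obtain error margins for individual forecasts; the compendium's "approximate" margins may be defined differently, so check its definition before comparing numbers.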

Exercise 6.2 Find the file used_cars.csv on Canvas. Try to run R regressions with the linear models A and B from the section “omitted variable bias” in the compendium, with the variable definitions given there. Make sure to produce 95% confidence intervals for the coefficients.

  1. Write down the two estimated models.
  2. Verify that the two models give radically different estimates of the effect of mileage on price.
  3. In your own words, try to explain what the \(\beta_3\) parameter means for the two models.
  4. In the two models, the confidence intervals for \(\beta_3\) are far from overlapping. Does this mean that either of the intervals is wrong?
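
To see how two fitted models can give very different intervals for the coefficient of the same variable, a small simulation may help. The sketch below uses synthetic data only: the variable names (price, age, mileage), the data frame `used` and all numbers are hypothetical, and the two models are not the compendium's models A and B.

```r
# Synthetic data. Names and numbers are hypothetical, not from used_cars.csv.
set.seed(2)
n <- 300
age <- runif(n, 2, 15)                       # car age in years
mileage <- 15 * age + rnorm(n, sd = 8)       # in 1000 km, strongly tied to age
price <- 300 - 12 * age - 0.4 * mileage + rnorm(n, sd = 10)
used <- data.frame(price, age, mileage)

# Same response, but age is left out of the first model.
fit_short <- lm(price ~ mileage, data = used)
fit_long  <- lm(price ~ age + mileage, data = used)

# 95% confidence intervals for the mileage coefficient in each model.
ci_short <- confint(fit_short, "mileage", level = 0.95)
ci_long  <- confint(fit_long,  "mileage", level = 0.95)
print(ci_short)
print(ci_long)
```

Comparing the two printed intervals, and thinking about what is held fixed in each model when mileage changes, is a useful warm-up for questions 3 and 4.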

Exercise 6.3 Using a linear model like those in Exercise 6.2 means we assume that the price of an aging car falls by the same amount for each year and for each 1000 km driven. A more realistic description would be that the price falls by a certain percentage each year, and by a certain percentage for each 1000 km driven. Try to formulate a model representing percentage changes, using the three \(X\) variables. You could use \(p, q\) for the average percentage drops in price. Note: The resulting model will be nonlinear!
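
As a warm-up, the difference between a constant drop in amount and a constant drop in percentage can be illustrated with a single variable. The numbers below (starting price, yearly drop, percentage) are hypothetical and chosen only for illustration; extending the idea to all three \(X\) variables is the actual exercise.

```r
# One-variable illustration with made-up numbers.
p0 <- 300            # hypothetical price of a new car
years <- 0:10
drop_amount <- 20    # constant amount lost per year (linear decline)
p <- 10              # constant percentage lost per year

linear_path  <- p0 - drop_amount * years
percent_path <- p0 * (1 - p / 100)^years     # each year keeps 90% of the price

print(data.frame(years, linear_path, percent_path))

# Year-to-year changes: constant in levels for the linear path,
# constant in logs for the percentage path.
print(diff(linear_path))
print(diff(log(percent_path)))
```

The last line suggests why taking logarithms is a natural step once a percentage model has been written down.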