Chapter 5 Basic Regression Analysis

Exercise 5.1 Consider the basic model assumptions C and D for regression. Try to think of situations with variables \(X\) and \(Y\) where we may see violations of assumptions C and D, respectively. (Hint: Often the problem comes from using a wrong (too simple) model. Think about how a scatterplot might look if C or D fails.) We restate the conditions here:

We consider the linear model \[ Y_i = \beta_0 + \beta_1 X_i + \mathcal{E}_i \;. \] With this formulation, we can state the basic model assumptions.

C. The error term is statistically independent of \(X\).

D. For all \(i\), the standard deviation of the error term is the same, \(\mbox{Sd}[\mathcal{E}_i] = \sigma_e\).
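To get a feel for what the hint is pointing at, the following is a minimal simulation sketch. All numbers are made up for illustration and are not taken from any data set in this chapter. The first case lets the error spread grow with \(X\) (a violation of D); the second fits a straight line to a curved relation, so that the leftover errors depend on \(X\) (a violation of C).

```r
# Simulated illustration only: the coefficients and sample size are arbitrary.
set.seed(42)
n <- 200
x <- runif(n, 1, 10)

# Case 1: the error standard deviation grows with x, so assumption D fails.
e_fan <- rnorm(n, mean = 0, sd = 0.5 * x)
y_fan <- 2 + 3 * x + e_fan

# Case 2: the true relation is curved, but we fit a straight line.
# The curvature the line cannot capture ends up in the residuals,
# which then vary systematically with x, so assumption C fails.
y_curve <- 2 + 0.5 * x^2 + rnorm(n, sd = 1)
res_curve <- resid(lm(y_curve ~ x))

# Side-by-side plots: a fan shape (left) and a U-shaped residual pattern (right).
op <- par(mfrow = c(1, 2))
plot(x, y_fan, pch = 20, main = "Spread grows with x")
plot(x, res_curve, pch = 20, main = "Residuals from a line fit")
abline(h = 0)
par(op)
```

These are only two of many possible patterns; the exercise asks you to think of real situations that could produce pictures like these.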

Exercise 5.2 Open the file flat_prices.csv, which we have worked on earlier. We will start by studying the relation between flat sizes and selling prices. Remember that the prices in the data do not reflect current price levels.

  1. Let us first keep it simple, and ignore the fact that the prices are collected in three towns with potentially different housing market parameters. Start by producing a scatterplot of area vs price.
  2. Run a regression with prices as dependent variable, and square meter size as the independent variable. Find the estimated coefficients \(b_0 ,b_1\), and 95% confidence intervals for the true regression parameters \(\beta_0, \beta_1\).
  3. Reproduce the scatterplot from part 1 and add the regression line estimated in part 2.
  4. How would you interpret the value of the parameter \(\beta_1\) (or the estimate \(b_1\)) in this regression model? (Hint: what would be the difference between a flat of size 93 and a flat of size 94 square meters according to the model?)
  5. What part of the regression results shows that the square meter size really does have an impact on the prices? You may be able to point out two related ways to see this from the output.
  6. What is the \(R^2\) value for this regression? What does this number tell us about the impact of square meter size on selling prices?
  7. What is the value of the error term standard deviation \(S_e\) for this regression? Try to explain what this number represents in the regression.
  8. What would be your best guess for the price of a 100 \(m^2\) flat? How can the \(S_e\) from the previous question be used to evaluate the precision of your guess?
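The steps above map onto a handful of base R calls. The sketch below shows one possible workflow, not a solution: it runs on simulated stand-in data with arbitrary numbers, and it assumes the column names `area` and `price` that appear in the plotting code of Exercise 5.3. The "plus or minus two \(S_e\)" interval at the end is a rough rule of thumb that ignores the uncertainty in the estimated coefficients.

```r
# Simulated stand-in for the real data (made-up numbers, arbitrary units).
# For the exercise itself, read the file instead, e.g.
#   flp <- read.csv("flat_prices.csv")   # the separator may need adjusting
set.seed(1)
flp <- data.frame(area = runif(60, 40, 160))
flp$price <- 300 + 10 * flp$area + rnorm(60, sd = 150)

# Fit the simple regression and look at estimates and 95% confidence intervals.
reg <- lm(price ~ area, data = flp)
coef(reg)
confint(reg, level = 0.95)

# Scatterplot with the fitted line added.
plot(flp$area, flp$price, pch = 20, xlab = "area", ylab = "price")
abline(reg)

# R^2 and the residual standard error (S_e) are stored in the summary object.
s <- summary(reg)
s$r.squared
s$sigma

# Point prediction for a 100 square meter flat, with a rough +/- 2 * S_e band.
pred <- predict(reg, newdata = data.frame(area = 100))
rough_interval <- pred + c(-2, 2) * s$sigma
rough_interval
```

Calling `summary(reg)` directly prints the coefficient table with standard errors, t-values and P-values, which is the output referred to in parts 5 to 7.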

Exercise 5.3 Continue working on the flat prices data. We will do a little preliminary analysis to see whether there are any detectable systematic differences between the housing markets in the three towns involved in the data. We recall that the coding of the variable town is as follows:

  • town = 1 for Molde
  • town = 2 for Kristiansund
  • town = 3 for Ålesund
  1. Use the following code to make a scatterplot of prices vs square meter sizes, with different color dots for Molde, Kristiansund and Ålesund. The code assumes your dataframe is called flp; change that to whatever you call it.
# Make the plot: col sets the color, pch the point character.
# We want solid dots (pch = 20), colored by town code (1, 2, 3).
with(flp, plot(area, price, col = town, pch = 20, 
              main = "Price vs Area")) 
legend("topleft", legend=c("M", "K","A"), col = c(1, 2, 3), pch = 20)

Do you see anything indicating a difference in the three markets?

  2. We want to have a similar plot with one regression line for each town. This can be achieved with base R plotting, but is much better done with ggplot. Try the code below. You need to run install.packages("ggplot2") once to make this work.
library(ggplot2)
ggplot(flp, aes(x = area, y = price, color = as.factor(town))) +
  geom_point(size = 2) + 
  geom_smooth(method = lm, se = FALSE)

Do we see any interesting differences in the lines?

  3. Use the subset option to run regression analysis on the price vs size relation, with separate estimates for each town. Make sure to produce 95% confidence intervals for the coefficients. E.g. for Molde, you can do
#run regression for only Molde data, look at coefs and confidence intervals.
regMolde <- lm(price ~ area, data = subset(flp, town == 1))
coef(regMolde)
confint(regMolde)
  4. Find the estimated (marginal) square meter price (the estimate \(b_1\) of \(\beta_1\)) in each town.
  5. Just judging from the confidence intervals, do we find evidence for any real differences in the square meter prices between any pair of towns?
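To compare the towns it can help to collect the slope estimates and their confidence intervals in one table rather than reading them off three separate outputs. The sketch below is one way to do that. It uses simulated stand-in data with arbitrary slopes (not the real flat prices), the labels M, K, A from the plot legend above, and plain row indexing instead of subset so that it also works inside a function.

```r
# Simulated stand-in data: three towns with made-up, slightly different slopes.
set.seed(7)
flp <- data.frame(
  town = rep(1:3, each = 40),
  area = runif(120, 40, 160)
)
true_slope <- c(10, 8, 12)[flp$town]   # hypothetical values, illustration only
flp$price <- 300 + true_slope * flp$area + rnorm(120, sd = 100)

# Town codes as in the exercise: 1 = Molde, 2 = Kristiansund, 3 = Ålesund.
towns <- setNames(1:3, c("M", "K", "A"))

# For each town: fit the regression, keep the slope and its 95% interval.
ci_area <- t(sapply(towns, function(k) {
  reg <- lm(price ~ area, data = flp[flp$town == k, ])
  c(b1 = unname(coef(reg)["area"]), confint(reg)["area", ])
}))
ci_area
```

Each row of `ci_area` holds the slope estimate and the lower and upper confidence limits for one town, so overlapping and non-overlapping intervals are easy to spot.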

Exercise 5.4 One aspect of Supply Chain Management is pricing decisions, i.e. setting prices of products and services that are optimal for your business. Here is an example that illustrates how regression analysis can be used as a tool to support such decision making. Suppose you are the manager of WaterWorld, a center with all kinds of water-sports and comfort facilities. For simplicity, let's assume there is one single admission price, called \(x\). If you set \(x\) too high, no customers will come; if you set it too low, many customers will come, but you will not earn any profit from them. The crucial relation is between the price on the one hand and the demand on the other. Demand in this case means the number of tickets sold in a period. Assume the price \(x\) is kept fixed for one week at a time, and that we observe the demand \(y\) in the corresponding week. We assume here a linear relation in the regression sense, \[ y = \beta_0 + \beta_1 x + \mathcal{E} \;. \] In the data file `WaterWorld.csv` you find data for demand (number of tickets sold) and prices for 35 weeks. Assume WaterWorld is in Norway, with prices in NOK. These data can be used to estimate a linear relation in line with the model.

  1. Make a scatterplot with regression line to visualize the relation. Comment on the use of a linear model. What would economic theory suggest for the relation in this case?
  2. Write \(y = a - bx\) for the estimated regression line. Find values for \(a\) and \(b\).
  3. How do you interpret the estimate \(b\)?
  4. What can the estimated model tell us about a situation where we set the price to 0? Would we expect to “sell” \(a\) tickets at 0 NOK?
  5. Comment on the values of \(R^2\) and \(S_e\). What do they tell us about the relation?
  6. Find a prediction for the number of sold tickets and an approximate 95% margin of error if we set the price to \(x=110\).
  7. Let \(P(x)\) denote the expected profit. (Here we mean operational profit, i.e. before including interest, taxes and other financial costs or income in the accounting.) This profit is simply the operational income minus the operational cost. Assume the following situation:
  • The expected number of customers is as estimated by the regression, i.e. \(y = a - bx\).
  • Each customer generates an expected income of \(x + s\), where \(s\) is additional income from sale of food and drinks inside WaterWorld.
  • Each customer generates an expected cost of \(c\), due mainly to energy cost for showers.
  • There is a fixed cost \(f\) covering salaries, energy, etc. that is independent of the number of customers.
  • These are the only operational income and cost factors.

Show that \(P(x)\) can be expressed as \[ P(x) = - bx^2 + (a - bs + bc) x + as - ac - f \]
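As a sketch of how the pieces fit together, using only the assumptions listed above: the expected profit is the expected number of customers times the net income per customer, minus the fixed cost. Expanding the product gives

\[
\begin{aligned}
P(x) &= (a - bx)(x + s) - c\,(a - bx) - f \\
     &= (a - bx)(x + s - c) - f \\
     &= ax + as - ac - bx^2 - bsx + bcx - f \\
     &= -bx^2 + (a - bs + bc)\,x + as - ac - f \;.
\end{aligned}
\]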

  8. Try to show that the profit \(P(x)\) is maximized by \[ x = x^* = \frac{a - bs + bc}{2b}\;. \]
  9. Assuming that \(s = 10, c = 5, f = 150000\), find the optimal price \(x^*\), as well as the expected number of customers and the expected profit, given that price. What would be lost if using \(x=110\) instead of \(x^*\)? You need no advanced math.
  10. The price \(x^*\) may not work well in practice because it is a decimal number. Suppose we want the price to be some whole multiple of 5 NOK. Which is more profitable: rounding \(x^*\) up or down to the nearest multiple of 5?
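The last parts of the exercise can be checked numerically once \(a\) and \(b\) are known. The sketch below uses hypothetical values for \(a\) and \(b\) (they are not the estimates from WaterWorld.csv; plug in your own), together with \(s\), \(c\) and \(f\) from the exercise text. It compares the closed-form \(x^*\) with a numerical maximization and evaluates the profit at the two neighboring multiples of 5. The cost per customer is stored as `c_cost` to avoid masking R's `c()` function.

```r
# Hypothetical demand-line estimates: replace with a and b from your regression.
a <- 6000
b <- 32

# Values given in the exercise text.
s <- 10
c_cost <- 5
f <- 150000

# Expected profit: customers times net income per customer, minus fixed cost.
profit <- function(x) (a - b * x) * (x + s - c_cost) - f

# Closed-form maximizer from the exercise.
x_star <- (a - b * s + b * c_cost) / (2 * b)

# Numerical check: maximize over prices between 0 and the zero-demand price a / b.
opt <- optimize(profit, interval = c(0, a / b), maximum = TRUE)
c(formula = x_star, numerical = opt$maximum)

# Expected customers and profit at the optimal price.
a - b * x_star
profit(x_star)

# Profit given up by charging 110 instead of x_star.
profit(x_star) - profit(110)

# Nearest multiples of 5 below and above x_star, and the profit at each.
lower <- 5 * floor(x_star / 5)
upper <- 5 * ceiling(x_star / 5)
c(lower = profit(lower), upper = profit(upper))
```

Because \(P(x)\) is a parabola symmetric around \(x^*\), the multiple of 5 closer to \(x^*\) always gives the higher profit; the numerical comparison is just a way to confirm that.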