We will work again on the flat prices data, so we start by reading the file:
flats <- read.csv("M:/Undervisning/Undervisningh21/Data/flat_prices.csv")
head(flats)
with(flats, plot(area, price,
main = "Price vs Area - all flats"))
#run regression
pricereg <- lm(price ~ area, data = flats)
#get coefficients
coef(pricereg)
## (Intercept) area
## 93.696532 9.129971
#get confidence interval
confint(pricereg)
## 2.5 % 97.5 %
## (Intercept) 43.637152 143.755912
## area 8.644193 9.615749
with(flats, plot(area, price, main = "Price vs Area - all flats"))
abline(pricereg, col = "blue", lwd = 2)
with(flats, mean(price / area))
## [1] 10.16392
The two values would be equal only if the constant in the regression were 0.
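To see why the ratio comes out above the slope, note that with \(y_i = a + b x_i + e_i\), \[ \frac{y_i}{x_i} = b + \frac{a}{x_i} + \frac{e_i}{x_i}\;, \] so the average square meter price is roughly \(b + a \cdot \overline{(1/x)}\). With a positive intercept \(a \approx 93.7\), the mean ratio must exceed the slope, in line with \(10.16 > 9.13\).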
We can look at the summary of the regression. Another key figure is the \(R^2\), here being about 0.90.
summary(pricereg)
##
## Call:
## lm(formula = price ~ area, data = flats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -244.484 -67.769 1.346 59.306 218.057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.6965 25.3321 3.699 0.000305 ***
## area 9.1300 0.2458 37.140 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.52 on 148 degrees of freedom
## Multiple R-squared: 0.9031, Adjusted R-squared: 0.9024
## F-statistic: 1379 on 1 and 148 DF, p-value: < 2.2e-16
As stated, we find \(R^2 = 0.90\).
We find \(S_e = 88.52\). This number describes prediction precision since e.g. \(2 S_e\) approximates a 95% error margin.
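As a check on what \(S_e\) means, the residual standard error can be computed by hand as \(\sqrt{SSE/(n-2)}\). The sketch below uses simulated data rather than the flats file, so it is self-contained; the numbers are illustrative only.

```r
# Sketch: verify that the residual standard error equals
# sqrt(SSE / (n - 2)), using simulated data.
set.seed(1)
x <- runif(150, 40, 150)
y <- 90 + 9 * x + rnorm(150, sd = 85)
fit <- lm(y ~ x)
n <- length(y)
Se_manual <- sqrt(sum(resid(fit)^2) / (n - 2))
Se_manual
summary(fit)$sigma  # same value
```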
We can insert \(x = 100\) in the estimated equation. Then we get
pred <- 93.7 + 9.13*100
pred
## [1] 1006.7
As said, we can use \(2 S_e\) for the error margin, so we can get a prediction interval as follows
pred + c(-1, 1) * 2 * 88.5
## [1] 829.7 1183.7
Alternatively, using the predict function, we can do
df <- data.frame(area = 100)
predict(pricereg, df, interval = "prediction")
## fit lwr upr
## 1 1006.694 831.1867 1182.201
We get essentially the same result.
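For the curious, the exact interval that predict computes can be reproduced by hand: it uses a t-quantile and the standard error of prediction, which the \(2 S_e\) rule only approximates. A sketch on simulated data (the flats file is not assumed available here):

```r
# Sketch: reproduce predict(..., interval = "prediction") by hand.
set.seed(2)
x <- runif(150, 40, 150)
y <- 90 + 9 * x + rnorm(150, sd = 85)
fit <- lm(y ~ x)
x0 <- 100
Se <- summary(fit)$sigma
n  <- length(x)
# standard error of prediction at x0:
se_pred <- Se * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
fitval <- predict(fit, data.frame(x = x0))
manual <- fitval + c(-1, 1) * qt(0.975, df = n - 2) * se_pred
manual
predict(fit, data.frame(x = x0), interval = "prediction")  # matches manual
```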
OK, my data frame from the previous exercise is called flats, so we go
#make plot, col = "color", pch = "point character",
#we want solid dots (pch = 20)
with(flats, plot(area, price, col = town, pch = 20,
main = "Price vs Area"))
legend("topleft", legend=c("M", "K","A"), col = c(1, 2, 3), pch = 20)
It appears that Ålesund and to some extent Molde have a slightly higher general level, while the slope looks very similar across the three markets.
Here we first convert the town variable into a factor, which makes the plot labels come out as text. This is best done with the recode function from the dplyr library. (Note that this is not necessary for solving the problem, just a cosmetic upgrade :-))
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#redefine as character:
flats$town <- as.character(flats$town)
#recode as names
flats$town <- recode(flats$town, "1" = "Molde", "2" = "Krsund", "3" = "Alesund")
#convert to "factor" type
flats$town <- factor(flats$town)
# make plot, (now we do not need the "as.factor..." code suggested in the exercise.)
ggplot(flats, aes(x = area, y = price, color = town)) +
geom_point(size = 2) +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
The lines are almost parallel, with a vertical shift leaving Molde and Ålesund at a generally higher price level, while the marginal square meter price is almost equal in all three markets.
We run separate regressions by subsetting on town. (This works since we turned town into a factor above.) Note: here it is important to use “==” in the logical expression.
regMolde <- lm(price ~ area, data = subset(flats, town == "Molde"))
regKrsund <- lm(price ~ area, data = subset(flats, town == "Krsund"))
regAlesund <- lm(price ~ area, data = subset(flats, town == "Alesund"))
coef(regMolde)
## (Intercept) area
## 128.686384 9.167775
coef(regKrsund)
## (Intercept) area
## 49.780467 8.954979
coef(regAlesund)
## (Intercept) area
## 82.548319 9.950017
We get respectively 9.17, 8.95, 9.95 estimated for \(\beta_1\) in M, K, A.
The confidence intervals:
confint(regMolde)
## 2.5 % 97.5 %
## (Intercept) 60.284118 197.088651
## area 8.477614 9.857936
confint(regKrsund)
## 2.5 % 97.5 %
## (Intercept) -2.337113 101.898048
## area 8.461546 9.448412
confint(regAlesund)
## 2.5 % 97.5 %
## (Intercept) -5.887579 170.98422
## area 9.086685 10.81335
The confidence intervals tell us where we are likely to find the true parameter values in each town. Since all pairs of intervals overlap, we cannot from this see hard evidence for real differences in the \(\beta_1\) values.
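A more formal way to compare the slopes (not required by the exercise) is to fit one model with an interaction term, price ~ area * town: the area:town coefficients then estimate how much each town's slope deviates from the baseline town, with standard errors attached. A minimal sketch on simulated data, since the flats file is not assumed available here:

```r
# Sketch: test for slope differences via an interaction model.
set.seed(3)
town  <- factor(rep(c("Molde", "Krsund", "Alesund"), each = 50))
area  <- runif(150, 40, 150)
price <- 90 + 9 * area + rnorm(150, sd = 85)
fullreg <- lm(price ~ area * town)
# The area:town rows show estimated slope differences and p-values:
summary(fullreg)
```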
Optional. We can end this section with an example of how one could solve such problems with some modern mechanisms in R. What happens above (and below) is a VERY common task in data analysis. We have a data frame (flats). We want to look at subgroups (by town), and do things to the subgroups, i.e. compute mean values of variables, run a regression on the subsets, and so on. The “%>%” sign is called the “pipe”, and it works by sending the result of a calculation to the next step in a sequence. So below we start with flats and send it to group_by(town), which makes the subgroups. Then we send the subgroups to summarize, which in this case does a few things: computes the min, mean and max price and the number of flats, and reports the \(b_1\) estimate from the above regression. Note that “%>%” is not part of base R, but will work after activating e.g. dplyr (as we did above).
flats %>% group_by(town) %>%
summarize(minprice = min(price),
meanprice = mean(price),
maxprice = max(price),
N = n(),
b_1 = coef(lm(price ~ area))[2])
This example is just meant to show how effectively R can be applied to more or less complex data analysis tasks. Log 708 students are not in general expected to work at this level yet.
Ok, we need to read the data,
wdata <- read.csv("M:/Undervisning/Undervisningh21/Data/WaterWorld.csv")
head(wdata)
The data seems to include a redundant column. We can leave it there, or remove it, e.g. as follows:
#remove column X
wdata <- subset(wdata, select = -X)
head(wdata)
Another way to remove a column would be wdata <- wdata[ , -1], which selects everything except the first column.
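For completeness, a third option is to assign NULL to the column. The toy data frame below is made up for illustration, using the same column name X:

```r
# Sketch: dropping a column by assigning NULL to it.
df <- data.frame(X = 1:3, Price = c(50, 60, 70), Tickets = c(3000, 2800, 2600))
df$X <- NULL
names(df)  # "Price" "Tickets"
```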
with(wdata, plot(Price, Tickets,
main = "Tickets sold depending on price",
pch = 20))
From the scatterplot, a linear model looks reasonable. Economic theory usually suggests a non-linear relation of the form \(y = Cx^{-\alpha}\), where \(C, \alpha\) are constants. (See chap. 8 in the compendium for more about this.) In many applications a linear function is a sufficiently good approximation.
wreg <- lm(Tickets ~ Price, data = wdata)
coef(wreg)
## (Intercept) Price
## 3998.09607 -20.21201
So in the form \(y = a - bx\) we use \(a = 3998, b = 20.21\).
This is the estimated marginal effect of Price on Tickets, so an additional NOK in price is estimated to lead to about 20 fewer tickets sold.
“Mathematically”, the value \(x = 0\) leads to the prediction \(\hat{y} = a\). However, all our data have \(x\) values very far from 0, so a prediction this far outside the observed range is not valid. We cannot use the estimated model at \(x = 0\). Economic reasoning would also point to other “costs” when the price is 0, like waiting times for access, the facility being overcrowded etc., where such generalized costs would limit the demand.
We can ask our regression object about it. Either via a full summary:
summary(wreg)
##
## Call:
## lm(formula = Tickets ~ Price, data = wdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -173.96 -60.45 -29.14 59.42 192.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3998.096 104.214 38.36 <2e-16 ***
## Price -20.212 1.047 -19.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92.96 on 33 degrees of freedom
## Multiple R-squared: 0.9187, Adjusted R-squared: 0.9162
## F-statistic: 372.9 on 1 and 33 DF, p-value: < 2.2e-16
Or being more selective:
results <- summary(wreg)
results$r.squared
## [1] 0.9187001
results$sigma
## [1] 92.96295
Either way, we get \(R^2 = 0.92\), and \(S_e = 92.96\).
We use the predict function to get the prediction along with a prediction interval.
testprice <- data.frame(Price = 110)
testpredict <- predict(wreg, testprice, interval = "prediction")
testpredict
## fit lwr upr
## 1 1774.775 1581.382 1968.169
From the difference between lwr and upr we find that R uses an error margin of \(0.5 \cdot (1968 - 1581) = 193.5\). So our approximation \(2 S_e \approx 186\) is slightly too low, but still acceptable.
This is fairly straightforward, as each customer generates expected income \(x + s\) and expected cost \(c\). Multiplying by the expected number of customers (as given by \(y=a-bx\)), we get expected operational income \(R(x)\) and operational cost \(C(x)\) as follows. \[ R(x)= y(x+s)=(a-bx)(x+s)=ax+as-bx^2-bsx\;, \] \[ C(x)=yc+f=(a-bx)c+f=ac-bcx+f\;. \] Then the operational profit is \[\begin{align*} P(x) & = R(x)- C(x) \\ &= ax+as-bx^2-bsx-(ac-bcx+f) \\ &= -bx^2+(a-bs+bc)x+as-ac-f \end{align*}\]
The profit is a quadratic function of \(x\). We maximize it simply by solving the equation \(P^\prime(x) = 0\). This becomes \[ P^\prime(x) = - 2bx + (a - bs + bc ) = 0\;, \] for which the only solution is \(x= x^*\) as given in the exercise.
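For reference, solving the first-order condition explicitly gives \[ x^* = \frac{a - bs + bc}{2b} = \frac{a}{2b} + \frac{c - s}{2}\;, \] which should match the formula stated in the exercise.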
Using the expression for \(x^*\) with the given values for \(s, c\) and \(f\), as well as the previously found estimates for \(a, b\), we find the optimal price at \(x^* = 96.40\). The expected number of customers at this price is 2050, and the expected profit is \(P(x^*) = 57\,826\). Setting the price higher, at \(x = 110\), would lead to an expected profit about 3740 NOK per week below the optimum.
Realistic prices are then found by rounding down to 95 or up to 100. Compared to the optimum, we can expect to lose only about 40 NOK per week at price 95, while we lose about 260 NOK per week at price 100. So \(x = 95\) would be the suggested price.
We can round off here with an example of how we might use R to assist in calculations such as those in i. and j. We can use the function construction to define the profit as a function of a variable x. This function can then be used with various inputs, and we can make a plot. We will use the general parameters \(a, b, \ldots\) inside the function, to make it more general.
#set parameter values.
a <- 3998
b <- 20.21
s <- 10
c <- 5
f <- 150000
#define function "P(x)" as deduced in g.
P <- function(x) { -b*(x^2) + (a - b*s + b*c)*x + a*s - a*c - f }
#compute a sample of values for an x vector:
x <- seq(from = 85, to = 115, by = 5)
data.frame(price = x, tickets = a - b*x, profit = P(x))
So at the (close to) optimal price 95, we expect to sell 2078 tickets, earning a profit of about 57800 NOK. We can plot the function:
curve(P(x), from = 40, to = 160, n = 100,
lwd = 2, col = "green",
main = "Profit as depending on ticket price",
sub = "Based on a demand regression model")
abline(h = 0, lwd = 2)
abline(v = 96.4, lty = 2)
So here we see the “whole story” in one picture: profits occur in a fairly wide price range, with a theoretical optimum at 96.4 NOK (dashed line).
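As a cross-check of the calculus, R's built-in optimize function can locate the maximum numerically. The parameter values below are those used above; the exact profit figure will differ slightly from the hand calculation due to rounding of \(a\) and \(b\).

```r
# Sketch: find the profit-maximizing price numerically with optimize().
a <- 3998
b <- 20.21
s <- 10
c <- 5
f <- 150000
P <- function(x) { -b*(x^2) + (a - b*s + b*c)*x + a*s - a*c - f }
opt <- optimize(P, interval = c(40, 160), maximum = TRUE)
opt$maximum    # about 96.4, agreeing with the formula for x*
opt$objective  # the maximal weekly profit
```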