t_obs <- (319 - 290)/(100/sqrt(400))
t_obs
## [1] 5.8
The observed value is 5.8. Now, since the test is two-sided, we get the P-value as \[P = P[T \leq -5.8 \mbox{ or } T \geq 5.8] = 2\cdot P[T \leq -5.8]\]
pval <- 2*pnorm(-5.8)
pval
## [1] 6.631492e-09
As expected from the observed T value, the P-value is practically 0. We reject \(H_0\) at the given level.
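Equivalently, we could compare the observed value to a critical value. A small sketch, assuming the level is \(\alpha = 0.05\) (the exact level from the exercise is not repeated here):
qnorm(1 - 0.05/2)  # two-sided critical value for an assumed alpha = 0.05
## [1] 1.959964
Since 5.8 is far beyond 1.96, the rejection is clear.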
The P-value is now \(P = 2 \cdot P[Z \leq -3.0] = 0.003\), so again we reject the null hypothesis. The proportion of cash-paying customers is likely to be quite a bit lower than the suggested 0.20.
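For reference, this P-value can be computed in the same way as before (only the observed \(Z\) value is repeated here; the underlying counts are in the exercise):
pval <- 2*pnorm(-3.0)
pval
## [1] 0.002699796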
We start by reading the data and looking at the top rows:
tripdata <- read.csv("M:/Undervisning/Undervisningh21/Data/Trip_durations.csv")
head(tripdata)
n <- nrow(tripdata)
x_bar <- mean(tripdata$Duration)
s_x <- sd(tripdata$Duration)
#print the three numbers as a vector (just to save space in output)
c(n, x_bar, s_x)
## [1] 200.000000 24.310500 6.835471
So \(n = 200, \bar{x} = 24.31, S_x = 6.84\).
t_obs <- (x_bar - 24)/ (s_x / sqrt(n))
t_obs
## [1] 0.6424039
We get \(T_{obs} = 0.64\). The test is two-sided, so we get the P-value as follows: \[ P = P[T \leq -0.64 \mbox{ or } T \geq 0.64] = 2\cdot P[T \leq -0.64]. \] Using R and the \(N(0,1)\) distribution, we get
p_value <- 2*pnorm(-0.64)
p_value
## [1] 0.5221726
The \(P\)-value is about 0.52, and with significance level \(\alpha = 0.05\) we cannot reject \(H_0\).
We can compare with the t.test function from R.
durtest1 <- t.test(tripdata$Duration, mu = 24, alternative = "two.sided")
durtest1$statistic
## t
## 0.6424039
durtest1$p.value
## [1] 0.5213503
We see that we get almost identical values. The small difference in the P-values arises because t.test uses the t distribution with \(n - 1 = 199\) degrees of freedom, while we used the standard normal approximation.
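A quick check confirms this; reusing t_obs and n from above, the t distribution reproduces the t.test result exactly:
2*pt(-t_obs, df = n - 1)  # two-sided P-value from the t distribution
## [1] 0.5213503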
t_obs <- (x_bar - 23)/ (s_x / sqrt(n))
t_obs
## [1] 2.711338
Using R and the \(N(0,1)\) distribution, we get a new \(P\)-value:
p_value <- 2*pnorm(-2.71)
p_value
## [1] 0.006728321
And in this case we clearly reject \(H_0\). We can confirm this by testing directly with R:
durtest2 <- t.test(tripdata$Duration, mu = 23, alternative = "two.sided")
durtest2$statistic
## t
## 2.711338
durtest2$p.value
## [1] 0.007287267
Again, the results are very close, and the same conclusion applies.
disttest <- t.test(tripdata$Distance, mu = 10, alternative = "greater")
#now we can check the whole output.
disttest
##
## One Sample t-test
##
## data: tripdata$Distance
## t = 1.9522, df = 199, p-value = 0.02616
## alternative hypothesis: true mean is greater than 10
## 95 percent confidence interval:
## 10.06158 Inf
## sample estimates:
## mean of x
## 10.40115
We get \(T_{obs} = 1.95\) and a resulting \(P\)-value at 0.026. This is below 0.05, so we reject \(H_0\).
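As a manual check (a sketch, reusing n from above; the variable names t_obs_d etc. are just illustrative), note that for the one-sided alternative we take the upper tail only, with no doubling of the probability:
x_bar_d <- mean(tripdata$Distance)
s_d <- sd(tripdata$Distance)
t_obs_d <- (x_bar_d - 10)/(s_d/sqrt(n))
1 - pt(t_obs_d, df = n - 1)  # upper tail only; reproduces the 0.02616 from t.test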
Since t.test computes confidence intervals by default, and also the mu value is irrelevant for the confidence intervals, we can simply do
dur_test <- t.test(tripdata$Duration)
dist_test <- t.test(tripdata$Distance)
dur_test$conf.int
## [1] 23.35737 25.26363
## attr(,"conf.level")
## [1] 0.95
dist_test$conf.int
## [1] 9.995949 10.806351
## attr(,"conf.level")
## [1] 0.95
In the case of the duration variable, we see that the interval contains 24, but not 23, which explains the different results in b and c versus d. Regarding the distance variable, the 95% confidence interval reaches from practically 10.00 to 10.80, indicating that \(\mu\) is greater than 10.00, although the evidence is relatively weak. We also see this in the \(P\)-value, which was not much lower than 0.05.
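The duality between tests and confidence intervals can also be checked by computing the duration interval by hand, using the numbers from earlier:
# 95% CI for mean duration: x_bar +/- t-quantile times the standard error
x_bar + c(-1, 1)*qt(0.975, df = n - 1)*s_x/sqrt(n)
## [1] 23.35737 25.26363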
with(tripdata, cor(Duration, Distance))
## [1] 0.9461513
with(tripdata, plot(Distance, Duration,
main = "Duration vs Distance for road trips."))
The correlation is about 0.95, and the plot shows a strong relationship between the variables.
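If we want to emphasize the linear tendency in the plot, one option is to overlay a fitted regression line (an optional sketch):
with(tripdata, plot(Distance, Duration,
                    main = "Duration vs Distance for road trips."))
abline(lm(Duration ~ Distance, data = tripdata))  # add the least squares line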
flats <- read.csv("M:/Undervisning/Undervisningh21/Data/flat_prices.csv")
head(flats)
From the (updated) exercise text, we find the encoding for town
as (1, 2, 3) for (Molde, Kristiansund, Ålesund).
We can use the tapply function to compute the means in the towns as follows.
with(flats, tapply(price, town, mean))
## 1 2 3
## 1008.122 949.500 1064.538
We see some clear price differences on average.
with(flats, boxplot(price ~ town))
This points clearly to the same differences, while also showing a somewhat more spread-out distribution in Ålesund.
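Before reading too much into the differences, it can also be worth checking the group sizes and standard deviations with the same tapply approach (an optional sketch; output not shown here):
with(flats, tapply(price, town, length))  # number of flats per town
with(flats, tapply(price, town, sd))      # price spread within each town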
\[H_0: \mu_M = \mu_K \ \mbox{ vs } \ H_1: \mu_M \ne \mu_K \] where the \(\mu\)’s are the means in Molde and Kristiansund. We use the t.test
function as suggested, after excluding town 3 from the data. The “two.sided” alternative is the default, and need not be specified.
flats_MK <- subset(flats, town != 3)
with(flats_MK, t.test(price ~ town))
##
## Welch Two Sample t-test
##
## data: price by town
## t = 1.1339, df = 99.663, p-value = 0.2596
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.95448 161.19839
## sample estimates:
## mean in group 1 mean in group 2
## 1008.122 949.500
# or: t.test(flats_MK$price ~ flats_MK$town)
The relatively high P-value means we cannot reject the null hypothesis.
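A side note: t.test runs Welch’s version by default, which does not assume equal variances in the two groups. If we were willing to assume equal variances, the classical pooled two-sample test is obtained as follows (a sketch; the conclusion should be similar here):
with(flats_MK, t.test(price ~ town, var.equal = TRUE))  # pooled-variance version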
For Kristiansund and Ålesund, we run the same procedure:
flats_KA <- subset(flats, town != 1)
with(flats_KA, t.test(price ~ town))
##
## Welch Two Sample t-test
##
## data: price by town
## t = -1.9751, df = 82.279, p-value = 0.05161
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -230.8975312 0.8206081
## sample estimates:
## mean in group 2 mean in group 3
## 949.500 1064.538
Supposing the significance level is 0.05, the P-value here is right at the limit, but still does not lead to rejection of the null hypothesis.
We now look at the area variable, and can use the method from a):
with(flats, tapply(area, town, mean))
## 1 2 3
## 95.92683 100.47143 98.69231
So, on average, the sample has flats of somewhat different sizes.
with(flats, cor(price, area))
## [1] 0.9503176
The correlation is strong, at about 0.95. A scatterplot is also helpful here.
with(flats, plot(area, price))
That looks more or less as expected.
flats$sqmprice <- flats$price / flats$area
head(flats)
We can now repeat the comparison using the new sqmprice variable. So we can do
flats_MK <- subset(flats, town != 3)
with(flats_MK, t.test(sqmprice ~ town))
##
## Welch Two Sample t-test
##
## data: sqmprice by town
## t = 6.6785, df = 75.642, p-value = 3.581e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7935811 1.4681197
## sample estimates:
## mean in group 1 mean in group 2
## 10.63212 9.50127
The P-value is almost 0, and much less than 0.05, so we reject the null hypothesis. We conclude that square meter prices are significantly higher in Molde than in Kristiansund.
When comparing nominal prices, we do not take into account that while the prices were on average higher in Molde, the flats were also smaller in Molde. So, comparing nominal prices can be misleading when we don’t control for the fact that sizes differ on average. Looking at square meter prices is one way to make a more “fair” comparison as the size is taken into account.
So, we can make a few more subsets, and run similar tests:
flats_KA <- subset(flats, town != 1)
flats_MA <- subset(flats, town != 2)
with(flats_KA, t.test(sqmprice ~ town))
##
## Welch Two Sample t-test
##
## data: sqmprice by town
## t = -8.0705, df = 72.688, p-value = 1.066e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.695627 -1.023977
## sample estimates:
## mean in group 2 mean in group 3
## 9.50127 10.86107
with(flats_MA, t.test(sqmprice ~ town))
##
## Welch Two Sample t-test
##
## data: sqmprice by town
## t = -1.1576, df = 77.974, p-value = 0.2505
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6226931 0.1647894
## sample estimates:
## mean in group 1 mean in group 3
## 10.63212 10.86107
We see \(H_0\) rejected when comparing Kristiansund and Ålesund, but not when comparing Molde and Ålesund.
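As an aside, when running several pairwise tests like this, R’s pairwise.t.test function can do all comparisons in one call, with an adjustment for multiple testing. A sketch (note that pairwise.t.test pools the standard deviations by default, unlike the Welch tests above):
with(flats, pairwise.t.test(sqmprice, town, p.adjust.method = "holm"))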
As a final remark, when working with categorical variables like town
here, it can be worthwhile to convert to a “factor”. Some code for this is shown in section 7.5 in the compendium.
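A minimal sketch of such a conversion (the labels are taken from the encoding given in the exercise text):
flats$town <- factor(flats$town, levels = c(1, 2, 3),
                     labels = c("Molde", "Krsund", "Alesund"))
If we do this and rerun question a, we get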
with(flats, tapply(price, town, mean))
## Molde Krsund Alesund
## 1008.122 949.500 1064.538
So instead of constantly trying to remember what 1, 2, 3 stood for, we now get the actual names in the output.