Chapter 4 Hypothesis testing

In all tests, use significance level $\alpha = 0.05$ unless otherwise specified.

Exercise 4.1 In a supermarket, the manager tries to forecast and plan operations for the next quarter. One assumption is that the mean spending $\mu$ per customer is 290 NOK. A second assumption is that only 20% of customers pay with cash, while 10% of customers make a cash withdrawal using a bank card. In a sample of 400 customers, the average spending was found to be $\bar{x} = 319$ NOK, with a standard deviation of $S_x = 100$. Also in the sample, 56 of the 400 customers paid with cash. We can help the manager test her assumptions, formulated as null hypotheses.

a) With $\mu$ as above, test \[ H_0: \mu = 290 \ \mbox{vs.} \ H_1: \mu \ne 290\;, \] using 0.05 as significance level. (This is a one-sample t-test; use $N(0,1)$ as the approximate null distribution.)
b) Let $p$ be the proportion of customers paying cash. Test \[ H_0: p = 0.20 \ \mbox{vs.} \ H_1: p \ne 0.20. \]
c) Suppose 10% of customers make a cash withdrawal, and that their average withdrawal is 400 NOK. Suppose 20% of customers pay with cash. Would you agree that these figures, together with the average spending for all customers, imply that the supermarket will generally have a positive cash balance (so as to be able to give change, and to accommodate withdrawals)?
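
The two tests in this exercise can be computed directly from the summary statistics in the text. The following is a minimal sketch in R, not a prescribed solution. The variable names are chosen here for illustration, and the z-statistic for the proportion (with the standard error computed under $H_0$) is one common choice that the exercise does not itself specify.

```r
# Summary statistics quoted in the exercise text
n    <- 400
xbar <- 319   # sample mean spending (NOK)
s_x  <- 100   # sample standard deviation
mu0  <- 290   # hypothesised mean under H0

# One-sample test statistic, with N(0,1) as approximate null distribution
t_obs  <- (xbar - mu0) / (s_x / sqrt(n))
p_mean <- 2 * pnorm(-abs(t_obs))   # two-sided P-value

# Proportion paying cash: z-statistic with the standard error under H0
# (assumption: normal approximation; an exact binomial test is an alternative)
x_cash <- 56
p0     <- 0.20
p_hat  <- x_cash / n
z_obs  <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_prop <- 2 * pnorm(-abs(z_obs))   # two-sided P-value

c(t_obs = t_obs, p_mean = p_mean, z_obs = z_obs, p_prop = p_prop)
```

Each P-value is then compared with $\alpha = 0.05$ to decide whether to reject the corresponding $H_0$.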

Exercise 4.2 Find the file Trip_durations.csv. These data were used as an example in section 4.5 of the compendium. Read the data into a dataframe. Let $\mu_d$ be the (true) mean duration of trips, and let $\mu_s$ be the (true) mean distance of trips.

a) Focus on the Duration variable. Use R to find the sample size $n$, mean $\bar{x}$ and standard deviation $S_x$ for the sample in the data file.
b) Suppose we want to test \[ H_0: \mu_d = 24 \ \mbox{vs.} \ H_1: \mu_d \ne 24 \] based on the data. Write an expression for the test statistic $T$. What is the null distribution of $T$? Compute the observed value $T_{_{OBS}}$ without using the t.test function in R. Using $N(0,1)$ as approximate null distribution, find the P-value for the test. Finally, if the significance level is $\alpha = 0.05$, should we reject $H_0$?
c) Run the one-sample t-test for the above hypotheses in R. Compare the $T_{_{OBS}}$ value and the P-value you found in b) with those reported by R. They should be very similar.
d) Repeat steps b) and c) for the hypotheses \[ H_0: \mu_d = 23 \ \mbox{vs.} \ H_1: \mu_d \ne 23\;. \]
e) Now, focus on the distance variable. Does the sample provide significant evidence that $\mu_s > 10$? Use significance level $\alpha = 0.05$. Formulate the appropriate hypotheses and use R to find the P-value.
f) Use R to find 95% confidence intervals for the means $\mu_d$ and $\mu_s$. Do you see a connection between the conclusions in the tests and the confidence intervals?
g) One would expect a very strong correlation between the distance and duration of a trip. Find the correlation coefficient, and make a graphical display of the relation between the variables.
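
The workflow in this exercise can be sketched as follows. Since the layout of Trip_durations.csv is not shown here, the sketch uses simulated data. The dataframe name trips, the column names Duration and Distance, and all the numbers are assumptions for illustration and will not match the real file.

```r
# Simulated stand-in for the trip data; the column names Duration and
# Distance are assumptions and may differ in Trip_durations.csv.
set.seed(1)
trips <- data.frame(Distance = runif(200, min = 2, max = 20))
trips$Duration <- 2 * trips$Distance + rnorm(200, mean = 0, sd = 4)

# Sample size, mean and standard deviation of Duration
n    <- length(trips$Duration)
xbar <- mean(trips$Duration)
s_x  <- sd(trips$Duration)

# Test statistic for H0: mu_d = 24, computed by hand
mu0    <- 24
t_obs  <- (xbar - mu0) / (s_x / sqrt(n))
p_norm <- 2 * pnorm(-abs(t_obs))   # two-sided P-value, N(0,1) approximation

# The same test with t.test, which uses the t distribution with n - 1 df
res <- t.test(trips$Duration, mu = mu0)
c(by_hand = t_obs, t.test = unname(res$statistic))
c(normal = p_norm, t = res$p.value)
res$conf.int                       # 95% confidence interval for the mean

# One-sided alternative, H1: mu_s > 10, for the distance variable
t.test(trips$Distance, mu = 10, alternative = "greater")$p.value

# Correlation coefficient and scatterplot
cor(trips$Distance, trips$Duration)
plot(Duration ~ Distance, data = trips)
```

With the real data, the simulated dataframe would be replaced by the result of read.csv on the file, and the column names adjusted to those actually in the file.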

Exercise 4.3 Find the file flat_prices.csv. Save a copy. The file has data for sold flats in Molde, Kristiansund and Ålesund. The data is somewhat dated, and does not reflect current market prices. That is not important. We will try some two-sample t-tests to compare prices in the three towns. In all tests, use $\alpha = 0.05$ for the significance level.

a) Take some time to familiarize yourself with the data. Figure out how the three towns are encoded by 1, 2, 3. Find the mean prices in the three towns. (Hint: in R, tapply(x, g, f) applies the function f to the variable x for each group defined by g.)
b) Make a boxplot for the prices in the three towns. (Hint: with(DF, boxplot(x ~ g)) makes a boxplot of x for each group in g, using data from a dataframe DF.)
c) Compare the mean selling price in Molde and Kristiansund, using a two-sample t-test in R. Use the two-sided alternative. You should find that the samples do not provide significant evidence of different mean prices. (Hint: the t.test function in R requires the grouping variable to have only two values, but town has three. Use subset(DF, town != 3) to make a new dataframe with only the Molde and Kristiansund flats, and use this in t.test as described in the text.) Try the same with Kristiansund and Ålesund. In the latter case, you come close to rejecting $H_0$, but not quite.
d) Find the sample mean square meter sizes in the three towns. Which town has, on average, the largest flats in the sample? (Hint: the same approach as in a).)
e) Study the general correlation between square meter size and selling price in the sample. Make a scatterplot to visualize. (Hint: the function cor is useful.)
f) Make a new variable, sqmprice, that measures the square meter price, i.e. the selling price divided by the square meter size. (Hint: DF$z <- DF$x / DF$y computes the ratio of x to y and saves it in a column z of a dataframe DF.)
g) Now compare the mean square meter prices in Molde and Kristiansund with a two-sample t-test. (Hint: the same approach as in c), just with a different variable.)
h) Try to explain why the results become so different when comparing nominal prices as opposed to comparing square meter prices. (There are at least two reasons involved in this case.)
i) Compare Kristiansund vs. Ålesund, and Molde vs. Ålesund in terms of mean square meter prices.
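
The hints in this exercise combine as in the sketch below. It runs on simulated data, because the column names in flat_prices.csv are not listed here. The dataframe name flats, the columns price, area and town, and all the numbers are assumptions for illustration only.

```r
# Simulated stand-in for the flat data; the column names price, area and
# town, and all the numbers, are assumptions made for illustration only.
set.seed(2)
flats <- data.frame(
  town = rep(1:3, each = 50),
  area = round(runif(150, min = 40, max = 140))
)
flats$price <- 15 * flats$area + rnorm(150, mean = 0, sd = 200)

# Group means and a boxplot per town
tapply(flats$price, flats$town, mean)
with(flats, boxplot(price ~ town))

# A two-sample t-test needs exactly two groups, so drop town 3 first
flats12 <- subset(flats, town != 3)
t.test(price ~ town, data = flats12)

# Square meter price as a new column, then the same comparison
flats$sqmprice <- flats$price / flats$area
t.test(sqmprice ~ town, data = subset(flats, town != 3))

# Correlation and scatterplot of size against price
cor(flats$area, flats$price)
plot(price ~ area, data = flats)
```

To compare another pair of towns, only the condition in subset changes, for example town != 1 to keep towns 2 and 3.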

Exercise 4.4 Work on the data in flat_prices.csv. Let’s define the parameter $p$ as the (true) proportion of flats sold that have 2 rooms or fewer. With significance level 0.05, we want to test the hypotheses \[ H_0: p=0.50 \ \mbox{vs.} \ H_1: p < 0.50\;. \]

a) Try to make a vector countsm that shows the number of small flats (fewer than 3 rooms) and large flats (all others). (Hint: make a new variable in your dataframe, small, indicating whether a flat is small or not, then use table to count the occurrences of values for small.)
b) Your vector countsm from a) should be [59, 91], indicating that there are 59 small and 91 not-small flats. Use this with the binom.test function to test the proposed hypotheses.
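
A minimal sketch of the counting and testing steps is given below. The vector rooms is a hypothetical stand-in, constructed only so that the counts match the 59 and 91 quoted above. In the real dataframe the room count column may have a different name.

```r
# Hypothetical room counts, constructed so that 59 of 150 flats have
# fewer than 3 rooms, matching the counts quoted in the exercise.
rooms <- c(rep(2, 59), rep(4, 91))
small <- rooms < 3

# table() orders logical values as FALSE, TRUE, so reorder explicitly
# to get (small, not small)
countsm <- as.vector(table(small)[c("TRUE", "FALSE")])
countsm

# Exact binomial test of H0: p = 0.50 against H1: p < 0.50.
# A length-2 vector is read as (successes, failures).
res <- binom.test(countsm, p = 0.50, alternative = "less")
res$estimate   # sample proportion of small flats
res$p.value
```

The order of the two counts matters: binom.test treats the first element as the number of successes, so here a success means a small flat.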

Exercise 4.5 Work on the data in flat_prices.csv. We want to check whether the selling prices of flats appear to follow a normal distribution. Follow the outline at the end of chapter 4 of the text when using R for this.

a) Think of a typical real estate market. Would you expect prices to approximately follow a normal distribution? What typical characteristic of such markets could be in conflict with a normal distribution?
b) Do the normality tests for the price variable in the data set. Comment on your findings. Was your thinking in a) right?
c) Repeat the testing for normality, now using town as a factor. This means you do the test within each town’s real estate market. How would you explain the difference in results for the towns? (Hint: if your dataframe is DF, try something like with(DF, tapply(price, town, qqnorm)) to apply the normal plot function to price for each town. Do something similar with the function shapiro.test.)
d) Are the square meter prices normally distributed? (This assumes you have computed the new variable as suggested in exercise 4.3.)
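
The normality checks can be sketched as below, again on simulated data. The column names price and town are assumptions, and the prices are deliberately drawn from a right-skewed distribution for illustration. Results on the real file will differ.

```r
# Simulated stand-in; the column names price and town are assumptions.
# Prices are drawn from a right-skewed (log-normal) distribution.
set.seed(3)
DF <- data.frame(
  town  = rep(1:3, each = 50),
  price = rlnorm(150, meanlog = 7, sdlog = 0.5)
)

# Whole sample: normal Q-Q plot with reference line, and Shapiro-Wilk test
qqnorm(DF$price)
qqline(DF$price)
shapiro.test(DF$price)

# Per town: tapply returns one test result per group
sw <- with(DF, tapply(price, town, shapiro.test))
sapply(sw, function(h) h$p.value)   # one P-value per town
```

In the Shapiro-Wilk test the null hypothesis is that the data are normally distributed, so a small P-value is evidence against normality.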