Final Exam Summer II 2016 Answers

Part A

1. a. The Greek letter sigma denotes the population standard deviation.
2. c. The sample mean is a random variable because it changes every time a new experiment is run. The other choices are constants, which are chosen before the experiment is run.
3. c. Because the sample mean is the average of independent outcomes of a random variable, the central limit theorem states that the distribution of the sample mean gets closer to a normal distribution as the sample size grows, even if the population distribution is not normal.
4. c. If the distribution is skewed to the right, the population mean, but not the population median, is pulled to the right. This means that mu, the population mean, is greater than nu, the population median.
5. a. If you draw a line through the middle points of the normal plot, you see that the extreme points fall below the line on the left and above the line on the right, which indicates thick tails.
6. This is a bad question. Everyone received credit for it.
7. This is also a bad question, for which there are no correct answers. This question was omitted.
8. a. All of the other plots are either biased or heteroscedastic.
9. d. A model has a multicollinearity problem if the variance inflation factor (VIF) is greater than 5.
10. For the in-class section, the probability corresponding to odds p = 0.2 is p / (1 + p) = 0.2 / 1.2 = 1/6 = 0.1667, which was none of the above. For the online section, the probability corresponding to odds p = 0.25 is 0.25 / 1.25 = 1/5 = 0.2.

Part B

1. Using the Tukey's hinges method, where the middle observation is counted in both halves of the dataset: Q1 = 75, Q2 = 81, Q3 = 95, and IQR = 95 - 75 = 20.

[Boxplot: box from Q1 = 75 to Q3 = 95 with median line at 81; 8 plotted as an extreme outlier (*) and 43 as a mild outlier (O), on an axis from 0 to 100.]

8 is an extreme outlier; it is outside the outer fence at 75 - 3 * 20 = 15. The inner fence is at 75 - 1.5 * 20 = 45.
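The fence arithmetic can be double-checked with a short script. This is a sketch: the quartiles and the values 8 and 43 come from the answer above; the helper function is illustrative, not part of the exam.

```python
# Tukey's fences on the low side (Q1 and Q3 taken from the answer above).
Q1, Q3 = 75, 95
IQR = Q3 - Q1                  # 95 - 75 = 20
inner_low = Q1 - 1.5 * IQR     # inner fence: 45
outer_low = Q1 - 3.0 * IQR     # outer fence: 15

def classify(value):
    """Label a low-side value as an extreme outlier, mild outlier, or neither."""
    if value < outer_low:
        return "extreme outlier"
    if value < inner_low:
        return "mild outlier"
    return "not an outlier"

print(classify(8))    # extreme outlier (8 < 15)
print(classify(43))   # mild outlier (15 <= 43 < 45)
```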
43 is a mild outlier; it is between the inner fence and the outer fence.

2. When n = 3, the normal scores divide the standard normal density into n + 1 = 4 equal areas of 1/4 = 0.25 each. Look up 0.25 in the body of the z-table to obtain the z-score -0.67. Also look up 0.5 and 0.75 to obtain 0.0 and 0.67, respectively. The expected normal scores are -0.67, 0.0, and 0.67.

3. a. The null hypothesis H0 is that mu = 2800.
b. Assuming H0, the test statistic is t = (xbar - mu) / (s / sqrt(n)) = (3075 - 2800) / (500 / sqrt(25)) = 2.75.
c. Using the t-table with df = n - 1 = 25 - 1 = 24, a 95% confidence interval for the test statistic is I = (-2.064, 2.064).
d. t is not in I, so reject the null hypothesis.

4. xbar = 5, sx = 1.5, ybar = 1000, sy = 75, rxy = 0.8.
y - ybar = ((rxy * sy) / sx) * (x - xbar)
y - 1000 = ((0.8 * 75) / 1.5) * (x - 5)
y - 1000 = 40 * (x - 5)
y = 40 * x - 200 + 1000
y = 40 * x + 800

5. The confidence interval for the true expected value of y_i is symmetric around the predicted value y_i hat. The prediction interval for a new observation is also symmetric around y_i hat, but this interval is wider because the standard deviation of a new observation is larger than the standard error of the predicted value.

6. p = 7 + 1 = 8, n = 24, SSM = 127, SSE = 95, DFM = p - 1 = 7, DFE = n - p = 24 - 8 = 16. F = (SSM/DFM) / (SSE/DFE) = (127/7) / (95/16) = 3.056. A 95% confidence interval for the F statistic, assuming H0 that none of the independent variables are significant, is I = (0, 2.66). F is not in I, so reject H0.

7. R-squared = SSM / (SSM + SSE) = 127 / (127 + 95) = 0.572. Adjusted R-squared = 1 - (SSE/DFE) / ((SSM + SSE)/(n - 1)) = 1 - (95/16) / (222/23) = 0.385.

8. Let b = beta_1, bhat = beta_1 hat, and seb = the standard error of beta_1 hat. Then t = (bhat - b) / seb = (3.5 - b) / 1 = 3.5 - b. Use a t-table to find a 95% confidence interval for t, with degrees of freedom = DFE = 16: (-2.12, 2.12). From -2.12 <= 3.5 - b <= 2.12, we get 1.38 <= b <= 5.62, so the confidence interval is (1.38, 5.62).

9.
An influence point is a point that creates a large change in the regression model if it is removed from the dataset. Here are some popular measures of influence:

z_i* = the externally studentized residual for observation i.
h_ii = the ith element of the diagonal of the hat matrix H. It indicates the relative contribution of y_i to y_i hat.
Cook's D_i = the average squared difference of the predicted values y_j hat and the predicted values computed with the ith observation deleted.
DFBETAS_j(i) = the difference of the jth regression coefficient and the jth regression coefficient computed with the ith observation deleted.
DFFITS_i = the difference of the ith predicted value and the ith predicted value computed with the ith observation deleted.

10. a. Split the dataset into a test set TEST and a training set TRAIN. Find the regression model using TRAIN. Test the model on TEST, for example, by finding the R-squared for prediction.
b. Use k-fold crossvalidation: split the data into k subsets. For each subset, fit a regression model to the rest of the data and use the held-out subset to compute the R-squared for prediction.
c. Jackknifing or leave-one-out crossvalidation. Compute the PRESS statistic.

11. The logistic regression equation is log(p / (1 - p)) = b_0 + b_1 * x_1 + b_2 * x_2 + ... + b_(p-1) * x_(p-1), where the distribution is binomial and the link function is the logit function, defined by logit(p) = log(p / (1 - p)). Thus p, the probability of success, is determined by the settings of the independent variables. A maximum likelihood algorithm is used to compute the estimated coefficients.

Part C

1. Model 2 and Model 3 are the best regression models because their adjusted R-squared values are large and their VIF values are less than 5.0. I prefer Model 2 because it is more parsimonious; its adjusted R-squared value of 0.9733 is almost as large as the 0.9868 of Model 3. The residual plots for Model 2 show that the residuals are unbiased, homoscedastic, and normal. R identifies three influence points.
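The leverage and Cook's D measures can be illustrated with a tiny made-up dataset. This is only a sketch under assumptions: the data are hypothetical, the model is simple linear regression (one predictor), and it uses the closed-form leverage h_ii = 1/n + (x_i - xbar)^2 / Sxx together with the standard identity Cook's D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2.

```python
# Hypothetical data (not from the exam); x = 10 is far from the other x values,
# so it should show high leverage.
x = [1, 2, 3, 4, 10]
y = [1.1, 1.9, 3.2, 3.9, 6.0]
n, p = len(x), 2               # p = number of coefficients (intercept, slope)

# Least-squares fit of y = b0 + b1 * x.
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in resid) / (n - p)

# Leverage (diagonal of the hat matrix) and Cook's D for each observation.
leverage = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]
cooks = [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
         for e, h in zip(resid, leverage)]

for xi, h, d in zip(x, leverage, cooks):
    print(f"x={xi:3}  leverage={h:.3f}  Cook's D={d:.3f}")
# The point x = 10 has by far the largest leverage (0.92) and Cook's D.
```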
These are due to the Cook's D statistic and the Cov Ratio, which we did not discuss.

2. To get yhat, substitute into the regression equation:
yhat = -111.54 + 21.52 * x4 + 23.08 * x5
     = -111.54 + 21.52 * (6000/1000) + 23.08 * (10000/1000)
     = -111.54 + 21.52 * 6 + 23.08 * 10
     = 248.38
The predicted annual sales are 248.38 * 1000 = $248,380.
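The substitution can be checked with a few lines; the coefficients and inputs are taken directly from the answer above (x4 and x5 are in thousands of dollars, as is yhat).

```python
# Coefficients from the fitted regression equation in the answer above.
b0, b4, b5 = -111.54, 21.52, 23.08
x4 = 6000 / 1000      # 6 (thousands of dollars)
x5 = 10000 / 1000     # 10 (thousands of dollars)

yhat = b0 + b4 * x4 + b5 * x5
print(round(yhat, 2))               # 248.38
print(f"${round(yhat * 1000):,}")   # $248,380
```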