Final Exam Summer II 2016 Answers

Part A

1. a. The Greek letter sigma denotes the population standard deviation.
2. c. The sample mean is a random variable because it changes every time a new experiment is run. The other choices are constants, which are chosen before the experiment is run.
3. c. Because the sample mean is the average of independent outcomes of a random variable, the central limit theorem states that the distribution of the sample mean gets closer to a normal distribution as the sample size grows, even if the population distribution is not normal.
4. c. If the distribution is skewed to the right, the population mean, but not the population median, is pulled to the right. This means that mu, the population mean, is greater than nu, the population median.
5. a. If you draw a line through the middle points of the normal plot, you see that the extreme points fall below the line on the left and above the line on the right, which indicates thick tails.
6. This is a bad question. Everyone received credit for it.
7. This is also a bad question, for which there are no correct answers. This question was omitted.
8. a. All of the other plots are either biased or heteroscedastic.
9. d. A model has a multicollinearity problem if the variance inflation factor (VIF) is greater than 5.
10. For the in-class section, the probability corresponding to odds p = 0.2 is p / (1 + p) = 0.2 / 1.2 = 1/6 = 0.1667, which was none of the above. For the online section, the probability corresponding to odds p = 0.25 is 0.25 / 1.25 = 1/5 = 0.2.

Part B

1. Using the Tukey's hinges method, where the middle observation is counted in both halves of the dataset: Q1 = 75, Q2 = 81, Q3 = 95, and IQR = 95 - 75 = 20.

[Boxplot: box from Q1 = 75 to Q3 = 95 with median line at 81; 8 plotted as an extreme outlier (*) and 43 as a mild outlier (O), on an axis from 0 to 100.]

8 is an extreme outlier; it is outside the outer fence at 75 - 3 * 20 = 15. The inner fence is at 75 - 1.5 * 20 = 45.
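The fence arithmetic can be double-checked with a short script. This is a sketch: the quartiles and the values 8 and 43 come from the answer above; the helper function is illustrative, not part of the exam.

```python
# Tukey's fences on the low side (Q1 and Q3 taken from the answer above).
Q1, Q3 = 75, 95
IQR = Q3 - Q1                  # 95 - 75 = 20
inner_low = Q1 - 1.5 * IQR     # inner fence: 45
outer_low = Q1 - 3.0 * IQR     # outer fence: 15

def classify(value):
    """Label a low-side value as an extreme outlier, mild outlier, or neither."""
    if value < outer_low:
        return "extreme outlier"
    if value < inner_low:
        return "mild outlier"
    return "not an outlier"

print(classify(8))    # extreme outlier (8 < 15)
print(classify(43))   # mild outlier (15 <= 43 < 45)
```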
43 is a mild outlier; it is between the inner fence and the outer fence.

2. When n = 3, the normal scores divide the standard normal density into n + 1 = 4 equal areas of 1/4 = 0.25 each. Look up 0.25 in the body of the z-table to obtain the z-score -0.67. Also look up 0.5 and 0.75 to obtain 0.0 and 0.67, respectively. The expected normal scores are -0.67, 0.0, and 0.67.

3. a. The null hypothesis H0 is that mu = 2800.
b. Assuming H0, the test statistic is t = (xbar - mu) / (s / sqrt(n)) = (3075 - 2800) / (500 / sqrt(25)) = 2.75.
c. Using the t-table with df = n - 1 = 25 - 1 = 24, a 95% confidence interval for the test statistic is I = (-2.064, 2.064).
d. t is not in I, so reject the null hypothesis.

4. xbar = 5, sx = 1.5, ybar = 1000, sy = 75, rxy = 0.8.
y - ybar = ((rxy * sy) / sx) * (x - xbar)
y - 1000 = ((0.8 * 75) / 1.5) * (x - 5)
y - 1000 = 40 * (x - 5)
y = 40 * x - 200 + 1000
y = 40 * x + 800

5. The confidence interval for the true expected value of y_i is symmetric around the predicted value y_i hat. The prediction interval for a new observation is also symmetric around y_i hat, but this interval is wider because the standard deviation of a new observation is larger than the standard error of the predicted value.

6. p = 7 + 1 = 8, n = 24, SSM = 127, SSE = 95, DFM = p - 1 = 7, DFE = n - p = 24 - 8 = 16. F = (SSM/DFM) / (SSE/DFE) = (127/7) / (95/16) = 3.056. A 95% confidence interval for the F statistic, assuming H0 that none of the independent variables are significant, is I = (0, 2.66). F is not in I, so reject H0.

7. R-squared = SSM / (SSM + SSE) = 127 / (127 + 95) = 0.572. Adjusted R-squared = 1 - (SSE/DFE) / ((SSM + SSE)/(n - 1)) = 1 - (95/16) / (222/23) = 0.385.

8. Let b = beta_1, bhat = beta_1 hat, and seb = the standard error of beta_1 hat. Then t = (bhat - b) / seb = (3.5 - b) / 1 = 3.5 - b. Use a t-table to find a 95% confidence interval for t, with degrees of freedom = DFE = 16: (-2.12, 2.12). From -2.12 <= 3.5 - b <= 2.12, we get 1.38 <= b <= 5.62, so the confidence interval is (1.38, 5.62).

9.
An influence point is a point that creates a large change in the regression model if it is removed from the dataset. Here are some popular measures of influence:

z_i* = the externally studentized residual for observation i.
h_ii = the ith element of the diagonal of the hat matrix H. It indicates the relative contribution of y_i to y_i hat.
Cook's D_i = the average squared difference of the predicted values y_j hat and the predicted values computed with the ith observation deleted.
DFBETAS_j(i) = the difference of the jth regression coefficient and the jth regression coefficient computed with the ith observation deleted.
DFFITS_i = the difference of the ith predicted value and the ith predicted value computed with the ith observation deleted.

10. a. Split the dataset into a test set TEST and a training set TRAIN. Find the regression model using TRAIN. Test the model on TEST, for example, by finding the R-squared for prediction.
b. Use k-fold crossvalidation: split the data into k subsets. For each subset, fit a regression model to the rest of the data and use the held-out subset to compute the R-squared for prediction.
c. Jackknifing or leave-one-out crossvalidation. Compute the PRESS statistic.

11. The logistic regression equation is log(p / (1 - p)) = b_0 + b_1 * x_1 + b_2 * x_2 + ... + b_(p-1) * x_(p-1), where the distribution is binomial and the link function is the logit function, defined by logit(p) = log(p / (1 - p)). Thus p, the probability of success, is determined by the settings of the independent variables. A maximum likelihood algorithm is used to compute the estimated coefficients.

Part C

1. Model 2 and Model 3 are the best regression models because their adjusted R-squared values are large and their VIF values are less than 5.0. I prefer Model 2 because it is more parsimonious; its adjusted R-squared value of 0.9733 is almost as large as the 0.9868 of Model 3. The residual plots for Model 2 show that the residuals are unbiased, homoscedastic, and normal. R identifies three influence points.
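The leverage and Cook's D measures can be illustrated with a tiny made-up dataset. This is only a sketch under assumptions: the data are hypothetical, the model is simple linear regression (one predictor), and it uses the closed-form leverage h_ii = 1/n + (x_i - xbar)^2 / Sxx together with the standard identity Cook's D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2.

```python
# Hypothetical data (not from the exam); x = 10 is far from the other x values,
# so it should show high leverage.
x = [1, 2, 3, 4, 10]
y = [1.1, 1.9, 3.2, 3.9, 6.0]
n, p = len(x), 2               # p = number of coefficients (intercept, slope)

# Least-squares fit of y = b0 + b1 * x.
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in resid) / (n - p)

# Leverage (diagonal of the hat matrix) and Cook's D for each observation.
leverage = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]
cooks = [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
         for e, h in zip(resid, leverage)]

for xi, h, d in zip(x, leverage, cooks):
    print(f"x={xi:3}  leverage={h:.3f}  Cook's D={d:.3f}")
# The point x = 10 has by far the largest leverage (0.92) and Cook's D.
```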
These are due to the Cook's D statistic and the Cov Ratio, which we did not discuss.

2. To get yhat, substitute into the regression equation:
yhat = -111.54 + 21.52 * x4 + 23.08 * x5
     = -111.54 + 21.52 * (6000/1000) + 23.08 * (10000/1000)
     = -111.54 + 21.52 * 6 + 23.08 * 10
     = 248.38
The predicted annual sales are 248.38 * 1000 = $248,380.
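The substitution can be checked with a few lines; the coefficients and inputs are taken directly from the answer above (x4 and x5 are in thousands of dollars, as is yhat).

```python
# Coefficients from the fitted regression equation in the answer above.
b0, b4, b5 = -111.54, 21.52, 23.08
x4 = 6000 / 1000      # 6 (thousands of dollars)
x5 = 10000 / 1000     # 10 (thousands of dollars)

yhat = b0 + b4 * x4 + b5 * x5
print(round(yhat, 2))               # 248.38
print(f"${round(yhat * 1000):,}")   # $248,380
```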