
CSC 423 -- 8/2/17

Review Exercises

  1. What are two commonly used transformations for a regression model? What is the benefit of using these transformations?
    Ans: Two commonly used transformations are the log and square root transforms, or more generally a power transform, x → x^u, where 0 < u < 1. These transforms (a) make heteroscedastic residuals more nearly homoscedastic, and (b) reduce the effect of influential points.
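    For example (a sketch using hypothetical variables y and x in a data frame dat), either transform can be applied directly inside an R model formula:
    # Sketch: refit the same model with a transformed response
    fit.raw  <- lm(y ~ x, data = dat)         # original scale
    fit.log  <- lm(log(y) ~ x, data = dat)    # log transform of y
    fit.sqrt <- lm(sqrt(y) ~ x, data = dat)   # square root transform of y
    # Compare residual plots to check whether the spread is more constant
    plot(fitted(fit.log), resid(fit.log))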
     
  2. Suppose you have a regression model defined in R by
    model <- lm(y ~ x1 + x2 + x3 + x4)
    or in SAS by
    proc reg;
      model y = x1 x2 x3 x4;
    run;
    If all independent variables are held constant except for x3 and x3 increases by 5, how much does y increase in terms of the regression coefficient β3?
    Ans: ynew - yorig =
    (β0 + β1x1 + β2x2 + β3(x3 + 5) + β4x4) - (β0 + β1x1 + β2x2 + β3x3 + β4x4) =
    β3(x3 + 5) - β3x3 = β3x3 + 5β3 - β3x3 = 5β3
     
  3. What is a degree of freedom?
    Ans: Diagrams 1 and 2 illustrate the answer to this question. A degree of freedom is a dimension in the observation space (n dimensions), parameter space (p dimensions), or residual space (n-p dimensions).
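    As a quick check (a sketch using simulated, made-up data), R reports the residual degrees of freedom n - p for a fitted model:
    # Sketch: n = 20 observations, p = 3 estimated parameters
    set.seed(1)
    n <- 20
    x1 <- rnorm(n); x2 <- rnorm(n)
    y <- 1 + 2*x1 - x2 + rnorm(n)
    fit <- lm(y ~ x1 + x2)
    df.residual(fit)    # n - p = 17, the dimension of the residual space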
     
  4. What is the difference between a confidence interval and a prediction interval for predicted y-values?
    Ans: A 95% confidence interval for the mean value of y at given settings of the independent variables contains that mean 95% of the time if the residuals are well behaved. A 95% prediction interval contains a new individual observation at those settings 95% of the time. The prediction interval is always wider than the confidence interval because predicting a new observation involves both the uncertainty in the estimated mean response and the variability of individual observations around that mean, so its standard error is larger than the standard error of the estimated mean. Both the confidence interval and the prediction interval are symmetric around the predicted value. Look at the Hamster Example for Prediction and Confidence Intervals.
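    In R, both intervals come from predict(); this sketch assumes a fitted lm object called model and a data frame newdata of x-settings, as in the exercises below:
    # Sketch: confidence vs. prediction intervals for the same x-settings
    predict(model, newdata, interval = "confidence", level = 0.95)
    predict(model, newdata, interval = "prediction", level = 0.95)
    # The prediction limits are always wider than the confidence limits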
     
  5. Using the CrudeOil dataset crude-oil.txt, compute by hand a 90% confidence interval for the estimated regression parameter corresponding to pressure. Check your answer with SAS and R. Use the clb option for SAS and the confint function for R.
    Ans: The estimated regression parameter for pressure is β1 = 0.00956; its standard error is SEβ1 = 0.00191. The 90% confidence interval for β1 is
    β1 ± t(0.95, df) SEβ1 = 0.00956 ± t(0.95, df)(0.00191),
    where t(0.95, df) is the 95th percentile of the t distribution with df = n - p error degrees of freedom.
    To compute this confidence interval using SAS, include the clb option in the model statement (with alpha=0.10 to obtain 90% limits). Use the R function call confint(model, level = 0.90) to compute 90% confidence intervals for the estimated parameters.
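    A sketch of the R computation, assuming crude-oil.txt is read with read.table and has columns named recovery, pressure, and angle:
    # Sketch: 90% confidence intervals for the regression parameters
    crude <- read.table("crude-oil.txt", header = TRUE)    # assumed file layout
    model <- lm(recovery ~ pressure + angle, data = crude)
    confint(model, level = 0.90)    # 90% CIs for all estimated parameters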
     
  6. Look at the Prediction Example. How can you use a regression equation to predict the value of the dependent variable for new settings of the independent variables?  Using the CrudeOil dataset crude-oil.txt, what is the predicted value of recovery when pressure = 1700 and angle = 9?
    Ans: The regression equation obtained from either SAS or R is
    Substituting in the values of pressure = 1700 and angle = 9 into this regression equation gives
    To obtain this predicted value with SAS, add this line to the end of the input dataset:
    These are the desired settings for the independent variables with the dependent variable value set to missing ( . ).
    To accomplish this in R, create a new dataframe called newdata with the desired settings of the independent variables:
    newdata <- data.frame(pressure = 1700, angle = 9)
    If more than one setting is needed, pass in a vector for each column (see the sketch below).
    Then obtain the predicted values for these settings with
    predict(model, newdata)
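    Putting the R steps together (a sketch; the read.table options and the second row of settings, pressure = 2000 and angle = 15, are made up for illustration):
    # Sketch: predicted recovery at new settings of pressure and angle
    crude <- read.table("crude-oil.txt", header = TRUE)    # assumed file layout
    model <- lm(recovery ~ pressure + angle, data = crude)
    coef(model)                     # the fitted regression equation
    # One setting
    newdata <- data.frame(pressure = 1700, angle = 9)
    predict(model, newdata)
    # Several settings: pass a vector for each column (values below are illustrative)
    newdata2 <- data.frame(pressure = c(1700, 2000), angle = c(9, 15))
    predict(model, newdata2, interval = "prediction")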
  7. What do the SLENTRY and SLSTAY options do in proc reg with the stepwise selection option?
    Ans: SLSTAY sets the significance threshold for removing variables (backward elimination). At each step, the independent variable whose regression parameter has the largest p-value is thrown out if that p-value is > SLSTAY. Only when all of the p-values in the model are < SLSTAY does the backward elimination stop.
    SLENTRY sets the significance threshold for adding variables (forward selection). At each step, the candidate independent variable whose regression parameter has the smallest p-value is added to the model if that p-value is < SLENTRY. Only when all of the remaining candidates' p-values are > SLENTRY does the forward selection stop. With the stepwise option, both thresholds are used: variables enter according to SLENTRY, and variables already in the model can later be removed according to SLSTAY.
     
  8. What is k-fold crossvalidation? What is leave-one-out crossvalidation?
    Ans: The dataset is randomly divided into k folds, each fold consisting of 1/k of the data. Fold 1 is taken to be the test set and the rest of the data is taken to be the training set. A regression model is fit to the training set and evaluated on the test set. See ... for details. This process is repeated with Fold 2, Fold 3, ... , Fold k as the test set, and the k test-set error measures are combined (for example, averaged) to estimate how well the model predicts new data.
     
    For the leave-one-out crossvalidation method, compute the PRESS statistic
    PRESS = Σ (yi - ŷ(i))², summing over i = 1, ... , n,
    where ŷ(i) is the predicted value of yi computed from the model fit to the dataset with the ith point deleted. The PRESS statistic is good if PRESS < 1.5 SSE.
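    For a linear model, R can compute PRESS without refitting n times, using the identity that the deleted residual equals ei / (1 - hii), where hii is the ith hat (leverage) value. A sketch, assuming a fitted lm object called model:
    # Sketch: PRESS from ordinary residuals and leverages
    press <- sum((resid(model) / (1 - hatvalues(model)))^2)
    sse   <- sum(resid(model)^2)
    press < 1.5 * sse    # rule of thumb from the answer above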
     
  9. Show how to use R and SAS to split a dataset into training and test sets. Ans:
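    A minimal R sketch of the splitting step (the 70/30 proportion, the file layout, and the model formula are illustrative):
    # Sketch: random 70/30 split into training and test sets
    set.seed(123)                                           # for reproducibility
    crude <- read.table("crude-oil.txt", header = TRUE)     # assumed file layout
    n <- nrow(crude)
    train.rows <- sample(1:n, size = round(0.7 * n))
    train <- crude[train.rows, ]
    test  <- crude[-train.rows, ]
    # Fit on the training set and evaluate on the test set
    model <- lm(recovery ~ pressure + angle, data = train)
    pred  <- predict(model, test)
    mean((test$recovery - pred)^2)                          # test mean squared error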

 

More about Model Validation

 

Dummy Variables

 

Project 4

 

Topics in Categorical Data Analysis

 

A Quadratic Model

 

Interaction Terms

 

Transformations

 

The Durbin-Watson Test