
CSC 423 -- 8/2/17

Review Exercises

  1. What are two commonly used transformations for a regression model? What is the benefit of using these transformations?
    Ans: Two commonly used transformations are the log and square root transforms, or more generally a power transform, x → x^u, where 0 < u < 1. These transforms (a) make heteroscedastic residuals more nearly homoscedastic, and (b) reduce the effect of influential points.
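    For example (a sketch using hypothetical variables y and x in a data frame dat), either transform can be applied directly inside an R model formula:
    # Sketch: refit the same model with a transformed response
    fit.raw  <- lm(y ~ x, data = dat)         # original scale
    fit.log  <- lm(log(y) ~ x, data = dat)    # log transform of y
    fit.sqrt <- lm(sqrt(y) ~ x, data = dat)   # square root transform of y
    # Compare residual plots to check whether the spread is more constant
    plot(fitted(fit.log), resid(fit.log))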
     
  2. Suppose you have a regression model defined in R by
    model <- lm(y ~ x1 + x2 + x3 + x4)
    or in SAS by
    proc reg;
      model y = x1 x2 x3 x4;
    run;
    If all independent variables are held constant except for x3 and x3 increases by 5, how much does y increase in terms of the regression coefficient β3?
    Ans: ynew - yorig =
    (β0 + β1x1 + β2x2 + β3(x3 + 5) + β4x4) - (β0 + β1x1 + β2x2 + β3x3 + β4x4) =
    β3(x3 + 5) - β3x3 = β3x3 + 5β3 - β3x3 = 5β3
     
  3. What is a degree of freedom?
    Ans: Diagrams 1 and 2 illustrate the answer to this question. A degree of freedom is a dimension in the observation space (n dimensions), parameter space (p dimensions), or residual space (n-p dimensions).
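    As a quick check (a sketch using simulated, made-up data), R reports the residual degrees of freedom n - p for a fitted model:
    # Sketch: n = 20 observations, p = 3 estimated parameters
    set.seed(1)
    n <- 20
    x1 <- rnorm(n); x2 <- rnorm(n)
    y <- 1 + 2*x1 - x2 + rnorm(n)
    fit <- lm(y ~ x1 + x2)
    df.residual(fit)    # n - p = 17, the dimension of the residual space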
     
  4. What is the difference between a confidence interval and a prediction interval for predicted y-values?
    Ans: A 95% confidence interval for the mean value of y at given settings of the independent variables contains that mean 95% of the time if the residuals are well behaved. A 95% prediction interval contains a new individual observation at those settings 95% of the time. The prediction interval is always wider than the confidence interval because predicting a new observation involves both the uncertainty in the estimated mean response and the variability of individual observations around that mean, so its standard error is larger than the standard error of the estimated mean. Both the confidence interval and the prediction interval are symmetric around the predicted value. Look at the Hamster Example for Prediction and Confidence Intervals.
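    In R, both intervals come from predict(); this sketch assumes a fitted lm object called model and a data frame newdata of x-settings, as in the exercises below:
    # Sketch: confidence vs. prediction intervals for the same x-settings
    predict(model, newdata, interval = "confidence", level = 0.95)
    predict(model, newdata, interval = "prediction", level = 0.95)
    # The prediction limits are always wider than the confidence limits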
     
  5. Using the CrudeOil dataset crude-oil.txt, compute by hand a 90% confidence interval for the estimated regression parameter corresponding to pressure. Check your answer with SAS and R. Use the clb option for SAS and the confint function for R.
    Ans: The estimated regression parameter for pressure is β1 = 0.00956; its standard error is SEβ1 = 0.00191. The 90% confidence interval for β1 is
    β1 ± t(0.95, df) SEβ1 = 0.00956 ± t(0.95, df)(0.00191),
    where t(0.95, df) is the 95th percentile of the t distribution with df = n - p error degrees of freedom.
    To compute this confidence interval using SAS, include the clb option in the model statement (with alpha=0.10 to obtain 90% limits). Use the R function call confint(model, level = 0.90) to compute 90% confidence intervals for the estimated parameters.
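    A sketch of the R computation, assuming crude-oil.txt is read with read.table and has columns named recovery, pressure, and angle:
    # Sketch: 90% confidence intervals for the regression parameters
    crude <- read.table("crude-oil.txt", header = TRUE)    # assumed file layout
    model <- lm(recovery ~ pressure + angle, data = crude)
    confint(model, level = 0.90)    # 90% CIs for all estimated parameters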
     
  6. Look at the Prediction Example. How can you use a regression equation to predict the value of the dependent variable for new settings of the independent variables?  Using the CrudeOil dataset crude-oil.txt, what is the predicted value of recovery when pressure = 1700 and angle = 9?
    Ans: The regression equation obtained from either SAS or R is
    Substituting in the values of pressure = 1700 and angle = 9 into this regression equation gives
    To obtain this predicted value with SAS, add this line to the end of the input dataset:
    These are the desired settings for the independent variables with the dependent variable value set to missing ( . ).
    To accomplish this in R, create a new dataframe called newdata with the desired settings of the independent variables:
    newdata <- data.frame(pressure = 1700, angle = 9)
    If more than one setting is needed, pass in a vector for each column (see the sketch below).
    Then obtain the predicted values for these settings with
    predict(model, newdata)
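    Putting the R steps together (a sketch; the read.table options and the second row of settings, pressure = 2000 and angle = 15, are made up for illustration):
    # Sketch: predicted recovery at new settings of pressure and angle
    crude <- read.table("crude-oil.txt", header = TRUE)    # assumed file layout
    model <- lm(recovery ~ pressure + angle, data = crude)
    coef(model)                     # the fitted regression equation
    # One setting
    newdata <- data.frame(pressure = 1700, angle = 9)
    predict(model, newdata)
    # Several settings: pass a vector for each column (values below are illustrative)
    newdata2 <- data.frame(pressure = c(1700, 2000), angle = c(9, 15))
    predict(model, newdata2, interval = "prediction")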
  7. What do the SLENTRY and SLSTAY options do in proc reg with the stepwise selection option?
    Ans: SLSTAY sets the significance threshold for removing variables (backward elimination). At each step, the independent variable whose regression parameter has the largest p-value is thrown out if that p-value is > SLSTAY. Only when all of the p-values in the model are < SLSTAY does the backward elimination stop.
    SLENTRY sets the significance threshold for adding variables (forward selection). At each step, the candidate independent variable whose regression parameter has the smallest p-value is added to the model if that p-value is < SLENTRY. Only when all of the remaining candidates' p-values are > SLENTRY does the forward selection stop. With the stepwise option, both thresholds are used: variables enter according to SLENTRY, and variables already in the model can later be removed according to SLSTAY.
     
  8. What is k-fold crossvalidation? What is leave-one-out crossvalidation?
    Ans: The dataset is randomly divided into k folds, each fold consisting of 1/k of the data. Fold 1 is taken to be the test set and the rest of the data is taken to be the training set. A regression model is fit to the training set and evaluated on the test set. See ... for details. This process is repeated with Fold 2, Fold 3, ... , Fold k as the test set, and the k test-set error measures are combined (for example, averaged) to estimate how well the model predicts new data.
     
    For the leave-one-out crossvalidation method, compute the PRESS statistic
    PRESS = Σ (yi - ŷ(i))², summing over i = 1, ... , n,
    where ŷ(i) is the predicted value of yi computed from the model fit to the dataset with the ith point deleted. The PRESS statistic is good if PRESS < 1.5 SSE.
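    For a linear model, R can compute PRESS without refitting n times, using the identity that the deleted residual equals ei / (1 - hii), where hii is the ith hat (leverage) value. A sketch, assuming a fitted lm object called model:
    # Sketch: PRESS from ordinary residuals and leverages
    press <- sum((resid(model) / (1 - hatvalues(model)))^2)
    sse   <- sum(resid(model)^2)
    press < 1.5 * sse    # rule of thumb from the answer above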
     
  9. Show how to use R and SAS to split a dataset into training and test sets. Ans:
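    A minimal R sketch of the splitting step (the 70/30 proportion, the file layout, and the model formula are illustrative):
    # Sketch: random 70/30 split into training and test sets
    set.seed(123)                                           # for reproducibility
    crude <- read.table("crude-oil.txt", header = TRUE)     # assumed file layout
    n <- nrow(crude)
    train.rows <- sample(1:n, size = round(0.7 * n))
    train <- crude[train.rows, ]
    test  <- crude[-train.rows, ]
    # Fit on the training set and evaluate on the test set
    model <- lm(recovery ~ pressure + angle, data = train)
    pred  <- predict(model, test)
    mean((test$recovery - pred)^2)                          # test mean squared error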

 

More about Model Validation

 

Dummy Variables

 

Project 4

 

Topics in Categorical Data Analysis

 

A Quadratic Model

 

Interaction Terms

 

Transformations

 

The Durbin-Watson Test