## CSC 423 -- Project 3

Each problem is worth 5 points.

### Part A. Banking Dataset

• Use the Banking Dataset banking.txt for this part. This dataset consists of data acquired from banking and census records for different zip codes in the bank's current market. Such information can be useful in targeting advertising for new customers or for choosing locations for branch offices.

• The fields in the dataset:
1. Median age of the population (Age)
2. Median years of education (Education)
3. Median income (Income) in \$
4. Median home value (HomeVal) in \$
5. Median household wealth (Wealth) in \$
6. Average bank balance (Balance) in \$

• Goal: to define a regression model to predict the average bank balance as a function of the other variables.

• Problems

1. Create and print a SAS dataset or R dataframe named Banking.

2. Create scatterplots to visualize the associations between bank balance and the other five variables. Do the associations appear to be linear?

3. Compute the correlations of bank balance with each of the other variables. Interpret the correlation values. Which variables appear to be strongly associated with bank balance?

4. Fit a regression model of balance vs the other five variables. Write the expression of the estimated regression model.

5. Are there any influential points for this model?

6. Which of the five predictors have a significant effect on balance? (α=.05)

7. A good model should contain only significant independent variables, so remove the variable with the largest p-value (> 0.05) and refit the regression model of balance vs. the remaining four predictors. Write down the expression of the new regression model. Do NOT drop more than one insignificant variable at a time; remove them one at a time. When one variable is removed from a regression model, it often happens that variables that were non-significant in the original model become significant in the reduced model.

8. Do all four remaining predictors have a significant association with balance (α=.05)? If not, continue to remove one insignificant variable at a time until all of the remaining predictors are significant.

9. Interpret each of the regression coefficients for the final model.

10. Compute the standardized coefficients (SAS option stb in the model statement). Discuss which variable has the strongest influence on balance.

11. Discuss the coefficient of determination, R-squared, for the final model.

12. Discuss the five steps of the overall F-test for regression for the final model.

13. For the final model, create residual plots (r.*p. and r.*x1, ... r.*xn, where x1, ... , xn are the independent variables) and the normal plot of the residuals. Interpret these plots.

14. Are there any influential points for your final regression model?
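
As a rough illustration of Problem 10, a standardized coefficient can be obtained by rescaling a raw slope by the ratio of the predictor's standard deviation to the response's standard deviation. The sketch below is in Python rather than SAS or R, uses only the standard library, and runs on made-up toy numbers. The function name `standardized_coefficients`, the variable names `x1` and `x2`, and all values are assumptions for illustration; none come from banking.txt.

```python
from statistics import stdev


def standardized_coefficients(slopes, predictors, response):
    """Rescale each raw slope b_j to b_j * sd(x_j) / sd(y)."""
    sd_y = stdev(response)
    return {
        name: b * stdev(predictors[name]) / sd_y
        for name, b in slopes.items()
    }


# Made-up toy data (NOT from banking.txt): y equals 2 * x1 exactly,
# so the least-squares slopes are 2 for x1 and 0 for x2.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [10.0, 30.0, 20.0, 40.0],
}
response = [2.0, 4.0, 6.0, 8.0]
slopes = {"x1": 2.0, "x2": 0.0}

std = standardized_coefficients(slopes, predictors, response)
for name, value in std.items():
    print(name, round(value, 4))
```

Because standardized coefficients are unit-free, their absolute values can be compared across predictors measured on different scales, such as years and dollars.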

### Part B. Lathe Dataset

• Use the dataset lathe.txt to answer these questions:

1. Create a SAS or R dataset and print it.

2. Create a regression model for predicting hours from type and rpm. Use a dummy variable for type.

3. Create a scatterplot of hours vs. rpm, using the symbol 'A' or 'B', depending on the type.  See the Movies Example to see how to do this.

4. Form the residual plot and the normal plot of the residuals.

5. Does tool type have a significant effect on hours?
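
The dummy-variable coding asked for in Problem 2 can be sketched as follows. This is a minimal Python illustration, not SAS or R, and the helper name `dummy_columns` and the sample values are hypothetical; the actual contents of lathe.txt are not assumed. A categorical variable with k levels gets k - 1 indicator columns, with the chosen baseline level represented by all zeros.

```python
def dummy_columns(values, baseline):
    """Return k-1 indicator (0/1) columns for a k-level categorical variable.

    The baseline level gets no column of its own; rows at the baseline
    level are zero in every returned column.
    """
    levels = sorted(set(values) - {baseline})
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical tool types (NOT read from lathe.txt), with 'A' as baseline.
tool_type = ["A", "B", "A", "B", "B"]
dummies = dummy_columns(tool_type, baseline="A")
print(dummies)  # {'B': [0, 1, 0, 1, 1]}
```

With type A as the baseline, the coefficient on the single dummy column estimates how much the predicted hours for type B differ from type A at the same rpm.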

### Part C. SalarySurvey Dataset

• Use the dataset salary-survey.txt to answer the following questions.

1. Input the data from the salary-survey file. Create dummy variables to represent educ and mgt. You should have two dummy variables for educ and one for mgt.

2. Create a regression model that predicts salary from exper, educ, and mgt.

3. Create the six pairwise scatterplots using the original variables of the dataset. (Dummy variables don't usually make sense for scatterplots if they are for a variable with more than two levels.)

4. Create the residual plots and the normal plot of the residuals. For the residual plots, plot the residuals vs. the predicted values and the residuals vs. each independent variable.

5. How much of an increase in salary is one additional year of experience likely to produce?

6. How much higher is the predicted salary of a college graduate than that of a person with only a high school degree?

7. For a high school graduate with 3 years of experience and no management responsibilities, what is the predicted salary?

8. Does a person with an advanced degree have a higher predicted salary than a college graduate for this dataset?

9. Find a 95% prediction interval for the person in Problem 7.
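
For reference, the standard form of a prediction interval for a new observation in multiple regression is given below. The notation is an assumption, not taken from the course materials: $n$ is the number of observations, $k$ the number of predictors (dummy variables included), $s = \sqrt{\mathrm{MSE}}$, $X$ the design matrix with a leading column of ones, and $x_0$ the vector of predictor values for the new person with a leading 1. For a 95% interval, $\alpha = 0.05$.

$$
\hat{y}_0 \;\pm\; t_{\alpha/2,\; n-k-1} \cdot s \,\sqrt{1 + x_0^{\top} (X^{\top} X)^{-1} x_0},
\qquad \hat{y}_0 = x_0^{\top} \hat{\beta}
$$

The extra 1 under the square root accounts for the variability of an individual salary around the regression surface, which is why a prediction interval is wider than a confidence interval for the mean response at the same $x_0$.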