CSC 423 -- Project 3
Each problem is worth 5 points.
Part A. Banking Dataset
- Use the Banking Dataset banking.txt for this part. This dataset consists of data acquired from banking and census records for different zip codes in
the bank's current market. Such information can be useful in targeting advertising for new customers or for choosing locations for branch offices.
- The fields in the dataset:
- Median age of the population (Age)
- Median years of education (Education)
- Median income (Income) in $
- Median home value (HomeVal) in $
- Median household wealth (Wealth) in $
- Average bank balance (Balance) in $
- Goal: to define a regression model to predict the average bank balance as a function of the other variables.
- Problems
- Create and print a SAS dataset or R dataframe named Banking.
- Create scatterplots to visualize the associations between bank balance and the other five variables. Do the associations appear to be linear?
- Compute the correlation of bank balance with each of the other variables. Interpret the correlation values. Which variables appear to be strongly associated with balance?
- Fit a regression model of balance vs the other five variables. Write the expression of the estimated regression model.
- Are there any influence points for this model?
- Which of the five predictors have a significant effect on balance? (α=.05)
- A good model should contain only significant independent variables, so
remove the variable with the largest p-value (>0.05) and refit the regression
model of balance vs the remaining four predictors. Write down the expression of
the new regression model. Do NOT drop more than one insignificant variable
at a time; remove them one by one. When one variable is removed from a
regression model, it often happens that variables that were non-significant
in the original model become significant in the reduced model.
- Determine whether all four remaining predictors have a significant association
with balance (α=.05). If not, continue to remove one insignificant variable at a
time until all of the remaining predictors are significant.
- Interpret each of the regression coefficients for the final model.
- Compute the standardized coefficients (SAS option stb in the model
statement). Discuss which variable has the strongest influence on balance.
- Discuss the coefficient of determination, R-squared for the final model.
- Discuss the five steps of the overall F-test for regression for the final
model.
- For the final model, create residual plots (r.*p. and r.*x1, ... , r.*xn, that is, residuals vs
predicted values and residuals vs each of the independent variables x1, ... , xn) and
the normal plot of the residuals. Interpret these plots.
- Are there any influence points for your final regression model?
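As a rough illustration of the correlation and standardized-coefficient problems in Part A, the sketch below computes a Pearson correlation and rescales a raw slope by sd(x)/sd(y), which is one common definition of a standardized coefficient. It is written in plain Python rather than SAS or R only so that it is self-contained. The function names and the toy numbers are assumptions for illustration; they are not taken from banking.txt and are not a solution to the assignment.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def standardized_coef(b, xs, ys):
    """Rescale a raw regression coefficient b by sd(x) / sd(y)."""
    return b * sample_sd(xs) / sample_sd(ys)

# Made-up toy values (hypothetical, NOT from banking.txt).
income = [10.0, 20.0, 30.0, 40.0]
balance = [5.0, 9.0, 13.0, 17.0]  # constructed as 1 + 0.4 * income

r = pearson_r(income, balance)
beta = standardized_coef(0.4, income, balance)  # 0.4 is the known toy slope
```

With a single predictor the standardized slope coincides with the correlation, which is why `r` and `beta` agree on this toy data. With several predictors the standardized coefficients generally differ from the pairwise correlations, so both are worth reporting.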
Part B. Lathe Dataset
- Use the lathe.txt dataset to answer these questions:
- Create a SAS or R dataset and print it.
- Create a regression model for predicting hours from type and rpm. Use a dummy variable
for type.
- Create a scatterplot of hours vs. rpm, using the symbol 'A' or 'B', depending on the type.
See the Movies Example to see how to do this.
- Form the residual plot and the normal plot of the residuals.
- Does tool type have a significant effect on hours?
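The dummy-variable model in Part B can be sketched as follows: tool type is recoded as a 0/1 column and the coefficients come from the normal equations. This is a minimal plain-Python sketch, not SAS or R output. The helper names, the choice of type A as the baseline, and the toy values are all assumptions for illustration and are not taken from lathe.txt.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(n):
            if i != col:
                factor = m[i][col] / m[col][col]
                m[i] = [v - factor * p for v, p in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)

# Made-up toy values (hypothetical, NOT from lathe.txt).
rpm = [500, 700, 900, 500, 700, 900]
tool = ["A", "A", "A", "B", "B", "B"]
hours = [30.0, 26.0, 22.0, 45.0, 41.0, 37.0]  # 40 - 0.02*rpm + 15*(type == "B")

# Design matrix: intercept, rpm, and a dummy that is 1 for type B, 0 for type A.
design = [[1.0, float(s), 1.0 if t == "B" else 0.0] for s, t in zip(rpm, tool)]
b0, b_rpm, b_type = fit_ols(design, hours)
```

Here `b_type` is the estimated difference in hours between type B and type A at the same rpm. Whether that difference is significant is judged from the t-test on the dummy coefficient, which this sketch does not compute.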
Part C. SalarySurvey Dataset
- Use the dataset salary-survey.txt to answer the following questions.
- Input the data from the salary-survey file. Create dummy variables to
represent educ and mgt. You should have two dummy variables for educ and one for mgt.
- Create a regression model that predicts salary from exper, educ, and mgt.
- Create the six pairwise scatterplots using the original variables of the
dataset. (Dummy variables don't usually make sense for scatterplots if
they are for a variable with more than two levels.)
- Create the residual plots and the normal plot of the residuals. For the residual plots, plot the residuals vs. predicted values and
the residuals vs. each independent variable.
- How much of an increase in salary is one additional year of experience
likely to produce?
- How much higher is the predicted salary of a college graduate than the salary
of a person with only a high school degree?
- For a high school graduate with 3 years of experience with no management responsibilities, what is the
predicted salary?
- Does a person with an advanced degree have a higher predicted salary than a college graduate for this dataset?
- Find a 95% prediction interval for the person in Problem 7 (the high school graduate with 3 years of experience and no management responsibilities).
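The dummy coding asked for in Part C can be sketched as below: a three-level educ variable becomes two 0/1 columns, and the coefficient on each dummy is the predicted salary gap relative to the baseline level. The level labels, the choice of high school as the baseline, the function names, and the coefficient values are all assumptions for illustration. The numbers are placeholders, not estimates from salary-survey.txt.

```python
LEVELS = ("highschool", "college", "advanced")  # assumed labels for educ

def educ_dummies(level, baseline="highschool"):
    """Return one 0/1 indicator per non-baseline level of educ."""
    if level not in LEVELS:
        raise ValueError(f"unknown educ level: {level!r}")
    return tuple(1 if level == other else 0
                 for other in LEVELS if other != baseline)

def predict(coefs, exper, level, mgt):
    """Predicted salary; coefs = (b0, b_exper, b_college, b_advanced, b_mgt)."""
    b0, b_exper, b_college, b_advanced, b_mgt = coefs
    d_college, d_advanced = educ_dummies(level)
    return (b0 + b_exper * exper
            + b_college * d_college + b_advanced * d_advanced
            + b_mgt * mgt)

# Placeholder coefficients (hypothetical, NOT fitted from salary-survey.txt).
coefs = (100.0, 5.0, 30.0, 40.0, 70.0)

# Gap between a college graduate and a high school graduate, all else equal.
gap = predict(coefs, 3, "college", 0) - predict(coefs, 3, "highschool", 0)
```

Because the two people differ only in the college dummy, `gap` equals the college coefficient regardless of the experience and management values used. The same reasoning compares advanced degree and college through the difference of the two educ coefficients.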