
IT 224 -- May 8, 2024

Review Exercises

  1. The National Science Foundation collects data on research and development spending by universities and colleges in the U.S. (Moore, McCabe, Craig, Introduction to the Practice of Statistics, W.H. Freeman, 2014). Here are the data for three years:
    Obs:    1    2    3
    Year: 2003 2006 2009
    Spending: 40.1 47.8 54.9

    Spending is in billions of dollars. These data values are slightly different from the values used in class on May 8. Answer these questions:
    1. Create a scatterplot of Spending vs. Year. Answer:
      > year <- c(2003, 2006, 2009)
      > spending <- c(40.1, 47.8, 54.9)
      > plot(year, spending, xlab="Year", ylab="Spending")
      
      [Figure: Scatterplot of Spending by Year]
    2. Use R to obtain $\bar{x}$, $\bar{y}$, $SD^+_x$, $SD^+_y$, and $r_{xy}$. Answer:
      > year
      [1] 2003 2006 2009
      > spending
      [1] 40.1 47.8 54.9
      > mean(year)
      [1] 2006
      > mean(spending)
      [1] 47.6
      > sd(year)
      [1] 3
      > sd(spending)
      [1] 7.402027
      > cor(year, spending)
      [1] 0.9997262
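
As a cross-check, the same statistics can be computed from their definitions. This is a minimal sketch, assuming SD+ denotes the sample standard deviation with divisor n - 1, which is what the R sd function computes. The names n, sd_x, sd_y, and r_xy are illustrative and are not part of the original exercise.

```r
year <- c(2003, 2006, 2009)
spending <- c(40.1, 47.8, 54.9)
n <- length(year)

# Sample standard deviations with divisor n - 1 (assumed meaning of SD+)
sd_x <- sqrt(sum((year - mean(year))^2) / (n - 1))
sd_y <- sqrt(sum((spending - mean(spending))^2) / (n - 1))

# Correlation: sum of products of deviations, scaled by (n - 1) * sd_x * sd_y
r_xy <- sum((year - mean(year)) * (spending - mean(spending))) /
  ((n - 1) * sd_x * sd_y)

# Compare the hand computations with the built-in functions
all.equal(sd_x, sd(year))
all.equal(sd_y, sd(spending))
all.equal(r_xy, cor(year, spending))
```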
      
    3. Obtain the regression equation by hand using the statistics from part 2. When performing hand calculations, you can use R as a calculator. Verify your answer with the R lm function. Answer:
      The regression equation is
            $y - \bar{y} = (r \cdot SD^+_y / SD^+_x)(x - \bar{x})$
            $y - 47.6 = (0.9997262 \cdot 7.402027 / 3)(x - 2006)$
            $y - 47.6 = 2.466667 \cdot (x - 2006)$
            $y = 2.466667 x - 2.466667 \cdot 2006 + 47.6$
            $y = 2.466667 x - 4900.534$
      To compute the regression equation using R, we first need to create a data frame containing the data. Then we use the lm function to obtain the regression model:
      > df <- data.frame(x=year, y=spending)
      > model <- lm(y ~ x, data=df)
      > print(model)
      
      Call:
      lm(formula = y ~ x, data = df)
      
      Coefficients:
      (Intercept)            x  
        -4900.533        2.467  
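
The hand calculation above can also be carried out entirely in R. The following is a minimal sketch of that check; the variable names slope and intercept are illustrative, and the model is fitted directly on the vectors rather than on a data frame.

```r
year <- c(2003, 2006, 2009)
spending <- c(40.1, 47.8, 54.9)

# Slope: r * SD+y / SD+x
slope <- cor(year, spending) * sd(spending) / sd(year)

# Intercept: the regression line passes through (mean(year), mean(spending))
intercept <- mean(spending) - slope * mean(year)

# Compare with the coefficients fitted by lm
model <- lm(spending ~ year)
all.equal(unname(coef(model)), c(intercept, slope))
```

Keeping the slope at full precision matters here: it is multiplied by years near 2000, so rounding it to three decimals shifts the intercept noticeably.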
      
    4. Compute the predicted values by hand. Check your answer using this R function call:
      > pred <- predict(model)
      
      Use the model obtained in part 3, keeping the unrounded coefficients:
      $\hat{y}_1 = 2.466667 \cdot x_1 - 4900.534 = 2.466667 \cdot 2003 - 4900.534 = 40.2$
      $\hat{y}_2 = 2.466667 \cdot x_2 - 4900.534 = 2.466667 \cdot 2006 - 4900.534 = 47.6$
      $\hat{y}_3 = 2.466667 \cdot x_3 - 4900.534 = 2.466667 \cdot 2009 - 4900.534 = 55.0$
      We can perform this calculation in one line using R:
      > 2.466667 * c(2003, 2006, 2009) - 4900.534
      [1] 40.2 47.6 55.0
      Check your answer using the R predict function, which obtains the predicted values from the model:
      > predict(model)
         1    2    3 
      40.2 47.6 55.0
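
To avoid retyping rounded coefficients, the predictions can be rebuilt from the fitted model itself. This is a sketch, assuming the same data frame and model as above; b and pred_hand are illustrative names.

```r
df <- data.frame(x = c(2003, 2006, 2009), y = c(40.1, 47.8, 54.9))
model <- lm(y ~ x, data = df)

# coef() returns a named vector with entries "(Intercept)" and "x"
b <- coef(model)
pred_hand <- b[["(Intercept)"]] + b[["x"]] * df$x

# predict() with no new data returns the fitted values for the original x
all.equal(unname(predict(model)), pred_hand)
```

Plugging in the slope as printed by print(model), 2.467, would give visibly different predictions, because the rounding error is multiplied by x values near 2000.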
      
    5. Compute the residuals, which are defined as $\hat{e}_i = y_i - \hat{y}_i$:
            $\hat{e}_1 = y_1 - \hat{y}_1 = 40.1 - 40.2 = -0.1$
            $\hat{e}_2 = y_2 - \hat{y}_2 = 47.8 - 47.6 = 0.2$
            $\hat{e}_3 = y_3 - \hat{y}_3 = 54.9 - 55.0 = -0.1$
      Check your answer using the R resid function, which computes the residuals from the model. Answer:
      > resid(model)
         1    2    3 
      -0.1  0.2 -0.1 
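
The definition of the residuals can be checked directly against resid. This is a small sketch under the same data and model as above; res_hand is an illustrative name.

```r
df <- data.frame(x = c(2003, 2006, 2009), y = c(40.1, 47.8, 54.9))
model <- lm(y ~ x, data = df)

# Residual = observed value minus predicted value
res_hand <- df$y - predict(model)
all.equal(res_hand, resid(model))

# For a least-squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding)
sum(resid(model))
```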
      
    6. Create the residual plot of residuals vs. predicted values. Answer:
      > plot(predict(model), resid(model), 
      + xlab="Predicted Values", ylab="Residuals", 
      + main="Residual Plot")
      
      [Figure: Residual Plot]
    7. Compute the normal scores by hand (n=3).
      Answer: the normal scores when n=3 divide the area under the standard normal density into 4 equal areas of 25% each. The z-scores that do this are -0.67, 0.00, and 0.67. They can be found using the R qnorm function like this:
      > qnorm(c(0.25, 0.5, 0.75))
      [1] -0.6744898  0.0000000  0.6744898
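
The same idea extends to any sample size: n normal scores cut the standard normal density into n + 1 equal areas. The helper below is a sketch of that rule; normal_scores is a hypothetical name, not a built-in R function.

```r
# Normal scores for a sample of size n, using cumulative areas
# i / (n + 1) for i = 1, ..., n
normal_scores <- function(n) {
  qnorm((1:n) / (n + 1))
}

normal_scores(3)   # cumulative areas 0.25, 0.50, 0.75
normal_scores(4)   # cumulative areas 0.20, 0.40, 0.60, 0.80
```

Note that qqnorm positions its points using ppoints(n) rather than i / (n + 1), so the theoretical quantiles on its horizontal axis will generally not match these hand-computed scores exactly.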
      
    8. Create a normal plot for the residuals. Answer:
      > qqnorm(resid(model), main="Normal Plot of Residuals")
      
      [Figure: Normal Plot of Residuals]
  2. What is the difference between the true regression equation and the estimated regression equation?
    Answer: recall the ideal measurement model: $x_i = \mu + e_i$ (actual measurement = true value + random error). The true value $\mu$ is usually unknown and is estimated by $\hat{\mu} = \bar{x}$.

    The true regression equation is $y_i = a x_i + b$, where the slope $a$ and the intercept $b$ are unknown and must be estimated by $\hat{a}$ and $\hat{b}$, which are determined from the estimated regression equation:
          $y - \bar{y} = (\hat{r} \cdot SD^+_y / SD^+_x)(x - \bar{x})$
    which is rewritten as
          $y = \hat{a} x + \hat{b}$.
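
The distinction can be illustrated with a small simulation. This is only a sketch: the true slope and intercept (a_true, b_true), the error standard deviation, the sample size, and the seed are all made-up values chosen for illustration. It assumes that observations are generated from the true line plus random error, in the spirit of the ideal measurement model.

```r
set.seed(224)          # arbitrary seed so the run is repeatable

a_true <- 2.5          # hypothetical true slope (unknown in practice)
b_true <- -10          # hypothetical true intercept (unknown in practice)

x <- 1:50
# Observed y = value on the true line + random error
y <- a_true * x + b_true + rnorm(length(x), mean = 0, sd = 1)

# The estimated regression equation comes from the observed data only
fit <- lm(y ~ x)
a_hat <- coef(fit)[["x"]]
b_hat <- coef(fit)[["(Intercept)"]]

# Estimates are close to, but not equal to, the true values
c(a_true = a_true, a_hat = a_hat, b_true = b_true, b_hat = b_hat)
```

Rerunning with a different seed gives different estimates, while the true values stay fixed.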

Project 3

The Regression Fallacy

Additional Regression Problem

We will work this problem in class on Monday, May 13.