External Model Validation
- To ensure that a regression model is accurate for prediction, it is usually best to check it using external data that was not used to fit the model.
This is called external model validation.
- Here are three techniques for external model validation:
- Collecting New Data for Prediction: After the regression model has been fit to the original dataset with indices 1, 2, ..., n, collect new data points n+1, n+2, ..., n+m. Compute
SST = \sum_{i=n+1}^{n+m} (y_i - \bar{y})^2, where \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i,
and
SSE = \sum_{i=n+1}^{n+m} (y_i - \hat{y}_i)^2,
where \hat{y}_i is the value of y_i predicted by the fitted model.
Then define the R-squared for prediction as R^2_{predict} = 1 - SSE/SST.
See Page 316 of Mendenhall and Sincich for more explanation.
A minimum of 15 to 20 new observations is required to test the validity of the model.
(You can also withhold 15 to 20 observations to form the test set and use the
remaining observations as the training set.) A code sketch of this computation appears as the first sketch after this list.
- Data Splitting, also called Cross Validation: Split the data into k subsets. Fit a regression model to each subset and use
the rest of the data to compute the R-squared for prediction. If the k models
agree, one can be confident that the model is valid for data beyond the dataset used
to fit it. If the models diverge, you have k different models. (See the second sketch after this list.)
- Jackknifing: Sometimes there is not enough data to partition the dataset into k subsets, even when k = 2.
Let \hat{y}_{(i)} denote the predicted value of y_i from the regression model fit to the dataset with the ith
point deleted. Calculate SSE and SST from the terms y_i - \hat{y}_{(i)} and y_i - \bar{y}, respectively:
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2 and SST = \sum_{i=1}^{n} (y_i - \bar{y})^2.
The R-squared for the jackknife is defined as 1 - SSE/SST.
The SSE computed in this manner is called the PRESS statistic: the predicted
residual error sum of squares. (See the third sketch after this list.)
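To make the first technique concrete, here is a minimal sketch in Python using only numpy; the data, coefficients, and sample sizes are hypothetical illustrations, not taken from the source. It fits ordinary least squares to the original n observations and computes the R-squared for prediction on m new ones.

```python
import numpy as np

# Hypothetical data (not from the source): a linear model with two predictors.
rng = np.random.default_rng(0)
n, m = 40, 20
X = rng.normal(size=(n + m, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n + m)

# Fit ordinary least squares to the original n observations only.
X_train = np.column_stack([np.ones(n), X[:n]])
beta, *_ = np.linalg.lstsq(X_train, y[:n], rcond=None)

# Predict the m new observations n+1, ..., n+m.
X_new = np.column_stack([np.ones(m), X[n:]])
y_hat = X_new @ beta

# R-squared for prediction; y_bar is the mean of the original n responses.
y_bar = y[:n].mean()
SSE = np.sum((y[n:] - y_hat) ** 2)
SST = np.sum((y[n:] - y_bar) ** 2)
print(f"R^2 for prediction: {1 - SSE / SST:.3f}")
```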
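Next, a sketch of data splitting as described above: fit a model to each of the k subsets and score it on the rest of the data. The agreement check here is informal, just printing each subset's coefficients and R-squared for prediction (again on made-up data).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 90, 3
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Randomly split the row indices into k subsets.
indices = rng.permutation(n)
folds = np.array_split(indices, k)

for j, fold in enumerate(folds, start=1):
    rest = np.setdiff1d(indices, fold)
    # Fit OLS to this subset ...
    X_fit = np.column_stack([np.ones(fold.size), X[fold]])
    beta, *_ = np.linalg.lstsq(X_fit, y[fold], rcond=None)
    # ... and compute the R-squared for prediction on the rest of the data.
    X_rest = np.column_stack([np.ones(rest.size), X[rest]])
    SSE = np.sum((y[rest] - X_rest @ beta) ** 2)
    SST = np.sum((y[rest] - y[fold].mean()) ** 2)
    print(f"subset {j}: beta = {np.round(beta, 2)}, R^2_predict = {1 - SSE / SST:.3f}")
```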
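Finally, a sketch of the jackknife/PRESS computation, refitting the model n times with the ith point deleted. The naive loop below is a direct translation of the definition; for ordinary least squares the same deleted residuals can also be obtained without refitting as e_i / (1 - h_{ii}), where the h_{ii} are the leverages (diagonal of the hat matrix).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
X_design = np.column_stack([np.ones(n), X])   # design matrix with intercept

# Jackknife predictions: refit the model with the ith point deleted.
y_hat_del = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X_design[keep], y[keep], rcond=None)
    y_hat_del[i] = X_design[i] @ beta

SSE = np.sum((y - y_hat_del) ** 2)        # this SSE is the PRESS statistic
SST = np.sum((y - y.mean()) ** 2)
print(f"PRESS = {SSE:.2f}, jackknife R^2 = {1 - SSE / SST:.3f}")
```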