External Model Validation
- To ensure that a regression model is accurate for prediction, it is usually best to check it using external data that was not used to fit the model.
This is called external model validation.
- Here are three techniques for external model validation:
- Collecting New Data for Prediction: After the regression model has been fit to the original dataset with indices 1, 2, ..., n, collect new data points n+1, n+2, ..., n+m. Compute
SST = \sum_{i=n+1}^{n+m} (y_i - \bar{y})^2, where \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i,
and
SSE = \sum_{i=n+1}^{n+m} (y_i - \hat{y}_i)^2,
where \hat{y}_i is the value of y_i predicted by the fitted model.
Then define the R-squared for prediction as R^2_{predict} = 1 - SSE/SST.
See Page 316 of Mendenhall and Sincich for more explanation.
A minimum of 15 to 20 new observations is required to test the validity of the model.
(You can also withhold 15 to 20 observations to form the test set and use the
remaining observations as the training set.) A code sketch of this computation appears as the first sketch after this list.
- Data Splitting, also called Cross Validation: Split the data into k subsets. Fit a regression model to each subset and use
the rest of the data to compute the R-squared for prediction. If the k models
agree, one can be confident that the model is valid for data beyond the dataset used
to fit it. If the models diverge, you have k different models. (See the second sketch after this list.)
- Jackknifing: Sometimes there is not enough data to partition the dataset into k subsets, even when k = 2.
Let \hat{y}_{(i)} denote the predicted value of y_i from the regression model fit to the dataset with the ith
point deleted. Calculate SSE and SST from the terms y_i - \hat{y}_{(i)} and y_i - \bar{y}, respectively:
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2 and SST = \sum_{i=1}^{n} (y_i - \bar{y})^2.
The R-squared for the jackknife is defined as 1 - SSE/SST.
The SSE computed in this manner is called the PRESS statistic: the predicted
residual error sum of squares. (See the third sketch after this list.)
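To make the first technique concrete, here is a minimal sketch in Python using only numpy; the data, coefficients, and sample sizes are hypothetical illustrations, not taken from the source. It fits ordinary least squares to the original n observations and computes the R-squared for prediction on m new ones.

```python
import numpy as np

# Hypothetical data (not from the source): a linear model with two predictors.
rng = np.random.default_rng(0)
n, m = 40, 20
X = rng.normal(size=(n + m, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n + m)

# Fit ordinary least squares to the original n observations only.
X_train = np.column_stack([np.ones(n), X[:n]])
beta, *_ = np.linalg.lstsq(X_train, y[:n], rcond=None)

# Predict the m new observations n+1, ..., n+m.
X_new = np.column_stack([np.ones(m), X[n:]])
y_hat = X_new @ beta

# R-squared for prediction; y_bar is the mean of the original n responses.
y_bar = y[:n].mean()
SSE = np.sum((y[n:] - y_hat) ** 2)
SST = np.sum((y[n:] - y_bar) ** 2)
print(f"R^2 for prediction: {1 - SSE / SST:.3f}")
```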
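Next, a sketch of data splitting as described above: fit a model to each of the k subsets and score it on the rest of the data. The agreement check here is informal, just printing each subset's coefficients and R-squared for prediction (again on made-up data).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 90, 3
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Randomly split the row indices into k subsets.
indices = rng.permutation(n)
folds = np.array_split(indices, k)

for j, fold in enumerate(folds, start=1):
    rest = np.setdiff1d(indices, fold)
    # Fit OLS to this subset ...
    X_fit = np.column_stack([np.ones(fold.size), X[fold]])
    beta, *_ = np.linalg.lstsq(X_fit, y[fold], rcond=None)
    # ... and compute the R-squared for prediction on the rest of the data.
    X_rest = np.column_stack([np.ones(rest.size), X[rest]])
    SSE = np.sum((y[rest] - X_rest @ beta) ** 2)
    SST = np.sum((y[rest] - y[fold].mean()) ** 2)
    print(f"subset {j}: beta = {np.round(beta, 2)}, R^2_predict = {1 - SSE / SST:.3f}")
```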
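Finally, a sketch of the jackknife/PRESS computation, refitting the model n times with the ith point deleted. The naive loop below is a direct translation of the definition; for ordinary least squares the same deleted residuals can also be obtained without refitting as e_i / (1 - h_{ii}), where the h_{ii} are the leverages (diagonal of the hat matrix).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
X_design = np.column_stack([np.ones(n), X])   # design matrix with intercept

# Jackknife predictions: refit the model with the ith point deleted.
y_hat_del = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X_design[keep], y[keep], rcond=None)
    y_hat_del[i] = X_design[i] @ beta

SSE = np.sum((y - y_hat_del) ** 2)        # this SSE is the PRESS statistic
SST = np.sum((y - y.mean()) ** 2)
print(f"PRESS = {SSE:.2f}, jackknife R^2 = {1 - SSE / SST:.3f}")
```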