Warnings and Limitations when using Regression

Residual plots

The residuals are the differences between the observed values and the values predicted by the regression line (residual = observed y - predicted y). If the relationship really is linear (even with noise), the residuals should scatter randomly around zero, with roughly the same spread across the whole range of the data. Graphing the residuals against x helps us judge qualitatively how well the data follow a linear relationship.

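Even with a strong correlation, a data set whose residuals show a pattern (a curve, or a spread that changes with x) will predict y well in some parts of the range and poorly in others. In such cases another model of the relationship will often do a better job.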
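
As a minimal sketch of this idea, the Python below fits a least-squares line to made-up data that actually follow a curve (y = x squared) and lists the residuals. The data and the helper names fit_line and correlation are illustrative assumptions, not part of the original notes. The correlation comes out high (about 0.98), yet the residuals are positive at both ends and negative in the middle, which is the kind of pattern a residual plot is meant to reveal.

```python
# Illustrative, made-up data: y = x**2, a curved (non-linear) relationship.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def correlation(xs, ys):
    """Return the Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


slope, intercept = fit_line(xs, ys)

# residual = observed y - predicted y
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
r = correlation(xs, ys)

print(f"r = {r:.3f}")
for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.1f}")
```

Plotting these residuals against x would show a U shape rather than a random scatter around zero, suggesting that a straight line is the wrong model here despite the high correlation.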

Lurking variables

Sometimes it is tempting to say that variation in the x-variable causes the variation in the y-variable. A strong correlation does not by itself show this. Sometimes a lurking (hidden) variable influences both x and y and is the underlying cause.

Here's a data set with two variables: life expectancy and number of televisions per person. Does more TV increase life expectancy? What are the lurking variables?
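
The effect of a lurking variable can be sketched with synthetic numbers. In the Python below, everything is made up for illustration: six hypothetical countries, a hidden variable called wealth (one possible candidate, assumed here), and small fixed offsets standing in for noise. Both televisions per person and life expectancy are built from wealth alone, with no direct link between them.

```python
# Synthetic, made-up numbers for six hypothetical countries.
# "wealth" is an assumed hidden variable that drives both other columns.
wealth = [1, 2, 3, 4, 5, 6]
tv_offsets = [1, -1, 0, 0, -1, 1]      # fixed stand-in for noise
life_offsets = [1, 1, -2, -2, 1, 1]    # fixed stand-in for noise

tvs_per_person = [0.1 * w + 0.02 * e for w, e in zip(wealth, tv_offsets)]
life_expectancy = [50 + 4 * w + 0.5 * e for w, e in zip(wealth, life_offsets)]


def correlation(xs, ys):
    """Return the Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def residuals_after_fit(xs, ys):
    """Fit a least-squares line of ys on xs and return the residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Raw correlation between TVs and life expectancy: looks very strong.
r_raw = correlation(tvs_per_person, life_expectancy)

# Remove the part of each variable explained by wealth, then correlate
# what is left over.
tv_left = residuals_after_fit(wealth, tvs_per_person)
life_left = residuals_after_fit(wealth, life_expectancy)
r_adjusted = correlation(tv_left, life_left)

print(f"raw correlation:               {r_raw:.3f}")
print(f"after accounting for wealth:   {r_adjusted:.3f}")
```

In this constructed example the raw correlation is about 0.99, but once the hidden variable is accounted for, essentially nothing remains. Real data are never this clean; the sketch only shows how a hidden common cause can produce a strong correlation on its own.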

Outliers and influential observations

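Sometimes a single outlier in a data set is too influential: it pulls the regression line toward itself and noticeably changes the slope. If the observation is questionable (for example, a likely recording error), it should be removed from the data.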
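
One way to see how much a single point matters is to fit the line twice, with and without it. The Python sketch below does this on made-up data: five points lying exactly on y = 2x, plus one hypothetical outlier at (10, 0). The data and the helper name fit_line are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: five points exactly on y = 2x ...
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
# ... and one hypothetical outlier far from the rest.
outlier_x, outlier_y = 10, 0

slope_without, intercept_without = fit_line(xs, ys)
slope_with, intercept_with = fit_line(xs + [outlier_x], ys + [outlier_y])

print(f"without outlier: y = {slope_without:.2f} x + {intercept_without:.2f}")
print(f"with outlier:    y = {slope_with:.2f} x + {intercept_with:.2f}")
```

Here the single extra point is enough to turn a slope of 2 into a slightly negative slope. A large change like this when one observation is dropped is a sign that the observation is influential and deserves a closer look.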

Limited range of regression

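The range of effective predictions does not extend beyond the range of the observed data! A regression line describes the relationship only where data were observed; predictions outside that range (extrapolation) can be badly wrong.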
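
The following Python sketch illustrates the danger under stated assumptions. The "true" relationship, y = x(10 - x), is hypothetical and chosen only because it looks nearly linear on the observed range x = 1 to 4 and then bends away. The helper names fit_line, true_value, and predict are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_value(x):
    """Hypothetical true relationship, assumed for this illustration only."""
    return x * (10 - x)


# Observations are available only for x between 1 and 4.
observed_xs = [1, 2, 3, 4]
observed_ys = [true_value(x) for x in observed_xs]

slope, intercept = fit_line(observed_xs, observed_ys)


def predict(x):
    """Prediction from the fitted line."""
    return slope * x + intercept


# Inside the observed range the line is close to the data.
in_range_errors = [abs(predict(x) - y) for x, y in zip(observed_xs, observed_ys)]
print(f"largest error inside the range: {max(in_range_errors):.1f}")

# Far outside the observed range the same line is badly wrong.
x_far = 10
print(f"at x = {x_far}: predicted {predict(x_far):.1f}, actual {true_value(x_far)}")
```

Inside the observed range the line misses by at most 1 unit, but at x = 10 it predicts 55 where the assumed true value is 0. Nothing in the observed data warns of this, which is why predictions should stay within the range that was actually measured.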

Last modified: Sun Apr 17 23:03:17 Central Daylight Time 2005