The Dummy Variable Trap
and the matrix regression equation y = X β + ε would look like this:
[ 12.4 ]   [ 1 1 0 0 ]
[ 10.7 ]   [ 1 1 0 0 ] [ beta0  ]
[  9.1 ] = [ 1 0 1 0 ] [ alpha1 ] + epsilon
[ 11.5 ]   [ 1 0 1 0 ] [ alpha2 ]
[  8.5 ]   [ 1 0 0 1 ] [ alpha3 ]
[ 11.8 ]   [ 1 0 0 1 ]
Notice that the first column of the X matrix (the intercept column) equals the sum of the three remaining columns, so the correlation between column 1 and a linear combination of columns 2, 3, and 4 is exactly 1. This is perfect multicollinearity (the variance inflation factor is infinite): the matrix XᵀX is singular and cannot be inverted, so we cannot solve for the least-squares estimate β̂.
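The singularity is easy to verify numerically. Here is a minimal sketch in Python with numpy (an assumption; the article itself works in SAS):

```python
# Build the 6x4 design matrix X from the equation above and check
# that X'X is singular (numpy is assumed; the article uses SAS).
import numpy as np

X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # rank 3, but XtX is 4x4
print(abs(np.linalg.det(XtX)))      # (near) zero: not invertible
```

Because the rank (3) is less than the dimension (4), no inverse of XᵀX exists.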
Another way to look at it is this. With just two dummy variables, dummy_a and dummy_b, we already have complete information about the categorical effects. If the center is A or B, the corresponding dummy variable is one; if the center is C, both dummy variables are zero. Adding a third dummy variable for center C would not only be unnecessary, it would be harmful, because it would introduce perfect multicollinearity.
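This redundancy can be checked directly. A minimal sketch, again assuming numpy rather than SAS:

```python
# With three mutually exclusive centers, dummy_c carries no new
# information: it is fully determined by the intercept column and
# the other two dummies (numpy assumed; the article uses SAS).
import numpy as np

dummy_a   = np.array([1, 1, 0, 0, 0, 0])
dummy_b   = np.array([0, 0, 1, 1, 0, 0])
dummy_c   = np.array([0, 0, 0, 0, 1, 1])
intercept = np.ones(6)

# dummy_c equals intercept - dummy_a - dummy_b for every observation.
print(np.array_equal(dummy_c, intercept - dummy_a - dummy_b))   # True
```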
Try out this SAS code:
data wloss;
input dummy_a dummy_b dummy_c loss;
datalines;
1 0 0 12.4
1 0 0 10.7
0 1 0 9.1
0 1 0 11.5
0 0 1 8.5
0 0 1 11.8
;
proc reg data=wloss;
model loss = dummy_a dummy_b dummy_c;
run;
quit;
You will receive this note in the SAS log:
NOTE: Model is not full rank. Least-squares solutions for the parameters
are not unique. Some statistics will be misleading. A reported DF
of 0 or B means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables
are a linear combination of other variables as shown.
dummy_c = Intercept - dummy_a - dummy_b
In other words, SAS drops dummy_c from the model by setting its estimate to zero,
which is what we should have done in the first place.
There are other methods for avoiding multicollinearity, but this method of using only
two dummy variables for three levels is simple and the most commonly used.
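For comparison, the full-rank two-dummy model can also be fit outside SAS. This sketch assumes numpy; with center C as the reference level, the intercept is the mean of center C and the dummy coefficients are the A-minus-C and B-minus-C mean differences:

```python
# Fit the full-rank model loss = beta0 + alpha1*dummy_a + alpha2*dummy_b
# by least squares (numpy assumed; the article uses PROC REG).
import numpy as np

loss    = np.array([12.4, 10.7, 9.1, 11.5, 8.5, 11.8])
dummy_a = np.array([1, 1, 0, 0, 0, 0])
dummy_b = np.array([0, 0, 1, 1, 0, 0])

# Design matrix: intercept plus only two dummies for three centers.
X = np.column_stack([np.ones(6), dummy_a, dummy_b])
beta, *_ = np.linalg.lstsq(X, loss, rcond=None)

# beta[0] is the mean for center C; beta[1] and beta[2] are the
# differences of the A and B means from the C mean.
print(beta)
```

Here beta works out to the center-C mean 10.15 with offsets 1.4 (A) and 0.15 (B), matching the group means of the data.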