The Dummy Variable Trap
and the matrix regression equation y = X β + ε would look like this:
[ 12.4 ]   [ 1 1 0 0 ]
[ 10.7 ]   [ 1 1 0 0 ] [ beta0  ]
[  9.1 ] = [ 1 0 1 0 ] [ alpha1 ] + epsilon
[ 11.5 ]   [ 1 0 1 0 ] [ alpha2 ]
[  8.5 ]   [ 1 0 0 1 ] [ alpha3 ]
[ 11.8 ]   [ 1 0 0 1 ]
Notice that the first column of the X matrix (the intercept column) equals the sum of the three remaining columns, so the correlation between column 1 and a linear combination of columns 2, 3, and 4 is exactly 1. This is perfect multicollinearity (the variance inflation factor is infinite): the matrix XᵀX is singular and cannot be inverted, so we cannot solve for the least-squares estimate β̂.
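The singularity is easy to verify numerically. Here is a minimal sketch in Python with numpy (an assumption; the article itself works in SAS):

```python
# Build the 6x4 design matrix X from the equation above and check
# that X'X is singular (numpy is assumed; the article uses SAS).
import numpy as np

X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # rank 3, but XtX is 4x4
print(abs(np.linalg.det(XtX)))      # (near) zero: not invertible
```

Because the rank (3) is less than the dimension (4), no inverse of XᵀX exists.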
Another way to look at it is this. With just two dummy variables, dummy_a and dummy_b, we already have complete information about the categorical effects. If the center is A or B, the corresponding dummy variable is one; if the center is C, both dummy variables are zero. Adding a third dummy variable for center C would not only be unnecessary, it would be harmful, because it would introduce perfect multicollinearity.
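This redundancy can be checked directly. A minimal sketch, again assuming numpy rather than SAS:

```python
# With three mutually exclusive centers, dummy_c carries no new
# information: it is fully determined by the intercept column and
# the other two dummies (numpy assumed; the article uses SAS).
import numpy as np

dummy_a   = np.array([1, 1, 0, 0, 0, 0])
dummy_b   = np.array([0, 0, 1, 1, 0, 0])
dummy_c   = np.array([0, 0, 0, 0, 1, 1])
intercept = np.ones(6)

# dummy_c equals intercept - dummy_a - dummy_b for every observation.
print(np.array_equal(dummy_c, intercept - dummy_a - dummy_b))   # True
```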
Try out this SAS code:
data wloss;
input dummy_a dummy_b dummy_c loss;
datalines;
1 0 0 12.4
1 0 0 10.7
0 1 0 9.1
0 1 0 11.5
0 0 1 8.5
0 0 1 11.8
;
proc reg data=wloss;
model loss = dummy_a dummy_b dummy_c;
run;
quit;
You will receive this note in the SAS log:
NOTE: Model is not full rank. Least-squares solutions for the parameters
are not unique. Some statistics will be misleading. A reported DF
of 0 or B means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables
are a linear combination of other variables as shown.
dummy_c = Intercept - dummy_a - dummy_b
In other words, SAS drops dummy_c from the model by setting its estimate to zero,
which is what we should have done in the first place.
There are other methods for avoiding multicollinearity, but this method of using only
two dummy variables for three levels is simple and the most commonly used.
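For comparison, the full-rank two-dummy model can also be fit outside SAS. This sketch assumes numpy; with center C as the reference level, the intercept is the mean of center C and the dummy coefficients are the A-minus-C and B-minus-C mean differences:

```python
# Fit the full-rank model loss = beta0 + alpha1*dummy_a + alpha2*dummy_b
# by least squares (numpy assumed; the article uses PROC REG).
import numpy as np

loss    = np.array([12.4, 10.7, 9.1, 11.5, 8.5, 11.8])
dummy_a = np.array([1, 1, 0, 0, 0, 0])
dummy_b = np.array([0, 0, 1, 1, 0, 0])

# Design matrix: intercept plus only two dummies for three centers.
X = np.column_stack([np.ones(6), dummy_a, dummy_b])
beta, *_ = np.linalg.lstsq(X, loss, rcond=None)

# beta[0] is the mean for center C; beta[1] and beta[2] are the
# differences of the A and B means from the C mean.
print(beta)
```

Here beta works out to the center-C mean 10.15 with offsets 1.4 (A) and 0.15 (B), matching the group means of the data.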