Examining Regression for Multicollinearity Using Stata


Multicollinearity is a ubiquitous problem in regression analysis. It is something we need to examine routinely in every model we run, since its presence inflates standard errors and can lead us to make erroneous inferences from our hypothesis tests.

We will use the Presidential Approval data set from last year's final for this analysis: Presapp.dta

We will follow three steps: examine the correlation matrix, run the regression, and examine the variance inflation factors (VIFs).

Examine the correlation matrix for the data

Correlations are easy to obtain. Get a correlation matrix of the entire data set with the following command (cor abbreviates correlate, and year-cpi specifies every variable from year through cpi):

cor year-cpi

Stata prints the matrix of pairwise correlations for every variable from year through cpi.
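As a check on what the cor command reports, the Pearson correlation for any pair of variables can be computed by hand. The sketch below is illustrative Python (not Stata), using made-up values for unemrate and cpi:

```python
# Pearson correlation: covariance of x and y divided by the product
# of their standard deviations. This is what each cell of Stata's
# correlation matrix contains.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical values, for illustration only:
unemrate = [4.5, 5.1, 6.8, 7.2, 5.9]
cpi = [88.0, 91.2, 96.5, 99.1, 94.0]
print(round(pearson_r(unemrate, cpi), 3))
```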

Run the Regression analysis

The format is straightforward: regress dependentVar independentVars

regress approval unemrate realgnp cpi
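To see what regress is estimating, here is an illustrative Python sketch (not Stata) of ordinary least squares in the one-predictor case, where the slope and intercept have a simple closed form. The data values are hypothetical:

```python
# One-predictor OLS: slope b = cov(x, y) / var(x), intercept a = ybar - b * xbar.
def ols_simple(y, x):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Hypothetical approval and unemployment-rate values:
approval = [60.0, 55.0, 48.0, 45.0, 52.0]
unemrate = [4.5, 5.1, 6.8, 7.2, 5.9]
a, b = ols_simple(approval, unemrate)
print(round(a, 3), round(b, 3))
```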

Some supplementary statistics are also available

regress approval unemrate realgnp cpi, beta

produces betas (standardized regression coefficients).
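A standardized coefficient is just the raw slope rescaled by the ratio of standard deviations, beta = b * (s_x / s_y). An illustrative Python sketch (not Stata), with hypothetical numbers:

```python
# Standardized ("beta") coefficient: the raw slope b rescaled so that it is
# expressed in standard-deviation units of both x and y.
def standardized_beta(b, s_x, s_y):
    return b * (s_x / s_y)

# Hypothetical numbers: a raw slope of -2.0 approval points per point of
# unemployment, with s_x = 1.5 and s_y = 10.0.
print(standardized_beta(-2.0, 1.5, 10.0))  # -0.3
```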

Examine the Variance Inflation Factors (VIFs)

Simply type

vif

in the command line after running the regression; Stata reports a VIF and 1/VIF for each predictor, plus the mean VIF.

Note that the 1/VIF column is the Tolerance.

The VIF for predictor j is computed as 1/(1 - Rj^2), where Rj^2 is the R-squared from regressing predictor j on all the other predictors. The VIF ranges from 1.0 to infinity, and VIFs greater than 10.0 are generally seen as indicative of severe multicollinearity. Tolerance ranges from 0.0 to 1.0, with 1.0 indicating the complete absence of multicollinearity.
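These quantities are easy to compute once you have Rj^2. An illustrative Python sketch (not Stata):

```python
# VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j
# on the other predictors; it grows without bound as R²_j approaches 1.
def vif(r_squared_j):
    return 1.0 / (1.0 - r_squared_j)

# Tolerance is 1/VIF: the share of predictor j's variance NOT explained
# by the other predictors.
def tolerance(r_squared_j):
    return 1.0 - r_squared_j

# If 90% of a predictor's variance is explained by the others (R² = 0.90):
print(round(vif(0.90), 6))        # 10.0 -- right at the usual danger threshold
print(round(tolerance(0.90), 6))  # 0.1
```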

Now, using a 50-state data set, try looking at the following model:

regress rate96 urb96 urbrnk96 emprat96 emprnk96, beta

where rate is the crime rate, urb is urbanization, and emp is the percentage of the workforce employed. The rnk suffix denotes the state's rank rather than the raw value.

How does multicollinearity affect this model?