Examining Regression for Multicollinearity Using Stata
Multicollinearity is a ubiquitous problem in regression analysis. We need to check for it routinely in every model we run, since its presence inflates the standard errors of the coefficients and can lead us to erroneous inferences in our hypothesis tests.
We will use the Presidential Approval data set from last year's final for this analysis: Presapp.dta
We will follow these steps:
- Examine the correlation matrix for the data
- Run the regression analysis
- Examine the betas
- Compare the t values and the F statistic
- Examine the Variance Inflation Factors (VIFs)
Examine the correlation matrix for the data
Correlations are easy to obtain. Get a correlation matrix of the entire data set with the following command, where the variable list year-cpi selects every variable from year through cpi in data set order:
cor year-cpi
Which looks like this
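Since the regression in this handout uses only unemrate, realgnp and cpi as predictors, it can help to look at the correlations among just those three. The following is a minimal sketch, assuming Presapp.dta sits in the current working directory (the file location is an assumption, not something stated above).

```stata
* Load the data (assumes Presapp.dta is in the working directory)
use Presapp.dta, clear

* Correlation matrix for the three predictors only
correlate unemrate realgnp cpi

* Same pairwise correlations, with significance levels printed underneath
pwcorr unemrate realgnp cpi, sig
```

High correlations between pairs of predictors are a first warning sign, although multicollinearity can also arise from combinations of three or more variables that a pairwise matrix will not reveal.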
Run the regression analysis
The format is straightforward: regress dependentVar independentVars (Stata commands are typed in lowercase).
regress approval unemrate realgnp cpi
Some supplementary statistics are also available. For example,
regress approval unemrate realgnp cpi, b
produces betas (standardized regression coefficients) in place of the confidence intervals.
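For the step of comparing the t values with the F statistic, the numbers can be pulled from the results that regress stores. The sketch below assumes Presapp.dta is already in memory. A significant overall F alongside individually insignificant t values is a common symptom of multicollinearity.

```stata
* Assumes Presapp.dta is already loaded
regress approval unemrate realgnp cpi

* Overall F statistic and its p-value from the stored results
display "F = " e(F) "   p = " Ftail(e(df_m), e(df_r), e(F))

* t statistic for each predictor: coefficient divided by its standard error
foreach v in unemrate realgnp cpi {
    display "`v': t = " _b[`v'] / _se[`v']
}
```

These are the same values shown in the regression table; printing them together simply makes the comparison easier.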
Examine the Variance Inflation Factors (VIFs)
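After running the regression, simply type
vif
at the command line to get the following results. In recent versions of Stata the same command is also available as estat vif.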
Note that the 1/VIF column is the Tolerance.
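The relationship between the two columns can be written out explicitly. In the notation below (introduced here for illustration), R_j^2 is the R-squared obtained by regressing predictor j on all the other predictors in the model.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{Tolerance}_j = \frac{1}{\mathrm{VIF}_j} = 1 - R_j^2
```

For example, if the other predictors explain 90% of the variance in predictor j, then R_j^2 = 0.90, the VIF is 1/(1 - 0.90) = 10, and the tolerance is 0.10.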
The VIF ranges from 1.0 to infinity. VIFs greater than 10.0 are generally seen as indicative of severe multicollinearity. Tolerance ranges from 0.0 to 1.0, with 1.0 indicating the absence of multicollinearity.
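The figures reported by vif can be checked by hand for any one predictor. This is a minimal sketch, assuming Presapp.dta is already in memory: regress the predictor on the remaining predictors and convert the resulting R-squared.

```stata
* Assumes Presapp.dta is already loaded
* Auxiliary regression: one predictor regressed on the other two
regress unemrate realgnp cpi

* e(r2) holds the R-squared of the regression just run
* VIF = 1/(1 - R-squared); tolerance = 1 - R-squared
display "VIF for unemrate       = " 1/(1 - e(r2))
display "Tolerance for unemrate = " 1 - e(r2)
```

Repeating this with realgnp or cpi as the dependent variable should reproduce the remaining rows of the vif table.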
Now, using a 50-state data set, try looking at the following model:
regress rate96 urb96 urbrnk96 emprat96 emprnk96, b
where rate is the crime rate, urb is urbanization, and emp is the percentage of the workforce employed. The rnk suffix indicates the state's rank on that variable, rather than the raw data.
How does multicollinearity affect this model?
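One way to approach the question is to repeat the same diagnostics used for the approval model. The sketch below assumes the 50-state data set is already in memory; its file name is not given here, so no use command is shown.

```stata
* Assumes the 50-state data set is already loaded

* Correlations among the predictors: each raw measure is paired with its own rank
correlate urb96 urbrnk96 emprat96 emprnk96

* The model, with standardized coefficients
regress rate96 urb96 urbrnk96 emprat96 emprnk96, beta

* Variance inflation factors and tolerances (estat vif in newer Stata versions)
vif
```

Pay particular attention to each raw variable and its rank counterpart, since both are built from the same underlying quantity.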