Examining Regression Diagnostics in Stata:
The Case of the West Palm Beach Ballot
Regression diagnostics provide us with a relatively powerful tool for examining how individual cases in our
data sets influence the fit of our statistical models. We all know that outliers can occur which
simply do not fit the pattern in the rest of the data, and, less obviously, that certain cases may unduly
influence the fitted model. Regression diagnostics let us examine our data in light of each data point's
influence, or lack thereof, on the statistical fit of the model.
While data sets abound with influential observations or significant outliers, this instructional note will use the vote for Pat Buchanan in the State of Florida in the 2000 presidential election. This note does not test any particularly deep political questions, but simply seeks to assess the degree to which the infamous "butterfly ballot" affected the outcome of the vote in this highly contested presidential election.
There is a rapidly growing body of analysis on this question available. Please see Poly-Cy's Hot Topics page for links to many statistical analyses of this same question. The analysis provided here does not claim to be the definitive statistical model. In fact, the intent here is to simply use the data to explicate the use of regression diagnostics.
Using Stata to examine regression diagnostics is a relatively straightforward task. It requires a number of steps, but the process gives the analyst a feel for the data and how well the data fit the theory. Regression diagnostics help to strengthen an analysis by showing which data points do the most to make the case, and which weaken it the most. It is also quite possible for examination of the diagnostics to lead to theoretical insight. Once one knows which cases are influential, an understanding of why they are influential may follow.
The steps we will follow:
For this example we will use the Florida precinct totals for the 2000 Presidential election. This data is provided by Chris Caroll at Johns Hopkins University.
In case this data is no longer available, use the palmbeach.xls data set.
Select the entire data set, choose Copy, and then Paste this text data set into an Excel spreadsheet. Then use Text to Columns to convert the ASCII text to column-formatted data. This will require one step for the variable labels and a second for the votes. Also note that if you use tab-delimited text, the counties with two-word names (e.g. ST LUCIE) will need adjusting for column misplacement.
Convert the Excel spreadsheet to a Stata data set
This is easy. Use StatTransfer. Set the screen up as depicted below, using your correct folder and file names.

This data set is also available as a Stata data set - palmbeach.dta.
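For readers without StatTransfer, recent Stata releases can read the spreadsheet directly. The following is a minimal sketch, assuming the workbook is saved as palmbeach.xls in the current working directory and that its first row holds the variable names; adjust the file and folder names to match your own setup.

```stata
* Read the Excel workbook, taking variable names from the first row
* (file name and location are assumptions)
import excel using "palmbeach.xls", firstrow clear

* Check that the variables arrived as expected, then save a Stata data set
describe
save palmbeach, replace
```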
Open up the PalmBeach Stata data set you have created and run a simple regression analysis. Calculate the total number of votes cast by summing the votes for each of the candidates on the ballot. This variable will serve as a proxy for the total number of voters or population.
gen voters = bush + gore + nader + buchanan + nelson + mccollum + logan
Then regress Buchanan's vote on the vote for Perot in 1996 and the total number of voters
reg buchanan perot96 voters
Note that the total number of voters is not significant, but the vote for Perot in 1996 is quite significant.
Plot the predicted values and residuals
Now let's examine the data. First, plot the actual versus the predicted values
predict yhat
plot buchanan yhat
This provides us with an interesting visual inspection
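In current versions of Stata the same inspection can be done with a graphical plot. The sketch below is one way to do it, not the command used in this note. It assumes yhat has been created by predict as above, and that the data contain a string identifier named county (a hypothetical name; substitute whatever your data set uses).

```stata
* Actual against predicted Buchanan votes, labelled by county, with a
* 45-degree reference line along which actual equals predicted.
* "county" is an assumed name for the identifier variable.
twoway (scatter buchanan yhat, mlabel(county)) ///
       (line yhat yhat, sort), ///
       ytitle("Actual Buchanan vote") xtitle("Predicted Buchanan vote") ///
       legend(off)

* Residuals against fitted values from the most recent regression
rvfplot, yline(0)
```

The line continuations (///) require running this from a do-file rather than typing it at the command prompt.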

Examine some useful regression diagnostics.
Note that the relationship appears rather well fit except for one outlier - Palm Beach County. We need to assess the degree to which this visual inspection is supported by the regression diagnostics. We need to examine the diagnostics for both residuals and influence.
There are a number of regression diagnostics that are useful.
These can all be obtained by using the predict command after the regression. You must assign a variable name to store each diagnostic in. Note that Stata commands and options are case sensitive and must be typed in lower case.
predict CooksD, cooksd
predict diag2, leverage
predict diag3, rstudent
predict diag4, dfits
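For reference, the four quantities requested here have standard textbook definitions. The notation below is introduced for this summary and does not come from the note itself: e_i is the residual for case i, x_i is its row of the regressor matrix X (including the constant), s is the root mean squared error of the regression, s_(i) is the same quantity with case i deleted, and k is the number of estimated coefficients including the constant.

```latex
\begin{align*}
h_i &= x_i (X'X)^{-1} x_i' && \text{(leverage)} \\
r_i &= \frac{e_i}{s_{(i)}\sqrt{1-h_i}} && \text{(studentized residual)} \\
D_i &= \frac{e_i^2}{k\,s^2}\cdot\frac{h_i}{(1-h_i)^2} && \text{(Cook's distance)} \\
\mathrm{DFITS}_i &= r_i\sqrt{\frac{h_i}{1-h_i}} && \text{(DFITS)}
\end{align*}
```

Leverage depends only on the independent variables, the studentized residual measures how far a case lies from the fit, and Cook's distance and DFITS combine the two into measures of overall influence.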
One additional useful diagnostic is the DFBeta, which indicates the degree to which each case influences each coefficient. Thus there is a DFBeta for each case for each independent variable in the model.
dfbeta
The DFBeta command will assign storage locations by prefixing DF to the variable name.
After calculating these diagnostics, use the list command to examine them.
list CooksD diag2 diag3 diag4 DFperot96 DFvoters
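To narrow the listing to the cases that stand out, one option is to flag observations that exceed conventional rule-of-thumb cutoffs. The sketch below is not part of the original analysis. The cutoffs (4/n for Cook's distance, 2k/n for leverage, 2 for the absolute studentized residual, and 2*sqrt(k/n) for DFITS, with k counting the constant) are common conventions rather than formal tests, and county is an assumed name for the identifier variable.

```stata
* Run right after the regression and predict commands, so that
* e(N) and e(df_m) still describe that model.
local n = e(N)
local k = e(df_m) + 1

* Flag cases exceeding any of the conventional cutoffs.
* Missing values are excluded explicitly because Stata treats
* them as larger than any number.
gen byte flagged = (CooksD > 4/`n' | diag2 > 2*`k'/`n' | ///
    abs(diag3) > 2 | abs(diag4) > 2*sqrt(`k'/`n')) ///
    & !missing(CooksD, diag2, diag3, diag4)

* List the flagged cases, most influential first.
* "county" is an assumed name for the identifier variable.
gsort -CooksD
list county CooksD diag2 diag3 diag4 if flagged
```

Because gsort reorders the data, run this after any step that depends on the original observation order.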
Two counties appear to have a significant impact on this model: Dade, the most populous, and Palm Beach. The reader is directed to reach his or her own conclusion about the appropriateness of the butterfly ballot.