Estimating Models Using Dummy Variables
A dummy variable is a dichotomous variable that has been coded to represent one category of a variable with a higher level of measurement. It takes on the values 1 and 0; 1 means a case has the attribute and 0 means it does not. In a regression model with an intercept, if there are k categories, we include only k-1 dummy variables, because the full set of k dummies sums to one for every case and any one dummy is therefore perfectly collinear with the remaining set of dummies (Angrist, 2001).
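The k-1 rule can be illustrated with a minimal sketch. The three-category variable and the helper `dummy_code` below are hypothetical and are not part of the GSS analysis; they only show why the full set of dummies is redundant once an intercept is present.

```python
# Hypothetical three-category variable (labels are illustrative only).
categories = ["a", "b", "c", "a", "c", "b"]
levels = sorted(set(categories))  # ['a', 'b', 'c']


def dummy_code(values, levels, reference=None):
    """Return one 0/1 column per level, skipping the reference level if given."""
    kept = [lv for lv in levels if lv != reference]
    return {lv: [1 if v == lv else 0 for v in values] for lv in kept}


full = dummy_code(categories, levels)                     # k columns
reduced = dummy_code(categories, levels, reference="a")   # k - 1 columns

# With all k dummies, every row sums to 1, which duplicates the intercept column.
row_sums = [sum(col[i] for col in full.values()) for i in range(len(categories))]
print(row_sums)         # [1, 1, 1, 1, 1, 1]
print(sorted(reduced))  # ['b', 'c']
```

Dropping the reference level removes the redundancy, and each remaining coefficient is then read as a difference from that reference group.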
Using the General Social Survey dataset, I consider hours per day of watching TV as the dependent variable and choose two predictor variables: race of the respondent and respondent’s sex. I then create dummy variables for respondent’s sex, which has two categories, coded 1 for males and 2 for females. The research question for this analysis is:
“Is there a statistically significant effect on hours per day of watching TV by race of the respondent and respondent’s sex?”
Using SPSS, I first create two dummy variables for the respondent’s sex variable, one for males and one for females. I then leave the female dummy out of the model so that females serve as the comparison (reference) group (Allison, 2002).
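The recoding step can be sketched outside SPSS as follows. This is only an illustration in Python: the function name `male_dummy` and the sample codes are assumptions, and the only detail taken from the text is the coding of 1 for males and 2 for females.

```python
def male_dummy(sex_code):
    """Map the sex code (1 = male, 2 = female) to a 0/1 dummy."""
    if sex_code == 1:
        return 1
    if sex_code == 2:
        return 0
    return None  # anything else (for example a missing-value code) stays unclassified


sex = [1, 2, 2, 1, 2]  # hypothetical respondents
male = [male_dummy(s) for s in sex]
print(male)  # [1, 0, 0, 1, 0]
```

Because females are coded 0 on this dummy, they form the reference group and no separate female dummy is needed in the model.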
The following tables show the output from the analysis, that is, the model summary, the ANOVA and the coefficients table.
|Model||R||R Square||Adjusted R Square||Std. Error of the Estimate|
|Sum of Squares||Df||Mean Square||F||Sig.|
|RACE OF RESPONDENT||.379||.098||.094||3.870||.000|
From the coefficients table, the multiple regression model becomes:
HOURS PER DAY OF WATCHING TV = 2.426 + 0.103 MALES + 0.379 RACE OF THE RESPONDENT.
The intercept implies that the predicted number of hours per day of watching TV is 2.426 when both predictors equal zero. Holding sex constant, a one-unit increase in the race code is associated with an increase of 0.379 hours of TV per day. The dummy variable for males equals 1 if the person is male and 0 if the person is female. So, leaving the race term aside, the predicted number of hours for males is 2.426 + 0.103(1) = 2.529, and for females it is 2.426 + 0.103(0) = 2.426; that is, males are predicted to watch 0.103 hours more than females with the same race code. From the analysis, we therefore find that the race of the respondent has a statistically significant effect on the number of hours per day of watching TV (Angrist, 2001).
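The arithmetic behind these predictions can be checked with a short sketch. The coefficients are the ones in the fitted equation above; the function name `predicted_tv_hours` and the example inputs are illustrative assumptions.

```python
# Coefficients taken from the fitted equation in the text.
INTERCEPT = 2.426
B_MALE = 0.103
B_RACE = 0.379


def predicted_tv_hours(male, race):
    """Predicted hours of TV per day for a given male dummy and race code."""
    return INTERCEPT + B_MALE * male + B_RACE * race


# Leaving the race term aside (race code set to 0 for illustration only).
print(round(predicted_tv_hours(1, 0), 3))  # 2.529
print(round(predicted_tv_hours(0, 0), 3))  # 2.426

# The male-female gap is the same at any race code: it equals B_MALE.
gap = predicted_tv_hours(1, 1) - predicted_tv_hours(0, 1)
print(round(gap, 3))  # 0.103
```

The constant gap is what it means to interpret the dummy coefficient while holding the other predictor fixed.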
Multiple regression analysis makes several key assumptions that must hold for the model to be interpreted properly. These assumptions include linearity, independence of errors, homoscedasticity, absence of multicollinearity, absence of undue influence and normal distribution of errors. Running regression diagnostics helps check whether the model works well for the data. Running the appropriate diagnostics for the model above gives the following output:
|Model||R||R Square||Adjusted R Square||Std. Error of the Estimate||Durbin-Watson|
The table above reports the Durbin-Watson statistic, which provides information about independence of errors. The value of 1.974, which is close to 2.0, indicates little or no first-order correlation between the residuals.
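For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses made-up residuals, not the GSS residuals, simply to show how the statistic behaves at its extremes.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, for illustration only.
print(durbin_watson([1, 1, 1, 1]))    # 0.0  (strong positive autocorrelation)
print(durbin_watson([1, -1, 1, -1]))  # 3.0  (strong negative autocorrelation)
```

The statistic lies between 0 and 4, and values near 2 are the ones consistent with uncorrelated errors.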
|Sum of Squares||df||Mean Square||F||Sig.|
From the ANOVA table, we see that the overall significance value is reported as .000 (that is, p < .001), so the model as a whole is statistically significant.
|Standardized Coefficients||t||Sig.||Collinearity Statistics|
|RACE OF RESPONDENT||.379||.098||.094||3.870||.000||.999||1.001|
From the coefficients table, we have the variance inflation factor (VIF), which indicates the degree of multicollinearity. The VIF is 1.001 for both predictor variables. Since this value is far below the common cutoff of 10, we conclude that the model has no problematic multicollinearity (Angrist, 2001).
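The two collinearity statistics in the table are linked: the VIF is the reciprocal of the tolerance, and the tolerance is one minus the R-squared from regressing a predictor on the other predictors. A minimal sketch, using the tolerance of .999 from the coefficients table (the function names are illustrative):

```python
def vif_from_tolerance(tolerance):
    """VIF is the reciprocal of tolerance."""
    return 1.0 / tolerance


def shared_variance(tolerance):
    """R-squared of a predictor regressed on the other predictors (1 - tolerance)."""
    return 1.0 - tolerance


print(round(vif_from_tolerance(0.999), 3))  # 1.001
print(round(shared_variance(0.999), 3))     # 0.001
```

A tolerance this close to 1 means the two predictors share almost none of their variance.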
| ||Minimum||Maximum||Mean||Std. Deviation||N|
|Std. Predicted Value||-.713||2.752||.000||1.000||1669|
|Standard Error of Predicted Value||.092||.189||.106||.026||1669|
|Adjusted Predicted Value||2.78||3.69||2.98||.249||1669|
|Stud. Deleted Residual||-1.427||8.358||.001||1.006||1669|
|Centered Leverage Value||.001||.005||.001||.001||1669|
The table above reports Cook’s Distance, which indicates undue influence. Cook’s Distance ranges from a minimum of 0.000 to a maximum of 0.105, which is below 1.0; hence we can assume that no case exerts undue influence on this model.
The histogram above shows the distribution of the errors. The distribution is approximately normal; hence the assumption of normally distributed errors is met, or at least there is no significant deviation from normality (Yuan & Lin, 2006).
The scatter plot above provides information about homoscedasticity. Since there is no discernible pattern in the spread of the points, the homoscedasticity assumption has been met by the model. The scatter plot also shows that there is a linear relationship (Angrist, 2001). Therefore, from the above analysis, all the assumptions were met by the regression model obtained (Allison, 2002).
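For one case, Cook's Distance combines the size of its residual with its leverage. The sketch below implements the standard textbook formula; the input values are hypothetical, and the assumption that the leverage passed in is the ordinary (uncentered) hat value should be checked against how a given package reports leverage.

```python
def cooks_distance(residual, leverage, mse, n_params):
    """Cook's Distance for one case.

    residual: raw residual for the case
    leverage: hat value h_ii, assumed uncentered, with 0 <= h_ii < 1
    mse: mean squared error of the fitted model
    n_params: number of estimated coefficients, including the intercept
    """
    return (residual ** 2 / (n_params * mse)) * leverage / (1.0 - leverage) ** 2


# Hypothetical case: a large residual combined with high leverage is influential.
print(cooks_distance(residual=2.0, leverage=0.5, mse=1.0, n_params=2))  # 4.0
# A case that sits exactly on the fitted line has no influence by this measure.
print(cooks_distance(residual=0.0, leverage=0.5, mse=1.0, n_params=2))  # 0.0
```

Values above 1.0 are the ones usually flagged, which is the rule of thumb applied in the text.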
Allison, P. D. (2002). Missing data: Quantitative applications in the social sciences. British Journal of Mathematical and Statistical Psychology, 55(1), 193-196.
Angrist, J. D. (2001). Estimation of limited dependent variable models with dummy endogenous regressors: Simple strategies for empirical practice. Journal of Business & Economic Statistics, 19(1), 2-28.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.