Estimating Models Using Dummy Variables

A dummy variable is a dichotomous variable that has been coded so that a categorical variable can be entered into a regression model. It takes the values 1 and 0: 1 means that a case has the attribute (the statement is true), while 0 means that it does not. In a regression model with an intercept, a variable with k categories is represented by k-1 dummy variables, because any one dummy variable is fully determined by the remaining set of dummies; including all k would make the predictors perfectly collinear (Angrist, 2001).
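The k-1 coding rule can be illustrated with a minimal Python sketch. The helper name dummy_code and the five sample values are hypothetical and are not taken from the General Social Survey file; the sketch only shows how one category is left out as the reference group.

```python
def dummy_code(values, reference):
    """Return k-1 indicator (0/1) columns for a categorical variable.

    One column is built per category except `reference`, which is
    omitted and becomes the comparison group.
    """
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference category does not occur in the data")
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical sample of a sex variable coded 1 = male, 2 = female.
sex = [1, 2, 2, 1, 2]

# Omitting category 2 (female) leaves a single dummy for males.
dummies = dummy_code(sex, reference=2)
male1 = dummies[1]
print(male1)  # [1, 0, 0, 1, 0]
```

With two categories the result is a single column, which is why only one sex dummy can enter the regression.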

Using the General Social Survey dataset, I take hours per day of watching TV as the dependent variable and choose two predictor variables: race of the respondent and respondent’s sex. Respondent’s sex has two categories, coded 1 for males and 2 for females, so I create dummy variables for it. The research question for this analysis is:

“Is there a statistically significant effect on hours per day of watching TV by race of the respondent and respondent’s sex?”

Using SPSS, I first create two dummy variables for the respondent’s sex variable, one for males and one for females. I then leave the female dummy out of the model so that females serve as the comparison group (Allison, 2002).

The following tables show the output from the analysis, that is, the model summary, the ANOVA and the coefficients table.

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .096a   .009       .008                2.577

ANOVA
Model           Sum of Squares   df     Mean Square   F       Sig.
1 Regression    102.885          2      51.443        7.746   .000b
  Residual      11064.576        1666   6.641
  Total         11167.461        1668

Coefficients
Model                  B       Std. Error   Beta   t        Sig.
1 (Constant)           2.426   .158                15.356   .000
  male1                .103    .127         .020   .809     .418
  RACE OF RESPONDENT   .379    .098         .094   3.870    .000
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

From the coefficients table, the multiple regression model becomes;

HOURS PER DAY OF WATCHING TV = 2.426 + 0.103(MALE1) + 0.379(RACE OF RESPONDENT)

In this model, the constant 2.426 is the predicted number of hours per day of watching TV when both predictors equal zero. A one-unit increase in the race code is associated with an increase of 0.379 hours of TV per day, holding sex constant. The dummy variable male1 equals 1 if the respondent is male and 0 if the respondent is female. Holding race constant, the predicted number of hours for males is therefore 0.103 higher than for females: with race set to zero, the prediction is 2.426 + 0.103(1) = 2.529 for males and 2.426 + 0.103(0) = 2.426 for females. From the coefficients table, race of the respondent has a statistically significant effect on hours per day of watching TV (p < .001), whereas the difference between males and females is not statistically significant (p = .418) (Angrist, 2001).
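The arithmetic of the fitted equation can be checked with a short Python sketch. The coefficients are those in the coefficients table; the function name predicted_tv_hours is hypothetical, and the race codes 1 and 3 used in the last two lines are an assumption about how the race variable is coded.

```python
# Unstandardized coefficients taken from the coefficients table.
INTERCEPT = 2.426
B_MALE1 = 0.103
B_RACE = 0.379


def predicted_tv_hours(male1, race):
    """Predicted hours of TV per day from the fitted regression equation."""
    return INTERCEPT + B_MALE1 * male1 + B_RACE * race


# Predictions with the race code held at zero.
male_at_zero = round(predicted_tv_hours(male1=1, race=0), 3)    # 2.529
female_at_zero = round(predicted_tv_hours(male1=0, race=0), 3)  # 2.426

# Assumed race codes 1 and 3, for illustration only.
female_race_1 = round(predicted_tv_hours(male1=0, race=1), 3)   # 2.805
male_race_3 = round(predicted_tv_hours(male1=1, race=3), 3)     # 3.666

print(male_at_zero, female_at_zero, female_race_1, male_race_3)
```

Whatever value race takes, the male and female predictions differ by exactly the male1 coefficient, 0.103.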

Multiple regression analysis makes several key assumptions that must hold for the model to be interpreted properly. These assumptions are linearity, independence of errors, homoscedasticity, absence of multicollinearity, absence of undue influence, and normal distribution of errors. Running regression diagnostics helps check whether the model works well for the data. Running the appropriate diagnostics for the model above gives the following output:

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .096a   .009       .008                2.577                        1.974

The table above includes the Durbin-Watson statistic, which provides information about independence of errors. The value of 1.974 is close to 2.0, which indicates little or no first-order autocorrelation among the residuals, so the independence-of-errors assumption is met.
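For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below implements that definition; the function name durbin_watson and the two short residual series are made up for illustration and are not the residuals from this analysis.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of ordered residuals.

    Values near 2 suggest no first-order autocorrelation, values near 0
    suggest positive autocorrelation, and values near 4 suggest negative
    autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals that flip sign every step (negative autocorrelation).
alternating = durbin_watson([1, -1, 1, -1])

# Made-up residuals that drift steadily upward (positive autocorrelation).
trending = durbin_watson([1, 2, 3, 4])

print(alternating, trending)
```

The alternating series gives a value above 2 and the trending series a value well below 2, which is why a result close to 2 is read as support for independent errors.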

ANOVA
Model           Sum of Squares   df     Mean Square   F       Sig.
1 Regression    102.885          2      51.443        7.746   .000b
  Residual      11064.576        1666   6.641
  Total         11167.461        1668

From the ANOVA table, the overall model is statistically significant, F(2, 1666) = 7.746, p < .001 (SPSS reports the significance value as .000).
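The entries in the ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and R Square is the regression sum of squares divided by the total. The Python sketch below recomputes these quantities from the sums of squares and degrees of freedom in the table; the variable names are my own.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression = 102.885
ss_residual = 11064.576
df_regression = 2
df_residual = 1666

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the regression to the residual mean square.
f_statistic = ms_regression / ms_residual

# R Square is the share of total variation explained by the model.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

print(round(ms_regression, 3), round(ms_residual, 3))
print(round(f_statistic, 3), round(r_squared, 3))
```

The recomputed F and R Square round to 7.746 and .009, matching the SPSS output, so the model is significant overall even though it explains less than 1% of the variation in TV hours.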

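Coefficients
Model                  B       Std. Error   Beta   t        Sig.   Tolerance   VIF
1 (Constant)           2.426   .158                15.356   .000
  male1                .103    .127         .020   .809     .418   .999        1.001
  RACE OF RESPONDENT   .379    .098         .094   3.870    .000   .999        1.001
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.)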

The coefficients table reports the variance inflation factor (VIF), which indicates whether multicollinearity is present. The VIF is 1.001 for both predictor variables. Since these values are far below the common cutoff of 10, there is no evidence of problematic multicollinearity, and this assumption is met (Angrist, 2001).
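Tolerance and VIF carry the same information, since VIF is the reciprocal of tolerance. The Python sketch below checks this for the tolerance values in the table and applies the cutoff of 10 used in the text; the function name vif_from_tolerance and the constant VIF_CUTOFF are hypothetical.

```python
# Rule-of-thumb cutoff used in the text above.
VIF_CUTOFF = 10.0


def vif_from_tolerance(tolerance):
    """Variance inflation factor as the reciprocal of tolerance."""
    if not 0 < tolerance <= 1:
        raise ValueError("tolerance must lie in the interval (0, 1]")
    return 1.0 / tolerance


# Tolerance values reported in the coefficients table.
tolerances = {"male1": 0.999, "RACE OF RESPONDENT": 0.999}

vifs = {name: round(vif_from_tolerance(t), 3) for name, t in tolerances.items()}
all_below_cutoff = all(v < VIF_CUTOFF for v in vifs.values())

print(vifs, all_below_cutoff)
```

A tolerance of .999 means that almost none of the variation in one predictor is shared with the other, which is why the VIF stays so close to 1.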

Residuals Statistics
                                    Minimum   Maximum   Mean    Std. Deviation   N
Predicted Value                     2.80      3.67      2.98    .248             1669
Std. Predicted Value                -.713     2.752     .000    1.000            1669
Standard Error of Predicted Value   .092      .189      .106    .026             1669
Adjusted Predicted Value            2.78      3.69      2.98    .249             1669
Residual                            -3.665    21.092    .000    2.576            1669
Std. Residual                       -1.422    8.185     .000    .999             1669
Stud. Residual                      -1.426    8.191     .000    1.000            1669
Deleted Residual                    -3.685    21.124    .000    2.581            1669
Stud. Deleted Residual              -1.427    8.358     .001    1.006            1669
Mahal. Distance                     1.132     7.946     1.999   1.847            1669
Cook’s Distance                     .000      .105      .001    .004             1669
Centered Leverage Value             .001      .005      .001    .001             1669

The table above includes Cook’s Distance, which indicates whether any case has undue influence on the model. Cook’s Distance ranges from a minimum of 0.000 to a maximum of 0.105, which is well below 1.0, so we can conclude that no case has undue influence in this model.

The histogram above shows the distribution of the errors. The distribution is approximately normal, so the assumption of normally distributed errors is met; that is, there is no substantial deviation from normality (Yuan & Lin, 2006).

The scatter plot above provides information about homoscedasticity. Since there is no discernible pattern in the spread of the points, the homoscedasticity assumption is met by the model. The scatter plot also shows that there is a linear relationship (Angrist, 2001). Therefore, based on the above analysis, all the assumptions were met by the regression model obtained (Allison, 2002).

References

Allison, P. D. (2002). Missing data: Quantitative applications in the social sciences. British Journal of Mathematical and Statistical Psychology, 55(1), 193-196.

Angrist, J. D. (2001). Estimation of limited dependent variable models with dummy endogenous regressors: Simple strategies for empirical practice. Journal of Business & Economic Statistics, 19(1), 2-28.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.