Estimating Models Using Dummy Variables

Name

School

Instructor

Date due

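A dummy variable is a dichotomous variable that has been coded so that a categorical variable can be entered into an analysis that requires a higher level of measurement. It takes the values 1 and 0: 1 means that a case has the attribute (the statement is true), and 0 means that it does not. In a regression model with a constant, a variable with k categories is represented by k-1 dummy variables, because the full set of k dummies sums to 1 for every case and is therefore perfectly collinear with the constant; equivalently, any one dummy is an exact linear function of the remaining dummies (Angrist, 2001). The omitted category serves as the reference group.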
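The k-1 rule can be illustrated with a short sketch. The Python snippet below is only an illustration, not part of the SPSS analysis; the helper name `dummy_code` and the three-category `region` variable are hypothetical.

```python
def dummy_code(values, reference=None):
    """Return {category: 0/1 column} for every category except the reference.

    With reference=None, all k categories are kept (used only to show the
    collinearity problem below).
    """
    categories = sorted(set(values))
    return {
        cat: [int(v == cat) for v in values]
        for cat in categories
        if cat != reference
    }

# Hypothetical variable with k = 3 categories, so k - 1 = 2 dummies are needed.
region = ["north", "south", "west", "south", "north"]
dummies = dummy_code(region, reference="north")
print(dummies)  # {'south': [0, 1, 0, 1, 0], 'west': [0, 0, 1, 0, 0]}

# Keeping all k dummies makes every row sum to 1, which duplicates the
# constant column of the regression (perfect collinearity).
all_k = dummy_code(region)
row_sums = [sum(col[i] for col in all_k.values()) for i in range(len(region))]
print(row_sums)  # [1, 1, 1, 1, 1]
```

Cases in the reference category ("north" here) are identified by zeros on every retained dummy, which is why no separate column is needed for them.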

Using the General Social Survey dataset, I take hours per day of watching TV as the dependent variable and choose two predictor variables: race of the respondent and respondent’s sex. I then create dummy variables for respondent’s sex, which has two categories coded 1 for males and 2 for females. The research question for this analysis is:

“Is there a statistically significant effect on hours per day of watching TV by race of the respondent and respondent’s sex?”

Using SPSS, I first create two dummy variables for the respondent’s sex, one for males and one for females. I then leave the female dummy out of the model so that females serve as the comparison group (Allison, 2002).
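The same recoding step can be sketched outside SPSS. The Python snippet below is a minimal illustration, assuming the sex variable is coded 1 for males and 2 for females as described above; the sample values, the function name `recode_male1`, and the choice to treat any other code as missing are assumptions made for the example.

```python
def recode_male1(sex_code):
    """Return 1 for male (code 1), 0 for female (code 2), None otherwise."""
    if sex_code == 1:
        return 1
    if sex_code == 2:
        return 0
    return None  # assumption: any other code is treated as missing

sex = [1, 2, 2, 1, None]  # hypothetical respondents
male1 = [recode_male1(s) for s in sex]
print(male1)  # [1, 0, 0, 1, None]
```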

The following tables show the output from the analysis, that is, the model summary, the ANOVA and the coefficients table.

Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	.096a	.009	.008	2.577

Model		Sum of Squares	df	Mean Square	F	Sig.
1	Regression	102.885	2	51.443	7.746	.000b
	Residual	11064.576	1666	6.641
	Total	11167.461	1668

Model		Unstandardized Coefficients		Standardized Coefficients	t	Sig.
		B	Std. Error	Beta
1	(Constant)	2.426	.158		15.356	.000
	male1	.103	.127	.020	.809	.418
	RACE OF RESPONDENT	.379	.098	.094	3.870	.000

From the coefficients table, the multiple regression model is:

HOURS PER DAY OF WATCHING TV = 2.426 + 0.103(MALE1) + 0.379(RACE OF RESPONDENT)

The constant, 2.426, is the predicted number of hours per day of watching TV when both predictors equal zero. Holding sex constant, a one-unit increase in the race code is associated with an increase of 0.379 hours per day of watching TV. The dummy variable male1 equals 1 if the respondent is male and 0 if the respondent is female. Holding race constant (here at zero), the predicted number of hours of watching TV for males is 2.426+0.103(1) = 2.529, and for females it is 2.426+0.103(0) = 2.426, so males are predicted to watch 0.103 hours more per day than females. This difference is not statistically significant (t = 0.809, p = .418), whereas the coefficient for race is significant (t = 3.870, p < .001). The analysis therefore indicates that the race of the respondent has a statistically significant effect on the number of hours per day of watching TV, while the respondent’s sex does not (Angrist, 2001).
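The predictions from the fitted equation can be checked with a few lines of Python. This is a sketch only: the coefficients are taken from the coefficients table, the function name `predicted_tv_hours` is hypothetical, and the race codes 1 to 3 used in the last two lines are an assumption about how the variable is coded, not something stated in the output.

```python
# Coefficients copied from the SPSS coefficients table.
INTERCEPT = 2.426
B_MALE1 = 0.103
B_RACE = 0.379

def predicted_tv_hours(male1, race):
    """Predicted hours per day of watching TV from the fitted equation."""
    return INTERCEPT + B_MALE1 * male1 + B_RACE * race

# Male versus female with the race term set to zero.
print(round(predicted_tv_hours(1, 0), 3))  # 2.529
print(round(predicted_tv_hours(0, 0), 3))  # 2.426

# Assumption: race is coded 1, 2, 3. Under that coding the smallest and
# largest possible predictions are:
print(round(predicted_tv_hours(0, 1), 3))  # 2.805
print(round(predicted_tv_hours(1, 3), 3))  # 3.666
```

Under the assumed coding, the last two values agree with the minimum (2.80) and maximum (3.67) predicted values reported in the residuals statistics table.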

Multiple regression analysis makes several key assumptions that must hold for the model to be interpreted properly. These assumptions are linearity, independence of errors, homoscedasticity, absence of multicollinearity, absence of undue influence and normal distribution of errors. Running regression diagnostics helps check whether the model works well for the data. Running the appropriate diagnostics for the model above gives the following output:

Model	R	R Square	Adjusted R Square	Std. Error of the Estimate	Durbin-Watson
1	.096a	.009	.008	2.577	1.974

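The table above includes the Durbin-Watson statistic, which provides information about independence of errors. The statistic ranges from 0 to 4, and values close to 2 indicate uncorrelated residuals. The value of 1.974 is approximately 2.0, which indicates little or no correlation between adjacent residuals, so the assumption of independent errors is met.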
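For readers who want to see how the statistic is built, the sketch below computes the Durbin-Watson statistic from a list of residuals: the sum of squared differences between successive residuals divided by the sum of squared residuals. The function name `durbin_watson` and the example residuals are hypothetical; the actual residuals from this analysis are not reproduced here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign at every step (negative
# autocorrelation) push the statistic above 2.
print(durbin_watson([1, -1, 1, -1]))  # 3.0

# Hypothetical residuals that never change (extreme positive
# autocorrelation) push it to 0.
print(durbin_watson([1, 1, 1, 1]))  # 0.0
```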

Model		Sum of Squares	df	Mean Square	F	Sig.
1	Regression	102.885	2	51.443	7.746	.000b
	Residual	11064.576	1666	6.641
	Total	11167.461	1668

From the ANOVA table, the overall model has F(2, 1666) = 7.746 with a significance value of .000 (p < .001), so the model as a whole is statistically significant.
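The entries of the ANOVA table are linked by simple arithmetic, which the following Python sketch reproduces from the reported sums of squares and degrees of freedom. The variable names are chosen for illustration only.

```python
# Values copied from the ANOVA table.
ss_regression, df_regression = 102.885, 2
ss_residual, df_residual = 11064.576, 1666

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_statistic = ms_regression / ms_residual

# R Square is the share of the total sum of squares explained by the model.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

print(round(ms_regression, 3))  # 51.443
print(round(ms_residual, 3))    # 6.641
print(round(f_statistic, 3))    # 7.746
print(round(r_squared, 3))      # 0.009
```

The results match the Mean Square, F and R Square values reported by SPSS.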

Model		Unstandardized Coefficients		Standardized Coefficients	t	Sig.	Collinearity Statistics
		B	Std. Error	Beta			Tolerance	VIF
1	(Constant)	2.426	.158		15.356	.000
	male1	.103	.127	.020	.809	.418	.999	1.001
	RACE OF RESPONDENT	.379	.098	.094	3.870	.000	.999	1.001

From the coefficients table, the variance inflation factor (VIF) indicates whether multicollinearity is present. The VIF is 1.001 for both predictor variables. Since these values are far below 10, we conclude that the model has no multicollinearity problem (Angrist, 2001).
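Tolerance and VIF are two views of the same quantity: tolerance is 1 minus the R Square obtained when a predictor is regressed on the other predictors, and VIF is the reciprocal of tolerance. The Python sketch below shows the relationship using the tolerance from the table; the function name `vif_from_r_squared` is hypothetical.

```python
def vif_from_r_squared(r_squared_j):
    """VIF for predictor j, given R^2 from regressing it on the other predictors."""
    if not 0 <= r_squared_j < 1:
        raise ValueError("R squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)

# Tolerance reported in the coefficients table for both predictors.
tolerance = 0.999
vif = 1.0 / tolerance
print(round(vif, 3))  # 1.001

# The cut-off of 10 used above corresponds to a predictor that shares
# 90% of its variance with the other predictors.
print(round(vif_from_r_squared(0.9), 3))  # 10.0
```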

	Minimum	Maximum	Mean	Std. Deviation	N
Predicted Value	2.80	3.67	2.98	.248	1669
Std. Predicted Value	-.713	2.752	.000	1.000	1669
Standard Error of Predicted Value	.092	.189	.106	.026	1669
Adjusted Predicted Value	2.78	3.69	2.98	.249	1669
Residual	-3.665	21.092	.000	2.576	1669
Std. Residual	-1.422	8.185	.000	.999	1669
Stud. Residual	-1.426	8.191	.000	1.000	1669
Deleted Residual	-3.685	21.124	.000	2.581	1669
Stud. Deleted Residual	-1.427	8.358	.001	1.006	1669
Mahal. Distance	1.132	7.946	1.999	1.847	1669
Cook’s Distance	.000	.105	.001	.004	1669
Centered Leverage Value	.001	.005	.001	.001	1669

The table above includes Cook’s Distance, which measures undue influence. Cook’s Distance ranges from a minimum of 0.000 to a maximum of 0.105. Since the maximum is below 1.0, we can assume that no case has undue influence on this model.
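As a rough guide to what this statistic measures, the sketch below uses the standard textbook formula for Cook’s Distance, which combines the size of a case’s residual with its leverage. The function name `cooks_distance` and the input values are hypothetical and are not taken from the SPSS output; the leverage argument is the full hat-matrix value for the case.

```python
def cooks_distance(residual, leverage, mse, n_params):
    """Cook's Distance for one case.

    residual: raw residual of the case
    leverage: hat-matrix diagonal for the case (0 <= h < 1)
    mse:      residual mean square of the model
    n_params: number of estimated coefficients, including the constant
    """
    if not 0 <= leverage < 1:
        raise ValueError("leverage must be in [0, 1)")
    return (residual ** 2 / (n_params * mse)) * leverage / (1 - leverage) ** 2

# Hypothetical case: moderate residual and leverage, three coefficients
# (constant plus two predictors, as in the model above).
d = cooks_distance(residual=2.0, leverage=0.1, mse=4.0, n_params=3)
print(d < 1.0)  # True
```

A case needs both a large residual and high leverage to approach the cut-off of 1.0.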

The histogram of the residuals shows the distribution of errors. The distribution is fairly normal; hence, the assumption of normally distributed errors is met, that is, there is no significant deviation from normality (Yuan & Lin, 2006).

The scatter plot of the residuals provides information about homoscedasticity. Since there is no discernible pattern in the spread of the points, the homoscedasticity assumption has been met by the model. The scatter plot also shows that there is a linear relationship (Angrist, 2001). Therefore, from the above analysis, all the assumptions were met by the regression model obtained (Allison, 2002).

References

Allison, P. D. (2002). Missing data: Quantitative applications in the social sciences. British Journal of Mathematical and Statistical Psychology, 55(1), 193-196.

Angrist, J. D. (2001). Estimation of limited dependent variable models with dummy endogenous regressors: Simple strategies for empirical practice. Journal of Business & Economic Statistics, 19(1), 2-28.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.
