One of the tricky parts about dummy coded (0/1) variables is keeping track of what’s a 0 and what’s a 1.
This is made particularly tricky because sometimes your software switches them on you.
Here’s one example in a question I received recently. The context was a Linear Mixed Model, but this can happen in other procedures as well.
I dummy code my categorical variables “0” or “1” but for some reason in the (more…)
Whenever I get email questions whose answers I think would benefit others, I like to answer them here. I leave out the asker’s name for privacy, but this is a great question about dummy coding:
First of all, thanks for all those helpful information you provided! Thanks sincerely for all your efforts!
Actually I am here to ask a technical question. See, I have 6 locations (let’s say A, B, C, D, E, and F), and I want to see the location effect on the outcome using OLS models.
I know that if I included 5 dummy location variables (6 locations in total, with A as the reference group) in 1 block of the regression analysis, the result would be based on the comparison with the reference location.
Then what if I put 6 dummies (for example, the 1st dummy would be “1” for A location, and “0” for otherwise) in 1 block? Will it be a bug? If not, how to interpret the result?
Thanks a lot!
Great question!
If you put in a 6th dummy code for Location A, your reference group, the model will actually blow up. (Yes, that’s a technical term).
This is a case of perfect multicollinearity. The six dummy variables always add up to 1, which exactly duplicates the intercept, so the model can’t be estimated uniquely.
It’s the same situation you learned back in Algebra where you have one equation and two unknowns. The problem isn’t that it can’t be solved. The problem is that there are an infinite number of equally good solutions.
If an observation falls in Location A, the reference group, we’ve already gotten that information from the other 5 dummy variables. That observation would have a 0 on all of them. So we already know its location is A. We don’t need another dummy variable to tell the model that. It’s redundant information. And so perfectly redundant that the model will choke.
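Here’s a minimal sketch of that redundancy in plain Python, assuming a model with an intercept and one made-up observation per location. The coefficient values and the helper names (`design_row`, `predict`) are invented for illustration and don’t come from any real analysis. It shows that the six dummies always add up to the intercept column, and that two different sets of coefficients give exactly the same predictions.

```python
# Hypothetical data: one observation per location, labeled as in the question.
locations = ["A", "B", "C", "D", "E", "F"]

def design_row(loc, levels):
    """Intercept (1) followed by one 0/1 dummy per level in `levels`."""
    return [1] + [1 if loc == lev else 0 for lev in levels]

# Intercept plus all six dummies: 7 columns.
X_full = [design_row(loc, locations) for loc in locations]

# In every row the six dummies sum to 1, which is exactly the intercept column.
dummy_sums = [sum(row[1:]) for row in X_full]

def predict(X, beta):
    """Fitted value for each row: sum of column value times coefficient."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

# Two different coefficient vectors (made-up numbers).
beta_1 = [10, 0, 2, 4, 6, 8, 10]
# Same as beta_1, but the intercept is 3 higher and every dummy is 3 lower.
beta_2 = [13, -3, -1, 1, 3, 5, 7]

# Both vectors produce identical fitted values, so the data cannot choose
# between them (or between infinitely many other shifted versions).
same_fit = predict(X_full, beta_1) == predict(X_full, beta_2)
print(dummy_sums)
print(same_fit)
```

Any amount you add to the intercept can be subtracted from all six dummy coefficients without changing a single fitted value. Dropping one dummy (the reference group) removes that free shift, which is why five dummies work and six don’t.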
Dummy coding is one of the topics I get the most questions about. It can get especially tricky to interpret when the dummy variables are also used in interactions, so I’ve created some resources that really dig in deeply.
Yesterday I gave a little quiz about interpreting regression coefficients. Today I’m giving you the answers.
If you want to try it yourself before you see the answers, go here. (It’s truly little, but if you’re like me, you just cannot resist testing yourself).
True or False?
1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA. (more…)
Here’s a little tip.
When you construct dummy variables, make it easy on yourself to remember which code is which. Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results.
Make the codes inherent in the dummy variable name.
So instead of a variable named Gender with values of 1=Female and 0=Male, call the variable Female.
Instead of a set of dummy variables named MaritalStatus1 with values of 1=Married and 0=Single, along with MaritalStatus2 with values 1=Divorced and 0=Single, name the same variables Married and Divorced.
And if you’re new to dummy coding, this has the extra bonus of making the dummy coding intuitive. It’s just a set of yes/no variables about all but one of your categories.
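As a rough sketch of this naming tip in plain Python: the helper `make_dummies` and the sample data below are hypothetical, written only to illustrate the idea. Each dummy is named after the category it flags with a 1, and the reference category gets no variable at all.

```python
def make_dummies(values, reference):
    """Return {category_name: [0/1, ...]} for every category except `reference`."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Made-up marital status values for five respondents.
marital_status = ["Single", "Married", "Divorced", "Married", "Single"]

# "Single" is the reference group, so it gets no dummy of its own.
dummies = make_dummies(marital_status, reference="Single")

for name, codes in dummies.items():
    print(name, codes)
```

With names like Married and Divorced, a 1 always means “yes, this category,” and anyone who is 0 on both must be in the reference group, Single.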
Most multiple imputation methods assume multivariate normality, so a common question is how to impute missing values for categorical variables.
Paul Allison, one of my favorite authors of statistical information for researchers, did a study that showed that the most common method actually gives worse results than listwise deletion. (Did I mention I’ve used it myself?) (more…)
I was recently asked about whether it’s okay to treat a Likert scale as continuous when it’s a predictor in a regression model. Here’s my reply. In the question, the researcher asked about logistic regression, but the same answer applies to all regression models.
1. There is a difference between a Likert scale item (a single 1-7 scale, e.g.) and a full Likert scale, which is composed of multiple items. If it is a full Likert scale, with a combination of multiple items, go ahead and treat it as numerical. (more…)