Missing data, and multiple imputation specifically, is an area of statistics that is changing rapidly. Research is ongoing, and each year brings new findings on best practices and new techniques in software.
The downside for researchers is that some of the recommendations missing data statisticians were making even five years ago have changed.
Remember that there are three goals of multiple imputation, or any missing data technique: (1) unbiased parameter estimates in the final analysis (regression coefficients, group means, odds ratios, etc.); (2) accurate standard errors of those parameter estimates, and therefore accurate p-values in the analysis; and (3) adequate power to find meaningful parameter values significant.
So here are a few updates that will help you achieve these goals.
1. Don’t round off imputations for dummy variables. Many common imputation techniques, like MCMC, assume normally distributed variables. The old suggestion for imputing categorical variables was to dummy code them, impute them as if they were continuous, then round the imputed values off to 0 or 1. Recent research, however, has found that rounding off imputed values actually leads to biased parameter estimates in the analysis model (Allison, 2005). Counter-intuitive as it is, you get better results by leaving the imputed values at impossible values (the R sketch below the list shows this).
2. Don’t transform skewed variables. Likewise, when you transform a variable to meet normality assumptions before imputing, you change not only the distribution of that variable but also its relationships with the other variables used to impute it. Transforming can lead to imputing outliers, creating more bias than simply imputing the skewed variable as it is (von Hippel, 2009).
3. Use more imputations. The advice for years has been that 5 to 10 imputations are adequate. And while this is true for unbiasedness, you can get inconsistent results if you run the multiple imputation more than once. Bodner (2008) recommends using at least as many imputations as the percentage of missing data: with 30% missing, for example, use about 30 imputations (see the sketch after the list). Since running more imputations isn’t any more work for the data analyst, there’s no reason not to.
4. Create multiplicative terms before imputing. When the analysis model contains a multiplicative term, like an interaction term or a quadratic, create the multiplicative term first and then impute it as just another variable (as in the R sketch below). Imputing first and then creating the multiplicative terms biases the regression coefficients of the multiplicative term (von Hippel, 2009).
5. Alternatives to multiple imputation aren’t usually better. Multiple imputation assumes the data are missing at random. In most tests, if an assumption is not met, there are better alternatives—a nonparametric test or an alternative type of model. This is often not true with missing data. Alternatives like listwise deletion (a.k.a. ignoring it) have more stringent assumptions. So do nonignorable missing data techniques like Heckman’s selection models.
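If you work in R, here is a minimal sketch of points 1, 3, and 4 using the mice package. The data frame dat and its variables (y, x1, x2, and a 0/1 dummy called female) are hypothetical placeholders; your own imputation model will need more care than this.

```r
# Minimal sketch, assuming a hypothetical data frame 'dat' with outcome y,
# predictors x1 and x2, and a 0/1 dummy 'female', each with some missing values.
library(mice)

# Point 4: create the multiplicative term BEFORE imputing, and let it be
# imputed as just another variable.
dat$x1x2 <- dat$x1 * dat$x2

# Point 1: impute the numeric dummy with a normal model ("norm") and do NOT
# round the imputed values back to 0 or 1.
ini  <- mice(dat, maxit = 0)   # dry run to pull out the default methods
meth <- ini$method
meth["female"] <- "norm"

# Point 3: use roughly as many imputations as the percentage of missing data
# (Bodner, 2008), e.g. about 30 imputations for 30% missing.
m <- max(5, ceiling(100 * mean(is.na(dat))))

imp <- mice(dat, m = m, method = meth, seed = 123)

# Fit the analysis model in each imputed data set and pool the results.
fit <- with(imp, lm(y ~ x1 + x2 + x1x2 + female))
summary(pool(fit))
```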
References:
Allison, P. D. (2005). Imputation of categorical variables with PROC MI. Paper presented at the 30th Meeting of SAS Users Group International, April 10-13, Philadelphia, PA.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15(4), 651-675.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.
von Hippel, P. T. (2009). How to impute squares, interactions, and other transformed variables. Sociological Methodology, 39.
Shelmith says
Hi Karen,
Mice uses regression models to impute missing values. These models are used for prediction, so the R-squared values should be large. However, most of the models have a very small R-squared and are still used for prediction. What’s your take on this?
titin says
I will work in R and use the mice algorithm. I have read the Ian White (2011) article, which says the number of imputations should match the percentage of missing values. But that statement applies under the MCAR mechanism; what do you think about the MAR mechanism, please?
Thank you very much,
Best regards,
Titin
Sonja says
Dear Karen,
von Hippel (2009) writes that his findings are true for parametric imputation methods. Do they also hold for nonparametric imputation methods, such as techniques based on decision trees like random forests?
Best,
Sonja
Alban says
I have conducted analysis of the missing data in my data set and it is missing at random. According to SPSS guidelines, if this is the case I have to use Multiple Imputation procedures with a linear regression method to impute values for the missing data. SPSS derives 5 different values for each missing value and generates a complete dataset with imputed values in five versions/imputations. However, it does not provide me with one single/pooled dataset that takes all imputations into account.
I just wanted to draw on your experience on this: do I simply take the average of these 5 different imputed values to generate the pooled value and have a complete dataset, or is there another procedure I need to be aware of? I know that SPSS does analysis based on the pooled dataset as well as on each imputation separately. But I need the dataset itself to use in other software like STATA or eViews. Please help on generating the final pooled dataset with all variables.
Do I simply use the following, for example:
Final Pooled Value = (Missing Value from Imputation 1 + Missing Value from Imputation 2 + Missing Value from Imputation 3 + ……… Missing Value from Imputation ‘n’ ) / Number of imputations
Kind regards,
Alban
Peter says
Dear Karen,
I am using longitudinal data for my project and was wondering whether it is possible to use later data in my MI model?
For example, my mediator variable is self-esteem global score at age 8. Would I be able to include later measures of self-esteem in my model (e.g. self-esteem at 10 or 12), or would I only be able to use self-esteem scores that precede the score at age 8?
I’m using SPSS v22.
Thanks.
Christina says
Dear Karen
First: calculating the pooled dataset => is there a special algorithm?
I am using MI in SPSS 21 and want to do analyses with PROCESS (the script by Hayes), which unfortunately does not yet support an MI dataset.
Is there a specific algorithm used to calculate the pooled results, or may I simply calculate the mean of all my imputations for the specific variables in order to get a pooled dataset to go on using PROCESS?
Second (similar to the above): Is it okay to treat an ordinal variable that does not work well in MI (e.g. categories with only a few cases) as a scale variable, and even impute its data as a scale variable?
=> Or would it be better (less biased) to define it as categorical dummies and only use them as predictor variables (and not impute data for them, because of the problems with imputing dummies)?
Thank you!
Christina
Christina says
found it – thank you!
Stefan says
Hi Christina,
great that you found it – would you kindly share it with the rest of us?
Also – did you happen to find a recommendation for ordinal data? I have the same problem…
Thanks very much!
Stefan
Christina says
Dear Karen
I’d like to know the source of the recommendation not to transform non-normally distributed variables and to run further analyses with negative or out-of-range values.
Do you have any specific source I could cite?
Thank you!
Christina
Ron says
Hi Karen
For missing values in Likert-type scales, should I treat them as scale variables or ordinal variables?
I have many items, each with 6 levels (1 to 6), so when I treated them as ordinal variables, the multiple imputation in SPSS gave an error like:
The imputation model for question1 contains more than 100 parameters. No missing values will be imputed. Reducing the number of effects in the imputation model, by merging sparse categories of categorical variables, changing the measurement level of ordinal variables to scale, removing two-way interactions, or specifying constraints on the roles of some variables, may resolve the problem. Alternatively increase the maximum number of parameters allowed on the MAXMODELPARAM keyword of the IMPUTE subcommand.
Execution of this command stops.
Is there any potential problem (bias) if I treat them as scale variables in the imputation?
Thank you
Ron
Karen says
Hi Ron,
That’s a great question.
You should be fine. SPSS will assume that scale variables follow a normal distribution. Even so, research has found that the normality assumption isn’t as important as previously thought.
Caitlin says
Thanks for the question Ron, and the reply Karen- I also have this problem so it’s really helpful.
I have a further question:
I have three Likert-type scales that I want to impute data for. For one scale (to be entered as a covariate in my statistical analysis), Little’s MCAR test was significant, indicating that the missing data (less than 0.05% on this scale) were not MCAR. Little’s MCAR test was non-significant for the two other scales as well as for the dataset as a whole. Should I impute for each scale separately or for the dataset as a whole?
Kat says
Hi Karen,
I was just wondering whether there is a limit on the number of values that should be imputed for one participant. For example, I asked participants to fill in 5 questionnaires, as I am investigating relationships between 5 different constructs. Some participants stopped the study completely after the first or second questionnaire, so I wasn’t going to impute values for their remaining constructs as the study was incomplete. Some, however, have maybe got a missing value for subscales within one or two questionnaires, but have generally continued until the end.
I was going to keep the participants with one incomplete questionnaire but then not impute any values for those who have two or more incomplete ones.
What do you think? Any help would be much appreciated as I am very new to it all.
Many thanks,
Kat
Jonathan Bartlett says
Hi Karen
Regarding 2) and 4), on imputing interaction terms and non-linear terms: von Hippel’s proposal to impute the interaction terms and non-linear terms as if they are just other variables works for linear regression under the missing completely at random assumption. But if the data are missing at random, or the model is not linear (e.g. logistic regression), it is not unbiased; see the article by Seaman et al here: http://www.biomedcentral.com/1471-2288/12/46
That is not to say the method should never be used; in that article we recommended it as the best approach for linear regression, given what is currently possible in imputation software. But for models like logistic regression, as we showed in simulations, it can give quite badly biased estimates.
We have recently proposed a new approach for imputing covariates, which can (correctly) allow for interactions and non-linear terms in the substantive model (analysis model), and free Stata software is available. The (open access) paper describing the method is here
http://doi.org/10.1177/0962280214521348
and for the Stata software see http://www.missingdata.org.uk
Best wishes
Jonathan
Angelita says
Hi Karen,
Is there an upper limit for number of variables to be put into the imputation model? I want to impute missing data for a personality questionnaire (NEO-PI) that has 240 questions.
Will this take a very long time to impute?
Many thanks,
Angelita
Karen says
Hi Angelita,
There is, although I don’t remember the number off the top of my head. The Graham article listed above has a bit of detail about this. He suggests sometimes you need to impute the entire scale, rather than individual items. I would suggest looking at that article for details.
Kris says
I have a question relating to dummy variables. Should we create dummy variables, for example creating dummies for race (black, white, and so on), before or after imputation?
Thank you so much! Great article!
Karen says
Hi Kris,
Depending on the software you’re using, generally before. The imputations won’t be all 1s and 0s.
Kris says
Thanks so much, Karen. I use STATA. I’ve imputed data before, but it’s been a long while. I’m looking to impute data that I just acquired, because it’s missing too high of a percentage. I will also use dummy variables. Thank you for clarifying that dummy variables should be created before imputation. As a side note, I am very grateful I found this website. The information here has been very helpful.
Best,
Kris
Susanna says
Thank you very much for this article!
I have a question about point 1. I have Likert-scale questionnaires with missing data and want to use MI. I thought I could just impute the items after declaring them as continuous and then round off the imputed data. I have to sum them up and build mean scores later for the different scales. If I don’t round them off, I would get “impossible” mean scale scores, wouldn’t I?
(I’m validating one of the questionnaires, so I’m analyzing correlations and doing a factor analysis later.)
Thank you!!!
Karen says
Hi Susanna,
You don’t want to round off your imputations. It’s actually okay to have “impossible” values. If you round off, you will not get the benefit of the random error that is being created by having multiple imputations. Remember the point is to have accurate parameter estimates, not accurate data.
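If you (or other readers) happen to be doing this in R with the mice package, here is a minimal sketch of that workflow. The mids object imp, the Likert items q1 to q5, and the outcome y are hypothetical; the point is just that the mean scale score gets built inside each imputed data set, unrounded values and all, and the analysis is then fitted and pooled.

```r
# Minimal sketch (R, mice), assuming an existing mids object 'imp' that holds
# unrounded imputations of hypothetical Likert items q1-q5 and an outcome y.
library(mice)

# Stack the original data and all imputations in long format.
long <- complete(imp, action = "long", include = TRUE)

# Build the mean scale score inside each data set, keeping the unrounded values.
long$scale_mean <- rowMeans(long[, c("q1", "q2", "q3", "q4", "q5")])

# Convert back to a mids object, fit the analysis in each imputation, and pool.
imp2 <- as.mids(long)
fit  <- with(imp2, lm(y ~ scale_mean))
summary(pool(fit))
```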
Tobias says
Dear Karen,
Thank you a lot for the helpful article. I have a question on point 2 (not transforming skewed variables): I am working with data on aid disbursements, where ~41% of the data are missing. As you can imagine, aid disbursements are heavily skewed, meaning I get a lot of implausible negative values in my imputed data (the overall means of the imputed data are okay, though). If I log-transform aid disbursements (and consequently aid commitments, which are in the imputation model), then I get no negative values. Do you still think it would be better to accept the implausible negative values in the untransformed imputed data because those imputations should be less biased overall?
Many thanks in advance!
Tobias
Karen says
Hi Tobias,
I know it seems strange to have implausible imputed values, but the important goal is to have good parameter estimates (regression coefficients and means) and standard errors.
In many situations, having implausible values gives better estimates.
Prasad says
Hi Karen,
I’m performing regression analysis on a dataset of >500,000 individuals. The data for one of the important covariates is missing (not available) for >50% of individuals. The values for all other factors and covariates included in the model are complete. I understand the LR will exclude missing values and perform the analysis, but the GLR will account for all the individuals and perform the analysis.
What steps should I take to account for the missing data, and how? I’m using SPSS version 20.
Thanks
Prasad
Karen says
Hi Prasad,
That is a really big question, and depends on many issues in your design. I hesitate to give advice on how to analyze without knowing all the details, and in complicated situations like yours, it often takes trying a few things.
I have a webinar coming up, which is an overview of the approaches to missing data. You may find that helpful to get you started. Otherwise, I would suggest talking to a statistical consultant.
https://www.theanalysisfactor.com/approaches-to-missing-data/
Karen
Swarna says
Hi Karen,
Very informative article. I do understand the part about imputing multiplicative terms, but say you have a variable like BMI (along with 25 other variables). BMI and all the others have a small portion missing. In my analysis model we want to use SD scores instead of BMI, but SD scores have to be calculated using a specific formula that takes age, sex, and the BMI value into account. My take on this is that I should impute BMI first and then calculate the SD scores using that formula. If I were to impute something like SD scores directly, the imputed values would not take age, sex, and BMI into account the way they should. Am I right?
Many thanks in advance.
Karen says
Hi Swarna,
Depending on how much of the data are missing, it may not matter. The article I referenced really only talks about multiplicative terms, like interactions, when the component variables are also in the model. I’m not sure how your SD scores are calculated, and it sounds like you won’t include BMI as well.
Graham has an article that discusses imputing scale scores, for example the total score out of 20 when some of the 20 items are missing. He says it’s best to impute the individual items if possible, but often there are just too many variables, especially in a data set with more than one scale. In that case you have to just impute the total score, or average around it, which works in some situations.
So I would say it depends on your formula, and which situation yours is most like. It sounds like the latter, in which case I agree–impute BMI, age, sex, etc., then calculate the SD scores.
Karen
João says
Hello Karen,
Thank you for the insights. I have unbalanced panel data where I can assume MAR, and I want to perform principal components analysis after doing the imputation. My concern is that I want to interact these principal components, so I cannot create the interaction term beforehand, and I was wondering whether your “update nº 4” would also apply to a situation of this kind.
Thanks!
João
Karen says
Hi Joao,
I would think it would. I would suggest reading the von Hippel article to get more information, though. It’s hard for me to say what you should do without digging into the data.
Karen
Lily says
Hi Karen,
Can you let me know where you found your reference for not transforming skewed variables or transforming variables prior to using multiple imputation?
Many thanks!
Karen says
Hi Lily,
It’s the von Hippel reference at the bottom of the page.
Karen
Karen says
You’re welcome Aaron. Glad you found it helpful.
Karen
Aaron says
Thanks for this article – you answered two of my imputation question (i.e., imputing interactions a priori or after imputation of the main effects; and whether to transform skewed data) in a very straightforward manner. Thanks for the references to these articles.