Do you find quizzes irresistible? I do.
Here’s a little quiz about working with missing data:
True or False?
1. Imputation is really just making up data to artificially inflate results. It’s better to just drop cases with missing data than to impute.
2. I can just impute the mean for any missing data. It won’t affect results, and it improves power.
3. Multiple Imputation is fine for the predictor variables in a statistical model, but not for the response variable.
4. Multiple Imputation is always the best way to deal with missing data.
5. When imputing, it’s important that the imputations be plausible data points.
6. Missing data isn’t really a problem if I’m just doing simple statistics, like chi-squares and t-tests.
7. The worst thing that missing data does is lower sample size and reduce power.
Answers are in the next post.
In my last post, I gave a little quiz about missing data. This post has the answers.
If you want to try it yourself before you see the answers, go here. (It’s a short quiz, but if you’re like me, you find testing yourself irresistible).
True or False?
1. Imputation is really just making up data to artificially inflate results. It’s better to just drop cases with missing data than to impute.
False. Done well, imputation doesn’t fabricate results; it preserves the information in the data you did collect, whereas dropping cases loses power and can itself bias estimates.
You’ve probably experienced this before. You’ve done a statistical analysis, you’ve figured out all the steps, you finally get results and are able to interpret them. But the statistical results just look…wrong. Backwards, or even impossible—theoretically or logically.
This happened a few times recently to a couple of my consulting clients, and once to me. So I know that feeling of panic well. There are so many possible causes of incorrect results, but there are a few steps you can take that will help you figure out which one you’ve got and how (and whether) to correct it.
Errors in Data Coding and Entry
In both of my clients’ cases, the problem was that they had coded missing data with an impossible and extreme value, like 99. But they failed to define that code as missing in SPSS. So SPSS took 99 as a real data point, which badly distorted the estimates.
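In SPSS, the fix is to declare 99 as a user-missing value for that variable. If you work in R instead, the equivalent fix is to recode that value to NA before any analysis. Here’s a minimal sketch, assuming a hypothetical data frame dat with a variable income that uses 99 as its missing-data code:

```r
# Hypothetical example: 99 was used as the missing-data code for income
dat <- data.frame(income = c(23, 41, 99, 35, 99, 52))

mean(dat$income)                    # the 99s are treated as real data and inflate the mean

dat$income[dat$income == 99] <- NA  # recode the missing-data code to NA

mean(dat$income, na.rm = TRUE)      # now the mean uses only the real values
```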
Missing Data, and multiple imputation specifically, is one area of statistics that is changing rapidly. Research is still ongoing, and each year new findings on best practices and new techniques in software appear.
The downside for researchers is that some of the recommendations missing data statisticians were making even five years ago have changed.
Remember that there are three goals of multiple imputation, or of any missing data technique: unbiased parameter estimates in the final analysis, accurate (or at least conservative) standard error estimates, and adequate power.
A new version of Amelia II, a free package for multiple imputation, was released today. Amelia II comes in two versions: one runs within R, and the other, AmeliaView, is a GUI that requires no knowledge of the R programming language. Both use the same underlying algorithms, and both require that R be installed.
At the Amelia II website, you can download Amelia II (did I mention it’s free?!), download R, get the very useful User’s Guide, join the Amelia listserv, and get information about multiple imputation.
If you want to learn more about multiple imputation, the User’s Guide is a good place to start.
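If you work in R, here is a minimal sketch of what an Amelia run looks like; the data frame mydata and its variables are hypothetical, and the User’s Guide covers the real options:

```r
# install.packages("Amelia")   # one-time install
library(Amelia)

# Hypothetical data frame with some missing values
mydata <- data.frame(
  income = c(23, NA, 41, 35, NA, 52, 48, 30, NA, 44),
  age    = c(34, 29, NA, 41, 38, 45, 52, NA, 27, 36)
)

# Create m = 5 completed ("imputed") data sets
a.out <- amelia(mydata, m = 5)

# The completed data sets live in a.out$imputations; analyze each one,
# then combine the results (e.g., with Rubin's rules)
summary(a.out)
```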
The default approach to dealing with missing data in most statistical software packages is listwise deletion: dropping any case with missing data on any variable involved anywhere in the analysis. It also goes by the names case deletion and complete case analysis.
Although this approach can be really painful (you worked hard to collect those data, only to drop them!), it does work well in some situations. By “works well,” I mean it meets three criteria:
– gives unbiased parameter estimates
– gives accurate (or at least conservative) standard error estimates
– results in adequate power.
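For reference, listwise deletion is what R (like most packages) does by default; here is a small sketch with a hypothetical data frame dat:

```r
# Hypothetical data: 2 of the 6 cases have something missing
dat <- data.frame(
  y  = c(10, 12, NA, 15, 11, 14),
  x1 = c(1, 2, 3, NA, 2, 3),
  x2 = c(5, 6, 7, 8, 9, 10)
)

# lm() drops incomplete cases by default (na.action = na.omit),
# so this model is fit on only the 4 complete cases
fit <- lm(y ~ x1 + x2, data = dat)
nobs(fit)        # 4, not 6

# The same thing, done explicitly
complete <- na.omit(dat)
nrow(complete)   # 4
```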
But listwise deletion doesn’t always meet those criteria. So over the years, a number of ad hoc approaches have been proposed to stop the bloodletting of so much data. Although each solved some problems of listwise deletion, it created others. All three have been discredited in recent years and should NOT be used. They are described below, with a rough R sketch after the list:
Pairwise Deletion: use the available data for each part of an analysis. This has been shown to produce correlations outside the −1 to 1 range and other fun statistical impossibilities.
Mean Imputation: substitute the mean of the observed values for all missing data. There are so many problems, it’s difficult to list them all, but suffice it to say, this technique never meets the above 3 criteria.
Dummy Variable: create a dummy variable that indicates whether a data point is missing, then substitute any arbitrary value for the missing data in the original variable. Use both variables in the analysis. While it does avoid the loss of power, it usually leads to biased results.
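To make these concrete, here is a rough R sketch of what each of the three looks like in code (for recognition, not recommendation), again using a small hypothetical data frame:

```r
# Hypothetical data with missing values
dat <- data.frame(
  y = c(10, 12, NA, 15, 11, 14),
  x = c(1, 2, 3, NA, 2, 3)
)

## 1. Pairwise deletion: each statistic uses whatever cases it can
cor(dat, use = "pairwise.complete.obs")   # vs. use = "complete.obs" (listwise)

## 2. Mean imputation: replace every missing value with the observed mean
dat_mean <- dat
dat_mean$x[is.na(dat_mean$x)] <- mean(dat$x, na.rm = TRUE)
# the variance of x is now understated, and its correlations are attenuated

## 3. Dummy variable adjustment: flag missingness, plug in an arbitrary value
dat_dummy <- dat
dat_dummy$x_missing <- as.numeric(is.na(dat_dummy$x))
dat_dummy$x[is.na(dat_dummy$x)] <- 0            # any arbitrary constant
fit <- lm(y ~ x + x_missing, data = dat_dummy)  # usually gives biased estimates
```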
There are a number of good techniques for dealing with missing data. Some are not hard to use, and they are now available in all major statistical software. There is no reason to keep using ad hoc techniques that create more problems than they solve.