There are many ways to approach missing data. The most common, I believe, is to ignore it. But making no choice means that your statistical software is choosing for you.
Most of the time, your software is choosing listwise deletion. Listwise deletion may or may not be a bad choice, depending on why and how much data are missing.
Another common approach among those who are paying attention is imputation. Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values.
How do you choose that estimate? The following are common methods:
Mean imputation
Simply replace each missing value with the mean of the observed values on that variable.
It has the advantage of keeping the same mean and the same sample size, but many, many disadvantages: it shrinks the variance of the variable and distorts its correlations with other variables. Pretty much every method listed below is better than mean imputation.
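Here's a minimal sketch of mean imputation in Python using pandas; the data frame and the 'age' column are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'age' with one missing value
df = pd.DataFrame({"age": [6.0, 7.0, np.nan, 9.0, 10.0]})

# Mean imputation: replace each missing value with the observed mean
df["age_imputed"] = df["age"].fillna(df["age"].mean())

print(df)
```

The imputed column has the same mean as the observed values, but its standard deviation is smaller. That shrinkage feeds directly into the standard error problem discussed below.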
Substitution
Impute the value from a new individual who was not selected to be in the sample.
In other words, go find a new subject and use their value instead.
Hot deck imputation
A randomly chosen value from an individual in the sample who has similar values on other variables.
In other words, find all the sample subjects who are similar on other variables, then randomly choose one of their values on the missing variable.
One advantage is you are constrained to only possible values. In other words, if Age in your study is restricted to being between 5 and 10, you will always get a value between 5 and 10 this way.
Another is the random component, which adds in some variability. This is important for accurate standard errors.
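Here's one way hot deck imputation might look in Python, assuming (hypothetically) that "similar on other variables" means an exact match on a single grouping variable; the data and column names are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: 'group' defines who counts as "similar"
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [10.0, 12.0, np.nan, 20.0, np.nan, 22.0],
})

def hot_deck(df, target, by):
    """Fill missing values in `target` with a randomly chosen
    observed value from the same `by` group (the donor pool)."""
    out = df[target].copy()
    for idx in df.index[df[target].isna()]:
        donors = df.loc[(df[by] == df.at[idx, by]) & df[target].notna(), target]
        if len(donors) > 0:
            out.at[idx] = rng.choice(donors.to_numpy())
    return out

df["score_imputed"] = hot_deck(df, "score", "group")
print(df)
```

Because the donor is drawn at random from the pool, rerunning this gives different (but always observed, and therefore always possible) values.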
Cold deck imputation
A systematically chosen value from an individual who has similar values on other variables.
This is similar to Hot Deck in most ways, but removes the random variation. So for example, you may always choose the third individual in the same experimental condition and block.
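A cold deck version of the same sketch replaces the random draw with a fixed rule; the rule here (always take the first observed donor in the group) is just one hypothetical choice:

```python
import numpy as np
import pandas as pd

# Same toy data as the hot deck sketch above (hypothetical names)
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [10.0, 12.0, np.nan, 20.0, np.nan, 22.0],
})

# Cold deck: same donor pools as hot deck, but a fixed rule instead
# of a random draw -- here, the first observed value in the group
for idx in df.index[df["score"].isna()]:
    donors = df.loc[(df["group"] == df.at[idx, "group"])
                    & df["score"].notna(), "score"]
    if len(donors) > 0:
        df.at[idx, "score"] = donors.iloc[0]

print(df)
```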
Regression imputation
The predicted value obtained by regressing the missing variable on other variables.
So instead of just taking the mean, you’re taking the predicted value, based on other variables. This preserves relationships among variables involved in the imputation model, but not variability around predicted values.
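A minimal sketch of regression imputation with scikit-learn, assuming a single predictor x and hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: impute missing 'y' from observed 'x'
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 4.3, np.nan, 8.2, np.nan, 12.1],
})

# Fit the imputation model on the complete cases only
observed = df["y"].notna()
model = LinearRegression().fit(df.loc[observed, ["x"]], df.loc[observed, "y"])

# Replace each missing y with its predicted value from the regression
df.loc[~observed, "y"] = model.predict(df.loc[~observed, ["x"]])
print(df)
```

Note that every imputed value falls exactly on the regression line, which is why the variability around predicted values is lost.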
Stochastic regression imputation
The predicted value from a regression plus a random residual value.
This has all the advantages of regression imputation but adds in the advantages of the random component.
Most multiple imputation is based on some form of stochastic regression imputation.
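Extending the regression sketch above, a stochastic version adds a random draw from the estimated residual distribution (same hypothetical data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 4.3, np.nan, 8.2, np.nan, 12.1],
})

observed = df["y"].notna()
model = LinearRegression().fit(df.loc[observed, ["x"]], df.loc[observed, "y"])

# Estimate the residual standard deviation from the observed cases
residuals = df.loc[observed, "y"] - model.predict(df.loc[observed, ["x"]])
resid_sd = residuals.std(ddof=2)  # ddof=2: intercept plus one slope

# Predicted value plus a random draw from the residual distribution
n_missing = (~observed).sum()
df.loc[~observed, "y"] = (
    model.predict(df.loc[~observed, ["x"]])
    + rng.normal(0, resid_sd, size=n_missing)
)
print(df)
```

The imputed values now scatter around the regression line instead of sitting exactly on it.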
Interpolation and extrapolation
An estimated value from other observations from the same individual. It usually only works in longitudinal data.
Use caution, though. Interpolation, for example, might make more sense for a variable like height in children, one that can't go back down over time. Extrapolation means you're estimating beyond the actual range of the data, and that requires making more assumptions than you should.
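For longitudinal data, pandas can interpolate between an individual's observed time points; the heights and ages here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal data: one child's height (cm) by age
heights = pd.Series(
    [110.0, np.nan, 118.0, np.nan, 126.0],
    index=[5, 6, 7, 8, 9],  # age in years
)

# Linear interpolation between the observed time points,
# weighted by the index (age) values
print(heights.interpolate(method="index"))
```

Each gap is filled on the straight line between its neighboring observations. Extrapolating past the first or last observation would need much stronger assumptions, per the caution above.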
Single or Multiple Imputation?
There are two types of imputation: single and multiple. Usually when people talk about imputation, they mean single.
Single refers to the fact that you come up with a single estimate of the missing value, using one of the seven methods listed above.
It’s popular because it is conceptually simple and because the resulting sample has the same number of observations as the full data set.
Single imputation looks very tempting when listwise deletion eliminates a large portion of the data set.
But it has limitations.
Some imputation methods result in biased parameter estimates, such as means, correlations, and regression coefficients, unless the data are Missing Completely at Random (MCAR). The bias is often worse than with listwise deletion, the default in most software.
The extent of the bias depends on many factors, including the imputation method, the missing data mechanism, the proportion of the data that is missing, and the information available in the data set.
Moreover, all single imputation methods underestimate standard errors.
Since the imputed observations are themselves estimates, their values have corresponding random error. But when you put in that estimate as a data point, your software doesn’t know that. So it overlooks the extra source of error, resulting in too-small standard errors and too-small p-values.
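A small simulation (with made-up numbers) makes this concrete for mean imputation: the imputed data set claims the full sample size but has artificially shrunken variance, so the naive standard error comes out too small:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# 100 true values, 40 of which are then set to missing
y = pd.Series(rng.normal(50, 10, size=100))
y_miss = y.copy()
y_miss[rng.choice(100, size=40, replace=False)] = np.nan

# Mean-impute, then compute the naive standard error of the mean
y_imp = y_miss.fillna(y_miss.mean())
se_complete = y.std(ddof=1) / np.sqrt(len(y))
se_imputed = y_imp.std(ddof=1) / np.sqrt(len(y_imp))

# The imputed version claims n=100 observations, but 40 of them are
# a repeated constant, so its standard error is spuriously small
print(f"SE from full data:        {se_complete:.3f}")
print(f"SE after mean imputation: {se_imputed:.3f}")
```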
And although imputation is conceptually simple, it is difficult to do well in practice. So it's not ideal, though it might suffice in certain situations.
Multiple imputation, in contrast, comes up with multiple estimates of each missing value. Two of the methods listed above can serve as the imputation method in multiple imputation: hot deck and stochastic regression.
Because these two methods have a random component, the multiple estimates are slightly different. This re-introduces some variation that your software can incorporate in order to give your model accurate estimates of standard error.
Multiple imputation was a huge breakthrough in statistics about 20 years ago. It solves a lot of problems with missing data (though, unfortunately not all) and if done well, leads to unbiased parameter estimates and accurate standard errors.
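As a sketch of the mechanics, scikit-learn's IterativeImputer with sample_posterior=True performs a form of stochastic regression imputation, so running it several times yields multiple completed data sets. The pooling below is deliberately simplified; in practice you would fit your analysis model to each completed data set and combine the results with Rubin's full rules (packages such as statsmodels' MICE or R's mice handle this for you):

```python
import numpy as np
import pandas as pd
# IterativeImputer is still marked experimental in scikit-learn,
# so this explicit enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with missing values in 'y'
df = pd.DataFrame({
    "x": np.arange(1.0, 11.0),
    "y": [2.0, 4.1, np.nan, 8.3, 9.9, np.nan, 14.2, 15.8, np.nan, 20.1],
})

m = 5  # number of imputed data sets
means = []
for i in range(m):
    # sample_posterior=True makes each run a stochastic regression
    # imputation, so the m completed data sets differ
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    means.append(completed["y"].mean())  # the estimate of interest

# Simplified pooling: average the m estimates; Rubin's full rules
# also combine within- and between-imputation variance to get an
# honest standard error
print("pooled mean of y:", np.mean(means))
print("between-imputation variance:", np.var(means, ddof=1))
```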
Kamesh says
If the dataset is sufficiently large, can we use machine-learning-based imputation? There are standard packages available, but perhaps these algorithms are also using regression-based techniques, albeit in multiple ways.
Carolina says
Where does full information maximum likelihood fit into this discussion and how does it compare to the above missing data methods?
Karen Grace-Martin says
Carolina,
Full information maximum likelihood is an alternative to all of these imputation methods. It's generally considered as good as multiple imputation, but each has strengths and weaknesses in certain situations, so it depends on the specific context.
See: Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood
ALIZA says
Kindly tell me the procedure for interpolation and extrapolation.
Thank you.