How to Diagnose the Missing Data Mechanism

One important consideration in choosing a missing data approach is the missing data mechanism—different approaches have different assumptions about the mechanism.

Each of the three mechanisms describes one possible relationship between the propensity of data to be missing and values of the data, both missing and observed.

The Missing Data Mechanisms

Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing. Those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data.

Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, but within each sex missingness is unrelated to weight itself, weight is MAR.

Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and the value itself. This is the case when the people with the lowest education are missing on education, or when the sickest people are most likely to drop out of the study.

MNAR is called “non-ignorable” because the missing data mechanism itself has to be modeled as you deal with the missing data. You have to include some model for why the data are missing and what the likely values are.

“Missing Completely at Random” and “Missing at Random” are both considered “ignorable” because we don’t have to include any information about the missing data itself when we deal with the missing data.

Why You Need to Know the Mechanism You Have

Multiple imputation and maximum likelihood assume the data are at least missing at random. So the important distinction here is whether the data are MAR as opposed to MNAR.

Listwise deletion, however, requires that the data be MCAR in order not to introduce bias in the results.

As long as the distribution and percentage of missing data are not so great that they negatively affect power, listwise deletion can be a good choice for MCAR missing data. So the important distinction here is whether the data are MCAR as opposed to MAR.
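
To make the three mechanisms and their effect on listwise deletion concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the weights, the missingness probabilities, and the helper names (simulate, p_missing, apply_mechanism) are assumptions, not part of any package. It deletes weight values under each mechanism and compares the mean of the remaining complete cases to the true mean.

```python
import random
from statistics import mean

def simulate(n=20000, seed=0):
    """Return rows of (sex, true_weight). All values are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sex = rng.choice(["F", "M"])
        weight = rng.gauss(70 if sex == "F" else 85, 12)
        rows.append((sex, weight))
    return rows

def p_missing(mechanism, sex, weight):
    """Probability that weight goes unreported under each mechanism."""
    if mechanism == "MCAR":   # unrelated to any value
        return 0.3
    if mechanism == "MAR":    # depends only on the observed variable, sex
        return 0.5 if sex == "F" else 0.1
    if mechanism == "MNAR":   # depends on the unreported weight itself
        return 0.5 if weight > 80 else 0.1
    raise ValueError(mechanism)

def apply_mechanism(rows, mechanism, seed=1):
    """Replace weight with None wherever the mechanism deletes it."""
    rng = random.Random(seed)
    return [
        (sex, None if rng.random() < p_missing(mechanism, sex, weight) else weight)
        for sex, weight in rows
    ]

def mean_by_sex(rows, sex):
    """Mean of the non-missing weights for one sex."""
    return mean(w for s, w in rows if s == sex and w is not None)

rows = simulate()
true_overall = mean(w for _, w in rows)
true_f = mean_by_sex(rows, "F")

results = {}
for mech in ("MCAR", "MAR", "MNAR"):
    observed = apply_mechanism(rows, mech)
    results[mech] = {
        # Listwise deletion: keep only the cases with a reported weight.
        "overall": mean(w for _, w in observed if w is not None),
        "F": mean_by_sex(observed, "F"),
        "M": mean_by_sex(observed, "M"),
    }

print(f"true overall mean: {true_overall:.1f}, true mean for women: {true_f:.1f}")
for mech, r in results.items():
    print(f"{mech}: overall {r['overall']:.1f}, women {r['F']:.1f}, men {r['M']:.1f}")
```

In this setup, the complete-case mean under MCAR should land close to the true mean. Under MAR the overall complete-case mean drifts, because women are underrepresented among the reported weights, but the mean within each sex stays close to the truth. That is the information multiple imputation and maximum likelihood rely on. Under MNAR even the within-sex means are off, and nothing in the observed data reveals it.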

Keep in mind that in most data sets, more than one variable will have missing data, and they may not all have the same mechanism. It’s worthwhile diagnosing the mechanism for each variable with missing data before choosing an approach.

I use the term diagnosing rather than testing, because you’re not going to get a straight answer without knowing the values of the missing data. Of course, if you knew those, you wouldn’t be doing any of this.

It’s like checking for multicollinearity or testing assumptions. Each piece of information tells you something, but there is no definitive answer.

You have to get at the mechanism in a number of ways and then decide if making the assumption about the mechanism is reasonable.

Diagnosing the Mechanism

1. MAR vs. MNAR

The only true way to distinguish between MNAR and MAR is to measure some of that missing data. It’s a common practice among professional surveyors to, for example, follow up on a paper survey with phone calls to a group of the non-respondents and ask a few key survey items. This allows you to compare respondents to non-respondents.

If their responses on those key items differ by very much, that’s good evidence that the data are MNAR.

However, in most missing data situations, we don’t have the luxury of getting hold of the missing data. So while we can’t test it directly, we can examine patterns in the data to get an idea of what’s the most likely mechanism.

The first thing in diagnosing randomness of the missing data is to use your substantive scientific knowledge of the data and your field. The more sensitive the issue, the less likely people are to tell you. They’re not going to tell you as much about their cocaine usage as they are about their phone usage.

Likewise, many fields have typical research situations in which non-ignorable missing data are common. Educate yourself in your field’s literature.

2. MCAR vs. MAR

There is a very useful test for MCAR, Little’s test. But like all tests of assumptions, it’s not definitive. So run it, but use it as only one piece of information.

A second technique is to create dummy variables for whether a variable is missing.

1 = missing
0 = observed

You can then run t-tests and chi-square tests between this variable and other variables in the data set to see if the missingness on this variable is related to the values of other variables.

For example, if women really are less likely to tell you their weight than men, a chi-square test will tell you that the percentage of missing data on the weight variable is higher for women than men.

The SPSS Missing Data module has a very nice procedure for doing this automatically, so you don’t have to create all those dummy variables. I don’t know of other software packages having this built in, but it’s not hard to program.
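
As a rough illustration of programming the dummy variable technique yourself, here is a sketch in plain Python. The data are simulated, and the field names (sex, age, weight, weight_missing) and the helpers chi_square_2x2 and welch_t are hypothetical, written just for this example. The t-test p-value uses a large-sample normal approximation, which is an assumption. In real work you would more likely use your statistical package's own chi-square and t-test routines.

```python
import math
import random
from statistics import mean, variance

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). The p-value uses the df = 1 identity
    P(X > x) = erfc(sqrt(x / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, r in enumerate(row_totals):
        for j, col in enumerate(col_totals):
            expected = r * col / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))

def welch_t(x, y):
    """Welch t statistic with a large-sample normal p-value (an approximation)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    t = (mean(x) - mean(y)) / se
    return t, math.erfc(abs(t) / math.sqrt(2))

# Made-up data: women withhold weight more often; age plays no role.
rng = random.Random(42)
people = []
for _ in range(2000):
    sex = rng.choice(["F", "M"])
    age = rng.gauss(45, 12)
    p_miss = 0.4 if sex == "F" else 0.15
    weight = None if rng.random() < p_miss else rng.gauss(75, 12)
    people.append({"sex": sex, "age": age, "weight": weight})

# Dummy variable: 1 = missing, 0 = observed.
for p in people:
    p["weight_missing"] = 1 if p["weight"] is None else 0

# Categorical variable vs. missingness: chi-square on the 2x2 table
# (rows: F, M; columns: missing, observed).
table = [
    [sum(1 for p in people if p["sex"] == s and p["weight_missing"] == m)
     for m in (1, 0)]
    for s in ("F", "M")
]
chi2, chi2_p = chi_square_2x2(table)

# Continuous variable vs. missingness: compare mean age across the dummy.
age_missing = [p["age"] for p in people if p["weight_missing"] == 1]
age_observed = [p["age"] for p in people if p["weight_missing"] == 0]
t_stat, t_p = welch_t(age_missing, age_observed)

print(f"sex vs. missingness: chi2 = {chi2:.1f}, p = {chi2_p:.3g}")
print(f"age vs. missingness: t = {t_stat:.2f}, p = {t_p:.3g}")
```

A small p-value for sex says missingness on weight is related to an observed variable, which is evidence against MCAR. It says nothing about whether missingness also depends on the unreported weights themselves, so this check can't rule out MNAR.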

 

Comments

  1. Syn says

    Hi Karen

    Thanks for your informative post. I wonder if there is any source that discusses how missing values of different types might affect different estimation methods’ performance, especially in terms of asymptotic and finite sample properties? Or any specific statistical term I need, to search in the literature? (I am more interested in panel data estimation methods _ data with variation across cohorts and panels)
    Best
    Syn

  2. Kushi says

    Hi Karen
    Thank you for this wonderful platform to learn statistics. I am in a beginning stage of my data analysis and I shall be thankful if you could solve my queries:
    1. I have a big dataset with 7 variables and each variables consists of 5 items. My query is do I have to test little MCAR with each variables (5 items) separately and accordingly delete or impute missing values? or all the 35 items together?
    2. If suppose 2 of my variables have MCAR but other 5 diagnosed to be MAR. In that case what statistical technique should I use? Should I use separately for each variable or a common statistic should be applied for all the variables?

    Thank you in advance.

  3. hlu says

    Maybe I’m missing something here, but doesn’t the dummy variable technique only tell you if the mechanism is MAR or not MAR? If you have a strongly MNAR mechanism where the missingness depends solely on that feature (i.e., heavier people are less likely to report their weight). Wouldn’t that also show this dummy variable as unrelated to other variables too? Is there a way then to distinguish in this case MCAR from MNAR?

  4. Martin says

    I am using a data which is drawn from Compustat North America. Here I have some missing values. Is this always MCAR? The Little’s test suggests it is not, but rationally thinking: there is, as far to my knowledge, nothing systematic going on that makes some data more likely to be missing than others. It is just that the database is not complete.

    Am I right?

    • Karen Grace-Martin says

      Hi Martin,

      I don’t know about that data set. You can’t assume that any missing data are MCAR. You have to dig in a bit to understand why it’s missing. You may not always have that information if it’s someone else’s data so you should assume the least.

  5. DJ says

    Dear Karen,

    Thanks for these excellent pages on missing data and multiple imputation. This is all very new to me and I’m finding the statistical literature gets quite heavy, quite quickly, making it difficult to follow for any novice.

    I’ve played around a little bit with MI packages in R (such as Amelia II) so I’m fairly comfortable with creating the imputed datasets. But my biggest problem at the moment is in understanding whether or not this approach is appropriate for my data.

    I’m analysing regional records that I have acquired from a central government office. A small number of regions were not able (for unknown reasons) to return the requested data when the central office surveyed each region in 2012, as a consequence there is missing data for a small amount of government regions. There does not appear to be any systematic reason for each region not returning this data. Regions with missing data include a mix of affluent urban regions, deprived urban regions, as well as, affluent rural regions and deprived rural areas. The trouble I’m having is that I’m not sure these observations are sufficient to conclude that missingness is MAR as opposed to MNAR. If not, what does sufficient evidence to make this conclusion look like?

    This seems like an extremely important distinction for anyone considering MI techniques, yet a consideration that is not discussed in any great depth in the literature (at least not in a way I can understand). Most textbooks and online demonstrations warn against performing MI on missing data that is MNAR, but say little about how to judge this.

    In your post above you have offered one of the best explanations I have seen yet, but I was wondering if you could elaborate on this at all?

    Best wishes

    DJ

