Missing Data

Seven Steps for Data Cleaning

June 20th, 2024 by

Ever consider skipping the important step of cleaning your data? It’s tempting but not a good idea. Why? It’s a bit like baking.stage 1

I like to bake. There’s nothing nicer than a rainy Sunday with no plans, and a pantry full of supplies. I have done my shopping, and now it’s time to make the cake. Ah, but the kitchen is a mess. I don’t have things in order. This is no way to start.

First, I need to clear the counter, wash the breakfast dishes, and set out my tools. I need to take stock, read the recipe, and measure out my ingredients. Then it’s time for the fun part. I’ll admit, in my rush to get started I have at times skipped this step.

(more…)


Issues in Coding Missing Values

October 11th, 2023 by

There’s no mincing words here. Missing values can cause problems for every statistician. That’s true for a lot of reasons, but it can start with simple issues of choices stage 1made when coding missing values in a data set. Here are a few examples.

Example 1: The Null License Plate

Researcher Joseph Tartaro thought it would be funny to get the following California vanity license plate: (more…)


Confusing Statistical Term #13: Missing at Random and Missing Completely at Random

November 22nd, 2022 by

Stage 2One of the important issues with missing data is the missing data mechanism. You may have heard of these: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).

The mechanism is important because it affects how much the missing data bias your results. This has a big impact on what is a reasonable approach to dealing with the missing data.  So you have to take it into account in choosing an approach.

The concepts of these mechanisms can be a bit abstract.missing data

And to top it off, two of these mechanisms have really confusing names: Missing Completely at Random and Missing at Random.

Missing Completely at Random (MCAR)

Missing Completely at Random is pretty straightforward.  What it means is what is (more…)


Best Practices for Data Preparation

October 4th, 2021 by

If you’ve been doing data analysis for long, you’ve probably had the ‘AHA’ moment where you realized statistical practice is a craft and not just a science. As with any craft, there are best practices that will save you a stage 1lot of pain and suffering and elevate the quality of your work. And yet, it’s likely that no one may have taught you these. I know I never had a class on this. (more…)


Multiple Imputation in a Nutshell

September 20th, 2021 by

Imputation as an approach to missing data has been around for decades.

stage-3

You probably learned about mean imputation in methods classes, only to be told to never do it for a variety of very good reasons. Mean imputation, in which each missing value is replaced, or imputed, with the mean of observed values of that variable, is not the only type of imputation, however. (more…)


Member Training: A (Gentle) Introduction to k-Nearest Neighbor

July 1st, 2021 by

Missing data is a common problem in data analysis. One of the successful approaches is k-Nearest Neighbor (kNN), a simple approach that leverages known information to impute unknown values with a relatively high degree of accuracy. (more…)