Missing Data

Linear Mixed Models for Missing Data in Pre-Post Studies

August 30th, 2016 by

In the past few months, I’ve gotten the same question from a few clients about using linear mixed models for repeated measures data.  They want to take advantage of its ability to give unbiased results in the presence of missing data.  In each case the study has two groups complete a pre-test and a post-test measure.  Both of these have a lot of missing data.

The research question is whether the groups have different improvements in the dependent variable from pre to post test.

As a typical example, say you have a study with 160 participants.

90 of them completed both the pre and the post test.

Another 48 completed only the pretest and 22 completed only the post-test.

Repeated Measures ANOVA will deal with the missing data through listwise deletion. That means keeping only the 90 people with complete data.  This causes problems with both power and bias, but bias is the bigger issue.

Another alternative is to use a Linear Mixed Model, which will use the full data set.  This is an advantage, but it’s not as big of an advantage in this design as in other studies.

The mixed model will retain the 70 people who have data for only one time point.  It will use the 48 people with pretest-only data along with the 90 people with full data to estimate the pretest mean.

Likewise, it will use the 22 people with posttest-only data along with the 90 people with full data to estimate the post-test mean.

If the data are missing at random, this will give you unbiased estimates of each of these means.

But most of the time in Pre-Post studies, the interest is in the change from pre to post across groups.

The difference in means from pre to post will be calculated based on the estimates at each time point.  But the degrees of freedom for the difference will be based only on the number of subjects who have data at both time points.

So with only two time points, if the people with one time point are no different from those with full data (creating no bias), you’re not gaining anything by keeping those 72 people in the analysis.

Compare this to a study I also saw in consulting with 5 time points.  Nearly all the participants had 4 out of the 5 observations.  The missing data was pretty random–some participants missed time 1, others, time 4, etc.  Only 6 people out of 150 had full data.  Listwise deletion created a nightmare, leaving only 6 people in the data set.

Each person contributed data to 4 means, so each mean had a pretty reasonable sample size.  Since the missingness was random, each mean was unbiased.  Each subject fully contributed data and df to many of the mean comparisons.

With more than 2 time points and data that are missing at random, each subject can contribute to some change measurements.  Keep that in mind the next time you design a study.

 


Linear Regression in Stata: Missing Data and the Stories they Might Tell

May 18th, 2016 by

Stage 2

In a previous post , Using the Same Sample for Different Models in Stata, we examined how to use the same sample when comparing regression models. Using different samples in our models could lead to erroneous conclusions when interpreting results.

But excluding observations can also result in inaccurate results.

The coefficient for the variable “frequent religious attendance” was negative 58 in model 3 (more…)


Multiple Imputation for Missing Data: Indicator Variables versus Categorical Variables

February 25th, 2016 by

A data set can contain indicator (dummy) variables, categorical variables and/or both. Initially, it all depends upon how the data is coded as to which variable type it is.

For example, a categorical variable like marital status could be coded in the data set as a single variable with 5 values: (more…)


Missing Data Diagnosis in Stata: Investigating Missing Data in Regression Models

January 4th, 2016 by

In the last post, we examined how to use the same sample when running a set of regression models with different predictors.

Adding a predictor with missing data causes cases that had been included in previous models to be dropped from the new model.

Using different samples in different models can lead to very different conclusions when interpreting results.

Let’s look at how to investigate the effect of the missing data on the regression models in Stata.

The coefficient for the variable “frequent religious attendance” was negative 58 in model 3 and then rose to a positive 6 in model 4 when income was included. Results (more…)


R is Not So Hard! A Tutorial, Part 20: Useful Commands for Exploring Data

December 18th, 2015 by

Sometimes when you’re learning a new stat software package, the most frustrating part is not knowing how to do very basic things. This is especially frustrating if you already know how to do them in some other software.

Let’s look at some basic but very useful commands that are available in R.

We will use the following data set of tourists from different nations, their gender and numbers of children. Copy and paste the following array into R.

A <- structure(list(NATION = structure(c(3L, 3L, 3L, 1L, 3L, 2L, 3L,
1L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 3L, 2L), .Label = c("CHINA",
"GERMANY", "FRANCE"), class = "factor"), GENDER = structure(c(1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L
), .Label = c("F", "M"), class = "factor"), CHILDREN = c(1L,
3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L
)), .Names = c("NATION", "GENDER", "CHILDREN"), row.names = 2:18, class = "data.frame")

Want to check that R read the variables correctly? We can look at the first 3 rows using the head() command, as follows:

head(A, 3)
  NATION   GENDER   CHILDREN
2 FRANCE      F        1
3 FRANCE      M        3
4 FRANCE      M        2

Now we look at the last 4 rows using the tail() command:

tail(A, 4)
    NATION   GENDER  CHILDREN
15  FRANCE      F        1
16  FRANCE      M        1
17  FRANCE      F        0
18 GERMANY      F        2

Now we find the number of rows and number of columns using nrow() and ncol().

nrow(A)
[1] 17

ncol(A)
[1] 3

So we have 17 rows (cases) and three columns (variables). These functions look very basic, but they turn out to be very useful if you want to write R-based software to analyse data sets of different dimensions.

Now let’s attach A and check for the existence of particular data.

attach(A)

As you may know, attaching a data object makes it possible to refer to any variable by name, without having to specify the data object which contains that variable.

Does the USA appear in the NATION variable? We use the any() command and put USA inside quotation marks.

any(NATION == "USA")
[1] FALSE

Clearly, we do not have any data pertaining to the USA.

What are the values of the variable NATION?

levels(NATION)
[1] "CHINA"   "GERMANY" "FRANCE"

How many non-missing observations do we have in the variable NATION?

length(NATION)
[1] 17

OK, but how many different values of NATION do we have?

length(levels(NATION))
[1] 3

We have three different values.

Do we have tourists with more than three children? We use the any() command to find out.

any(CHILDREN > 3)
[1] FALSE

None of the tourists in this data set have more than three children.

Do we have any missing data in this data set?

In R, missing data is indicated in the data set with NA.

any(is.na(A))
[1] FALSE

We have no missing data here.

Which observations involve FRANCE? We use the which() command to identify the relevant indices, counting column-wise.

which(A == "FRANCE")
[1]  1  2  3  5  7  9 10 14 15 16

How many observations involve FRANCE? We wrap the above syntax inside the length() command to perform this calculation.

length(which(A == "FRANCE"))
[1] 10

We have a total of ten such observations.

That wasn’t so hard! In our next post we will look at further analytic techniques in R.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

See our full R Tutorial Series and other blog posts regarding R programming.

 


When Does Repeated Measures ANOVA not work for Repeated Measures Data?

September 8th, 2014 by

Repeated measures ANOVA is the approach most of us learned in stats classes for repeated measures and longitudinal data. It works very well in certain designs.

But it’s limited in what it can do. Sometimes trying to fit a data set into a repeated measures ANOVA requires too much data gymnastics. (more…)