Stage 3

How to Combine Complicated Models with Tricky Effects

July 22nd, 2011 by

Need to dummy code in a Cox regression model?

Interpret interactions in a logistic regression?

Add a quadratic term to a multilevel model?

quadratic interaction plotThis is where statistical analysis starts to feel really hard. You’re combining two difficult issues into one.

You’re dealing with both a complicated modeling technique at Stage 3 (survival analysis, logistic regression, multilevel modeling) and tricky effects in the model (dummy coding, interactions, and quadratic terms).

The only way to figure it all out in a situation like that is to break it down into parts.  (more…)


Seven Ways to Make up Data: Common Methods to Imputing Missing Data

February 4th, 2009 by

There are many ways to approach missing data. The most common, I believe, is to ignore it. But making no choice means that your statistical software is choosing for you.

Most of the time, your software is choosing listwise deletion. Listwise deletion may or may not be a bad choice, depending on why and how much data are missing.

Another common approach among those who are paying attention is imputation. Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values.

How do you choose that estimate?  The following are common methods: (more…)


Proportions as Dependent Variable in Regression–Which Type of Model?

January 26th, 2009 by

When the dependent variable in a regression model is a proportion or a percentage, it can be tricky to decide on the appropriate way to model it.

The big problem with ordinary linear regression is that the model can predict values that aren’t possible–values below 0 or above 1.  But the other problem is that the relationship isn’t linear–it’s sigmoidal.  A sigmoidal curve looks like a flattened S–linear in the middle, but flattened on the ends.  So now what?

The simplest approach is to do a linear regression anyway.  This approach can be justified only in a few situations.

1. All your data fall in the middle, linear section of the curve.  This generally translates to all your data being between .2 and .8 (although I’ve heard that between .3-.7 is better).  If this holds, you don’t have to worry about the two objections.  You do have a linear relationship, and you won’t get predicted values much beyond those values–certainly not beyond 0 or 1.

2. It is a really complicated model that would be much harder to model another way.  If you can assume a linear model, it will be much easier to do, say, a complicated mixed model or a structural equation model.  If it’s just a single multiple regression, however, you should look into one of the other methods.

A second approach is to treat the proportion as a binary response then run a logistic or probit regression.  This will only work if the proportion can be thought of and you have the data for the number of successes and the total number of trials.  For example, the proportion of land area covered with a certain species of plant would be hard to think of this way, but the proportion of correct answers on a 20-answer assessment would.

The third approach is to treat it the proportion as a censored continuous variable.  The censoring means that you don’t have information below 0 or above 1.  For example, perhaps the plant would spread even more if it hadn’t run out of land.  If you take this approach, you would run the model as a two-limit tobit model (Long, 1997).  This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).

Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.

 


Poisson Regression Analysis for Count Data

December 31st, 2008 by

There are many dependent variables that no matter how many transformations you try, you cannot get to be normally distributed.  The most common culprits are count variables–the variable that measures the count or rate of some event in a sample.  Some examples I’ve seen from a variety of disciplines are:

Number of eggs in a clutch that hatch
Number of domestic violence incidents in a month
Number of times juveniles needed to be restrained during tenure at a correctional facility
Number of infected plants per transect

A common quality of these variables is that 0 is the mode–the most common value.  1 is the next most common, 2 the next, and so on.  In variables with low expected counts (number of cars in a household, number of degrees earned), (more…)