Using Adjusted Means to Interpret Moderators in Analysis of Covariance

September 24th, 2010 by Karen Grace-Martin

If you’re like most researchers, your statistical training focused on regression or ANOVA, but not both. It all depends on whether your field focuses more on experimental data (biology, psychology) or observational data (sociology, economics). Maybe one class covered a bit of the other, but most people are comfortable with one and not the other.

This, in my opinion, is a shame. (Okay, I was going to say tragedy, but let’s be real. Tsunami that kills thousands = tragedy. Different scale here.)

First of all, the distinction between ANOVA and linear regression is arbitrary. They’re really the same model with different outfits on.

Second, regardless of which one you normally use, you’re occasionally going to have to use the other kind of predictor variable: categorical or continuous. And we have nice names for these models: a regression with dummy variables, or an Analysis of Covariance.

But real understanding of the relationships among variables comes only when you dispense with the names and focus on analyzing and interpreting the model using the kinds of variables you have.

There are other examples, but today I’m going to focus on an ANOVA model with a continuous covariate.

A common model is one in which one predictor is categorical (we’ll use 4 categories) and the other is continuous. Here is an example of a scatterplot of just such a model:

[Figure: Scatterplot of the ANCOVA model, OverallPost by Age, with one fitted line per training group]

There are four groups, each of which received a different training program. The continuous moderator is Age, and the outcome is OverallPost, the post-training test score measuring how well trainees learned the material in each program.

As you can see, the effect of the training program is moderated by age. Another way to say that is that there is a significant interaction between Age and Training Group. The effect of the training depends on the trainee’s age.

One way to interpret this significant interaction is to compare the slopes of the four lines, which is easily done with any regression coefficient table.  (Okay, not always easily done, but easily found in…)
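In regression terms, those four lines come from a model that looks something like this (a sketch, with the control group as the reference and D1, D2, D3 as hypothetical dummy variables for the three treatment groups):

OverallPost = b0 + b1*Age + b2*D1 + b3*D2 + b4*D3 + b5*(D1*Age) + b6*(D2*Age) + b7*(D3*Age)

Each interaction coefficient (b5, b6, b7) is the difference between that treatment group’s slope and the control group’s slope, which is why the coefficient table lets you compare the slopes directly.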

But this doesn’t make much sense when Age is really a moderator: a predictor we want to control for, and whose effect on the relationship between the independent variable (IV) and the dependent variable (DV) we want to see, but not itself the IV we’re interested in.

A better way to do it in this situation is to compare the means among groups at a low value of Age, say 20, and again at a high value of Age, say 50.  You can get p-values, adjusted for multiple comparisons, using either SAS or SPSS GLM.

SAS Proc GLM uses the LSMeans statement and SPSS GLM uses EMMeans. They do the same thing: calculate the mean of Y for each group at a specific value of the covariate.

If you use the menus in SPSS, you can only get those EMMeans at the covariate’s mean, which in this example is about 25 (where the vertical black line is in the scatterplot). That isn’t very useful for our purposes. But we can change the value of the covariate at which to compare the means using syntax.

So it would tell us that at a young age, say 20, the three treatment groups (green, tan, and purple lines) all have means higher than the control’s (blue). Young trainees learned more in all three treatment groups.

But at an older age, say 50, the means of the purple and tan groups were not significantly different from the control group’s (blue), and the green group (EIQ) did worse!

In SPSS GLM, the syntax would be:

UNIANOVA OverallPost BY group WITH NEWAGE
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(group) WITH(NEWAGE=MEAN) COMPARE ADJ(SIDAK)
/EMMEANS=TABLES(group) WITH(NEWAGE=45) COMPARE ADJ(SIDAK)
/EMMEANS=TABLES(group) WITH(NEWAGE=20) COMPARE ADJ(SIDAK)
/PRINT=PARAMETER
/CRITERIA=ALPHA(.05)
/DESIGN=NEWAGE group NEWAGE*group.
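Each EMMEANS subcommand requests the adjusted group means at a different value of the covariate (here named NEWAGE): the first at its mean, which is all the menus will give you, and the next two at ages 45 and 20. COMPARE requests pairwise comparisons among the four groups, and ADJ(SIDAK) adjusts those p-values for multiple comparisons.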



About Dummy Variables in SPSS Analysis

September 7th, 2010 by

Whenever I get email questions whose answers I think would benefit others, I like to answer them here.  I leave out the asker’s name for privacy, but this is a great question about dummy coding:

First of all, thanks for all the helpful information you provided! Thanks sincerely for all your efforts!

Actually I am here to ask a technical question. See, I have 6 locations (let’s say A, B, C, D, E, and F), and I want to see the location effect on the outcome using OLS models.

I know that if I included 5 dummy location variables (6 locations in total, with A as the reference group) in 1 block of the regression analysis, the result would be based on the comparison with the reference location.

Then what if I put 6 dummies (for example, the 1st dummy would be “1” for A location, and “0” for otherwise) in 1 block? Will it be a bug? If not, how to interpret the result?

Thanks a lot!

Great question!

If you put in a 6th dummy code for Location A, your reference group, the model will actually blow up. (Yes, that’s a technical term).

This is one of those cases of pure multicollinearity, and the model can’t be estimated uniquely.

It’s the same situation you learned about back in algebra, where you have one equation with two unknowns. The problem isn’t that it can’t be solved; the problem is that there are an infinite number of equally good solutions.

If an observation falls in Location A, the reference group, we’ve already gotten that information from the other 5 dummy variables: that observation has a 0 on all of them. So we already know its location is A. We don’t need another dummy variable to tell the model that. It’s redundant information, and so perfectly redundant that the model will choke.
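Here’s a minimal sketch of the correct setup in SPSS syntax, with five dummies and A as the reference (location and outcome are hypothetical variable names):

* Create 5 dummy variables from the string variable location (values A-F).
* Location A is the reference group, so it gets no dummy.
RECODE location ('B'=1) (ELSE=0) INTO dumB.
RECODE location ('C'=1) (ELSE=0) INTO dumC.
RECODE location ('D'=1) (ELSE=0) INTO dumD.
RECODE location ('E'=1) (ELSE=0) INTO dumE.
RECODE location ('F'=1) (ELSE=0) INTO dumF.
EXECUTE.

* OLS regression: each coefficient compares that location to Location A.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT outcome
  /METHOD=ENTER dumB dumC dumD dumE dumF.

If you entered a sixth dummy for A, every observation would satisfy dumA + dumB + dumC + dumD + dumE + dumF = 1, which exactly reproduces the intercept column. The design matrix is then singular, and SPSS will typically refuse to enter one of the redundant predictors because it fails the tolerance check.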

Dummy coding is one of the topics I get the most questions about.  It can get especially tricky to interpret when the dummy variables are also used in interactions, so I’ve created some resources that really dig in deeply.



The Distribution of Independent Variables in Regression Models

January 19th, 2010 by

While there are a number of distributional assumptions in regression models, one distribution that carries no assumptions at all is that of the predictor (i.e., independent) variables.

That’s because regression models are directional. In a correlation, there is no direction: Y and X are interchangeable. If you switched them, you’d get the same correlation coefficient.

But regression is inherently a model about the outcome variable. What predicts its value and how well? The nature of how predictors relate to it (more…)


Confusing Statistical Terms #2: Alpha and Beta

December 11th, 2009 by

Oh so many years ago I had my first insight into just how ridiculously confusing all the statistical terminology can be for novices.

I was TAing a two-semester applied statistics class for graduate students in biology.  It started with basic hypothesis testing and went on through to multiple regression.

It was a cross-listed class, meaning there were a handful of courageous (or masochistic) undergrads in it, and they were having trouble keeping (more…)


3 Situations When it Makes Sense to Categorize a Continuous Predictor in a Regression Model

July 24th, 2009 by

In many research fields, a common practice is to categorize continuous predictor variables so they work in an ANOVA. This is often done with a median split, which divides the sample into two categories: the “high” values above the median and the “low” values below it.
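For reference, here’s a minimal sketch of a median split in SPSS syntax (X is a hypothetical continuous predictor):

* Split the sample at the median of X: group 1 = below, group 2 = above.
RANK VARIABLES=X (A)
  /NTILES(2) INTO X_split.

RANK with NTILES(2) assigns each case to one of two equal-sized groups, which is exactly a median split.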

Reasons Not to Categorize a Continuous Predictor

There are many reasons why this isn’t such a good idea: (more…)


5 Ways to Increase Power in a Study

June 12th, 2009 by

To increase power (each of these shows up in the power formula sketched below):

  1. Increase alpha
  2. Conduct a one-tailed test
  3. Increase the effect size
  4. Decrease random error
  5. Increase sample size
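To see why each one works, consider the approximate power of a one-tailed, two-sample test of means with n cases per group, a standard textbook formula:

Power = Φ( |μ1 − μ2| / (σ√(2/n)) − z(1−α) )

where Φ is the standard normal CDF and z(1−α) is the critical value. Raising alpha shrinks z(1−α); a one-tailed test uses z(1−α) rather than the larger z(1−α/2); a bigger effect size |μ1 − μ2| raises the first term; less random error means a smaller σ; and a bigger n shrinks the standard error. Each of the five levers pushes the argument of Φ, and therefore power, upward.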

Sounds so simple, right?  The reality is that although these 5 ways all work (more…)