Using Adjusted Means to Interpret Moderators in Analysis of Covariance

September 24th, 2010

If you’re like most researchers, your statistical training focused on Regression or ANOVA, but not both. It all depends on whether your field focuses more on experimental data (Biology, Psychology) or observed data (Sociology, Economics). Maybe one class covered a bit of the other, but most people are comfortable with one and not the other.

This, in my opinion, is a shame. (Okay, I was going to say tragedy, but let’s be real.  Tsunami that kills thousands=tragedy.  Different scale here).

First of all, the distinction between ANOVA and linear regression is arbitrary. They’re really the same model with different outfits on.

Second, regardless of which one you normally use, you’re going to occasionally have to use the other kind of predictor variables–categorical or continuous. And we can come up with nice names for these models–a regression with dummy variables or an Analysis of Covariance.

But real understanding of the relationships among variables comes only when you dispense with the names and focus on analyzing and interpreting the model using the kinds of variables you have.

There are other examples, but today I’m going to focus on an ANOVA model with a continuous covariate.

A common model is one in which one predictor is categorical (we’ll use 4 categories) and the other is continuous. Here is an example of a scatterplot of just such a model:

Scatterplot of Ancova

There are four groups, each of which received a different training.  The continuous moderator is Age, and the outcome is OverallPost, which is the post-training test score to see how well they learned the material in each training program.

As you can see, the effect of the training program is moderated by age.  Another way to say that is there is a significant interaction between Age and Training Group.  The effect of the training depends on the trainee’s age.

One way to interpret this significant interaction is to compare the slopes of the four lines, which is easily done with any regression coefficient table.  (Okay, not always easily done, but easily found in…)

But this doesn’t make very much sense when Age is really a moderator–a predictor we want to control for, so we can see how it affects the relationship between the independent variable (IV) and the dependent variable (DV), but not the IV we’re really interested in.

A better way to do it in this situation is to compare the means among groups at a low value of Age, say 20, and again at a high value of Age, say 50.  You can get p-values, adjusted for multiple comparisons, using either SAS or SPSS GLM.

SAS Proc GLM uses the LSMeans statement and SPSS GLM uses EMMeans.  They do the same thing–calculate the mean of Y for each group, at a specific value of the covariate.

If you use the menus in SPSS, you can only get those EMMeans at the Covariate’s mean, which in this example is about 25, where the vertical black line is.  This isn’t very useful for our purposes.  But we can change the value of the covariate at which to compare the means using syntax.

So it would tell us that at a young age of say 20, the three treatment groups (green, tan, and purple lines) all have means higher than the control (blue).  Young people learned more in all three treatment groups.

But at an older age, say 50, the means of the purple and tan groups were not significantly different from the control group’s (blue), and the green  (EIQ group) did worse!
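To make the idea of comparing group means at a chosen covariate value concrete, here is a minimal Python sketch. It is not the SAS or SPSS procedure: it fits a separate regression line for each group, which gives the same predicted means as a model with Group, Age, and their interaction, and then evaluates each line at Age 20 and Age 50. The data are made up for illustration (they are not the data behind the scatterplot), the function names are hypothetical, and the sketch does not produce p-values or multiple-comparison adjustments.

```python
from statistics import mean

def fit_line(ages, scores):
    """Ordinary least-squares intercept and slope for one group."""
    age_bar, score_bar = mean(ages), mean(scores)
    sxx = sum((a - age_bar) ** 2 for a in ages)
    sxy = sum((a - age_bar) * (s - score_bar) for a, s in zip(ages, scores))
    slope = sxy / sxx
    return score_bar - slope * age_bar, slope

def adjusted_means(data, at_age):
    """Predicted mean score for each group at a chosen value of Age."""
    result = {}
    for group, (ages, scores) in data.items():
        intercept, slope = fit_line(ages, scores)
        result[group] = intercept + slope * at_age
    return result

# Made-up data, for illustration only: group -> (ages, post-test scores).
data = {
    "Control": ([20, 30, 40, 50], [50, 50, 50, 50]),
    "EIQ":     ([20, 30, 40, 50], [70, 60, 50, 40]),
}

# Compare the groups at a low and a high value of the covariate.
for age in (20, 50):
    print(age, adjusted_means(data, age))
```

In this toy data the EIQ line starts above the control line and ends below it, so the comparison between groups flips between the two ages. That is the pattern you miss if you only look at means at the covariate's average.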

In SPSS GLM, the syntax would be:



Steps to Take When Your Regression (or Other Statistical) Results Just Look…Wrong

April 19th, 2010

You’ve probably experienced this before. You’ve done a statistical analysis, you’ve figured out all the steps, you finally get results and are able to interpret them. But the statistical results just look…wrong. Backwards, or even impossible–theoretically or logically.

This happened a few times recently to a couple of my consulting clients, and once to me. So I know that feeling of panic well. There are so many possible causes of incorrect results, but there are a few steps you can take that will help you figure out which one you’ve got and how (and whether) to correct it.

Errors in Data Coding and Entry

In both of my clients’ cases, the problem was that they had coded missing data with an impossible and extreme value, like 99. But they failed to define that code as missing in SPSS. So SPSS took 99 as a real data point, which (more…)
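Here is a small Python sketch of how much damage an undeclared missing-data code can do. The ratings and the 1-to-5 scale are made up for illustration, and `MISSING_CODE` is just a hypothetical name for the 99 described above.

```python
from statistics import mean

MISSING_CODE = 99  # the "impossible" value used to mark missing responses

# Made-up ratings on a 1-5 scale, with two missing responses coded as 99.
ratings = [3, 4, 2, 99, 5, 99, 4]

# Treating 99 as a real data point drags the mean far outside the scale.
naive_mean = mean(ratings)

# Excluding the missing code first gives a mean that is actually possible.
valid_ratings = [r for r in ratings if r != MISSING_CODE]
clean_mean = mean(valid_ratings)

print("Mean with 99 treated as data:", naive_mean)
print("Mean with 99 treated as missing:", clean_mean)
```

A mean far above the top of the scale is exactly the kind of impossible result worth checking for before touching the model itself.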

Mediators, Moderators, and Suppressors: What IS the difference?

March 10th, 2010

One of the biggest questions I get is about the difference between mediators and moderators, and how they both differ from control variables.

I recently found a fabulous free video tutorial on the difference between mediators, moderators, and suppressor variables, by Jeremy Taylor at Stats Make Me Cry.   The witty example is about the different types of variables–talent, practice, etc.–that explain the relationship between having a guitar and making lots of $$.


The Distribution of Independent Variables in Regression Models

January 19th, 2010

While regression models make a number of distributional assumptions, they make none about the distribution of the predictor (i.e. independent) variables.

It’s because regression models are directional. In a correlation, there is no direction–Y and X are interchangeable. If you switched them, you’d get the same correlation coefficient.

But regression is inherently a model about the outcome variable. What predicts its value and how well? The nature of how predictors relate to it (more…)
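The contrast between correlation and regression can be sketched in a few lines of Python. The numbers are made up and the helper names are hypothetical. Swapping X and Y leaves the correlation unchanged, but the regression slope depends on which variable you treat as the outcome.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation; the formula treats x and y symmetrically."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ols_slope(predictor, outcome):
    """Least-squares slope from regressing outcome on predictor."""
    p_bar, o_bar = mean(predictor), mean(outcome)
    spo = sum((p - p_bar) * (o - o_bar) for p, o in zip(predictor, outcome))
    spp = sum((p - p_bar) ** 2 for p in predictor)
    return spo / spp

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

print("r(x, y):", pearson_r(x, y), " r(y, x):", pearson_r(y, x))
print("slope of y on x:", ols_slope(x, y))
print("slope of x on y:", ols_slope(y, x))
```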

Making Dummy Codes Easy to Keep Track of

January 14th, 2010

Here’s a little tip.

When you construct Dummy Variables, make it easy on yourself  to remember which code is which.  Heck, if you want to be really nice, make it easy for anyone else who will analyze the data or read the results.

Make the codes inherent in the Dummy variable name.

So instead of a variable named Gender with values of 1=Female and 0=Male, call the variable Female.

Instead of a set of dummy variables named MaritalStatus1 with values of 1=Married and 0=Single, along with MaritalStatus2 with values 1=Divorced and 0=Single, name the same variables Married and Divorced.

And if you’re new to dummy coding, this has the extra bonus of making the dummy coding intuitive.  It’s just a set of yes/no variables about all but one of your categories.
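As a rough illustration of this naming tip, here is a short Python sketch that builds one yes/no variable per non-reference category and names each variable after the category it flags. The function `make_dummies` and the sample data are hypothetical, not part of any particular statistics package.

```python
def make_dummies(values, reference):
    """Build one 0/1 variable per category, skipping the reference category.

    Each dummy variable is named after the category it flags, so the coding
    is readable straight from the variable name.
    """
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Made-up data, for illustration only.
marital_status = ["Married", "Single", "Divorced", "Married", "Single"]

# Single is the reference category, so it gets no dummy variable of its own.
dummies = make_dummies(marital_status, reference="Single")

for name, column in dummies.items():
    print(name, column)
```

A row with 0 on both Married and Divorced is a Single person, which is what "all but one of your categories" means in practice.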


3 Mistakes Data Analysts Make in Testing Assumptions in GLM

September 1st, 2009

I know you know it–those assumptions in your regression or ANOVA model really are important.  If they’re not met adequately, all your p-values are inaccurate, wrong, useless.

But, and this is a big one, linear models are robust to departures from those assumptions.  Meaning, they don’t have to fit exactly for p-values to be accurate, right, and useful.

You’ve probably heard both of these contradictory statements in stats classes and a million other places, and they are the kinds of statements that drive you crazy.  Right?

I mean, do statisticians make this stuff up just to torture researchers? Or just to keep you feeling stupid?

No, they really don’t.   (I promise!)  And learning how far you can push those robust assumptions isn’t so hard, with some training and a little practice.  Over the years, I’ve found a few mistakes researchers commonly make because of one, or both, of these statements:

1.  They worry too much about the assumptions and over-test them. There are some nice statistical tests to determine if your assumptions are met.  And it’s so nice having a p-value, right?  Then it’s clear what you’re supposed to do, based on that golden rule of p<.05.

The only problem is that many of these tests ignore that robustness.  They find that every distribution is non-normal and heteroskedastic.  They’re good tools, but  these hammers think every data set is a nail.  You want to use the hammer when needed, but don’t hammer everything.

2. They assume everything is robust anyway, so they don’t test anything. It’s easy to do.  And once again, it probably works out much of the time.  Except when it doesn’t.

Yes, the GLM is robust to deviations from some of the assumptions.  But not all the way, and not all the assumptions.  You do have to check them.

3. They test the wrong assumptions. Look at any two regression books and they’ll give you a different set of assumptions.

This is partially because many of these “assumptions”  need to be checked, but they’re not really model assumptions, they’re data issues.  And it’s also partially because sometimes the assumptions have been taken to their logical conclusions.  That textbook author is trying to make it more logical for you.  But sometimes that just leads you to testing the related, but wrong thing.  It works out most of the time, but not always.