The practice of choosing predictors for a regression model, called model building, is an area of real craft.
There are many possible strategies and approaches, and each works well in some situations. Every one of them requires making a lot of decisions along the way. As you make those decisions, one danger to look out for is overfitting: creating a model that is too complex for the data. (more…)
A key part of the output in any linear model is the ANOVA table. It has many names in different software procedures, but every regression or ANOVA model has a table with sums of squares, degrees of freedom, mean squares, and F tests. Many of us were trained to skip over this table, but (more…)
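If you want to see that table for yourself, here is a minimal sketch using Python with statsmodels and pandas. The data frame, group labels, and formula are all simulated purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated data: one categorical predictor with three groups (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "y": rng.normal(size=60),
})

# Fit a one-way ANOVA as an ordinary linear model
model = smf.ols("y ~ C(group)", data=df).fit()

# The ANOVA table: degrees of freedom, sums of squares, mean squares, F test
print(sm.stats.anova_lm(model))
```

Whatever your software calls this table, the columns are the same: each row partitions the total variability, and the F test compares mean squares.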

In your statistics class, your professor made a big deal about unequal sample sizes in one-way Analysis of Variance (ANOVA) for two reasons.
1. Because she was making you calculate everything by hand. Sums of squares require a different formula when sample sizes are unequal, but statistical software will automatically use the right one. So we're not too concerned. We're definitely using software.
2. Nice properties of ANOVA, such as the grand mean being the intercept in an effect-coded regression model, don't hold when data are unbalanced. Instead of the grand mean, you need to use a weighted mean. That's not a big deal if you're aware of it; the sketch below shows exactly what changes. (more…)
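Here is a minimal sketch of that second point, assuming Python with statsmodels and patsy's sum-to-zero (Sum) contrast for effect coding. The group sizes and means are invented to force an unbalanced design:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Unbalanced design: hypothetical group sizes 10, 20, 40
rng = np.random.default_rng(1)
df = pd.DataFrame({"group": np.repeat(["a", "b", "c"], [10, 20, 40])})
df["y"] = df["group"].map({"a": 1.0, "b": 2.0, "c": 3.0}) \
          + rng.normal(scale=0.5, size=len(df))

# Effect (sum-to-zero) coding for the factor
fit = smf.ols("y ~ C(group, Sum)", data=df).fit()

print(fit.params["Intercept"])                 # unweighted mean of the group means
print(df.groupby("group")["y"].mean().mean())  # matches the intercept
print(df["y"].mean())                          # grand mean: differs, since n's are unequal
```

With equal group sizes the last two numbers coincide, which is exactly the property that breaks when the data are unbalanced.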
When you’re model building, a key decision is which interaction terms to include. And which interactions to remove.
As a general rule, the default in regression is to leave them out. Add interactions only with a solid reason. It would seem like data fishing to simply add in all possible interactions.
And yet, that’s a common practice in most ANOVA models: put in all possible interactions and only take them out if there’s a solid reason. Even many software procedures default to creating interactions among categorical predictors.
(more…)
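As a rough sketch of those two defaults, again assuming Python with statsmodels formulas and made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two categorical predictors, no real effects
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "a": rng.choice(["low", "high"], size=200),
    "b": rng.choice(["x", "y", "z"], size=200),
    "y": rng.normal(size=200),
})

# Regression default: main effects only
additive = smf.ols("y ~ C(a) + C(b)", data=df).fit()

# ANOVA-style default: main effects plus the interaction ('*' expands to '+' and ':')
full = smf.ols("y ~ C(a) * C(b)", data=df).fit()

# The interaction costs (2-1)*(3-1) = 2 extra parameters
print(len(additive.params), len(full.params))  # 4 vs 6
```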
One of the many decisions you have to make when model building is which form each predictor variable should take. One specific version of this decision is whether to combine categories of a categorical predictor.
The greater the number of parameter estimates in a model, the more observations are needed to keep power constant (the sketch after this excerpt counts them). The parameter estimates in a linear (more…)
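To put numbers on that trade-off, here is a minimal sketch in Python with pandas; the region labels and the way they are collapsed are entirely hypothetical:

```python
import pandas as pd

# Hypothetical five-level categorical predictor
df = pd.DataFrame({"region": ["N", "S", "E", "W", "NE", "N", "S", "E"]})

# Collapse to three broader levels
collapse = {"N": "north", "NE": "north", "S": "south", "E": "coastal", "W": "coastal"}
df["region3"] = df["region"].map(collapse)

# A k-level factor needs k - 1 dummy-coded parameters
print(df["region"].nunique() - 1)   # 4 parameters before collapsing
print(df["region3"].nunique() - 1)  # 2 parameters after
```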

Multicollinearity is one of those terms in statistics that usually gets defined in one of two ways:
1. In very mathematical terms that make no sense (I mean, what is a linear combination anyway?)
2. In completely oversimplified terms, to avoid the math (it's a high correlation, right?)
So what is it really? In English?
(more…)
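Before the plain-English answer, a linear combination is easy to see in code. This is a minimal sketch assuming Python with numpy, pandas, and statsmodels' variance_inflation_factor; the predictors are simulated so that x3 is almost exactly x1 + x2:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors: x3 is (nearly) a linear combination of x1 and x2
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.05, size=100)

X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

# Enormous VIFs flag the multicollinearity among x1, x2, x3
for i in range(1, X.shape[1]):
    print(X.columns[i], variance_inflation_factor(X.values, i))
```

Notice that no pairwise correlation here gets much above roughly 0.7, yet the VIFs explode, which is why "it's a high correlation" undersells the problem.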