Multicollinearity can affect any regression model with more than one predictor. It occurs when two or more predictor variables overlap so much in what they measure that their effects are indistinguishable.
When the model tries to estimate their unique effects, it goes wonky (yes, that’s a technical term).
So for example, you may be interested in understanding the separate effects of altitude and temperature on the growth of a certain species of mountain tree.
Altitude and temperature are distinct concepts, but the mean temperature is so correlated with the altitude at which the tree is growing that there is no way to separate out their effects.
But it’s not always easy to tell that the wonkiness in your model comes from multicollinearity.
One popular detection method is based on the bivariate correlation between two predictor variables. If it’s above .8 (or .7 or .9 or some other high number), the rule of thumb says you have multicollinearity.
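If you want to run that quick check, here's a minimal sketch in Python (pandas and simulated data assumed; the variable names and the .8 cutoff are only illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Hypothetical predictors: temperature is almost completely determined by altitude
altitude = rng.normal(2000, 300, n)
temperature = 30 - 0.006 * altitude + rng.normal(0, 0.5, n)
rainfall = rng.normal(800, 100, n)

X = pd.DataFrame({"altitude": altitude, "temperature": temperature, "rainfall": rainfall})

# The rule-of-thumb check: flag any pair of predictors whose |r| exceeds .8
corr = X.corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 2))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > 0.8]
print(corr.round(2))
print("Pairs above .8:", pairs)
```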
And it is certainly true that a high correlation between two predictors is an indicator of multicollinearity. But there are two problems with treating this rule of thumb as a rule.
First, how high that correlation has to be before you start seeing inflated variances depends on the sample size. There is no single good cutoff number.
Second, it’s possible that while no two variables are highly correlated, three or more together are multicollinear. Weird idea, I know. But it happens.
You’ll completely miss the multicollinearity in that situation if you’re just looking at bivariate correlations.
So like a lot of things in statistics, when you’re checking for multicollinearity, you have to check multiple indicators and look for patterns among them. Sometimes just one is all it takes and sometimes you need to see patterns among a few.
Seven more ways to detect multicollinearity
1. Very high standard errors for regression coefficients
When standard errors are orders of magnitude higher than their coefficients, that’s an indicator.
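Here's a minimal simulated sketch of what that looks like (Python with statsmodels assumed; the data are made up so that two overlapping predictors get standard errors far larger than their coefficients):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly a duplicate of x1
y = x1 + x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("coefficients:   ", np.round(fit.params[1:], 2))
print("standard errors:", np.round(fit.bse[1:], 2))   # much larger than the coefficients
```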
2. The overall model is significant, but none of the coefficients are
Remember that a p-value for a coefficient tests whether the unique effect of that predictor on Y is zero. If all predictors overlap in what they measure, there is little unique effect, even if the predictors as a group have an effect on Y.
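A simulated sketch of that pattern, again assuming Python and statsmodels: with two nearly redundant predictors, the overall F-test comes out highly significant while the individual coefficient tests typically do not.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)          # nearly a copy of x1
y = 3 * x1 + 3 * x2 + rng.normal(scale=2, size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("Overall F-test p-value:", fit.f_pvalue)                  # tiny: the predictors matter as a group
print("Coefficient p-values:  ", np.round(fit.pvalues[1:], 3))  # each one can be far from significant
```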
3. Large changes in coefficients when adding predictors
If the predictors are completely independent of each other, their coefficients won’t change at all when you add or remove one. But the more they overlap, the more drastically their coefficients will change.
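To see the idea, here's a hedged sketch in Python (statsmodels, simulated data): the coefficient for x1 shifts noticeably once its overlapping partner x2 enters the model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)   # heavily overlapping predictor
y = 2 * x1 + x2 + rng.normal(size=n)

fit_one = sm.OLS(y, sm.add_constant(x1)).fit()
fit_both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(f"x1 alone:         {fit_one.params[1]:.2f}")
print(f"x1 with x2 added: {fit_both.params[1]:.2f}")   # a large shift hints at overlap
```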
4. Coefficients have signs opposite what you’d expect from theory
Be careful here, as you don't want to disregard an unexpected finding as problematic. Not all effects that run opposite to theory indicate a problem with the model. That said, it could be multicollinearity and warrants a second look at the other indicators.
5. Coefficients on different samples are wildly different
If you have a large enough sample, split the sample in half and run the model separately on each half. Wildly different coefficients in the two models could be a sign of multicollinearity.
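A minimal sketch of that split-half check (Python with statsmodels and simulated data assumed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = x1 + x2 + rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2]))

# Shuffle, split in half, and fit the same model on each half
idx = rng.permutation(n)
half1, half2 = idx[: n // 2], idx[n // 2:]
b1 = sm.OLS(y[half1], X[half1]).fit().params
b2 = sm.OLS(y[half2], X[half2]).fit().params

print("Half 1 slopes:", np.round(b1[1:], 2))
print("Half 2 slopes:", np.round(b2[1:], 2))   # wildly different estimates are a red flag
```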
6. High Variance Inflation Factor (VIF) and Low Tolerance
These two useful statistics are reciprocals of each other, so either a high VIF or a low tolerance indicates multicollinearity. VIF is a direct measure of how much the variance of a coefficient (i.e., its squared standard error) is being inflated by multicollinearity.
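In Python, statsmodels provides variance_inflation_factor; here's a sketch on the same kind of simulated altitude/temperature data used above (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
altitude = rng.normal(2000, 300, n)
temperature = 30 - 0.006 * altitude + rng.normal(0, 0.5, n)   # tracks altitude closely
rainfall = rng.normal(800, 100, n)

X = sm.add_constant(pd.DataFrame(
    {"altitude": altitude, "temperature": temperature, "rainfall": rainfall}))

# VIF for each predictor (skipping the constant); tolerance is simply 1 / VIF
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name:12s} VIF = {vif:8.1f}   tolerance = {1 / vif:.3f}")
```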
7. High Condition Indices
Condition indices are a bit strange. The basic idea is to run a Principal Components Analysis on all the predictors. If they share a lot of information, the variance of the first principal component will be much larger than the variance of the last. The condition index, the square root of the ratio of the largest to the smallest of these variances, will be high if multicollinearity is present.
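Here's a minimal numpy sketch of that calculation using the eigenvalues of the predictor correlation matrix. Note that packages differ in how they scale the predictors (some include the intercept), so the exact values won't match every program's output.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear pair
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Eigenvalues of the predictor correlation matrix, sorted largest to smallest
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

# Condition index: square root of (largest eigenvalue / each eigenvalue)
cond_index = np.sqrt(eigvals[0] / eigvals)
print(np.round(cond_index, 1))   # values above roughly 30 are the usual warning sign
```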
Jon K Peck says
Two other thoughts.
Examining the covariance or correlation matrix of the estimated coefficients is a better multicollinearity measure than the correlation matrix of the regressors, as it directly shows the sensitivity of each betahat to the other variables.
The sensitivity of each betahat to the number of regressors, across all the possible regressions as the number increases, shows the dependencies. In SPSS, the STATS RELIMP extension command shows this nicely.
Gatwech says
Nice explanation.
Thanks!
Yingcong Chen says
Hi Karen,
I have 2 categorical variables as IVs and 3 continuous variables as DVs for MANOVA, and the correlation between DVs was over .9. Could you please give me some guidance about how to solve this multicollinearity problem? (PCA did not work.)
Thanks for your help~
Karen Grace-Martin says
Multicollinearity is really only about IVs. If you have high correlation among multiple DVs, then you’re on the right track using MANOVA.
Guiping liu says
Can't agree more!
Narayan says
Hi Karen,
Your blogs on multicollinearity are very helpful in understanding the concept. This one focuses on regression problems.
However, for a classification problem with mostly categorical variables, many of these rules don't apply.
Do you have a similar set of rules for Classification?
Regards,
Narayan
Karen Grace-Martin says
I’m not sure what you mean by a classification problem. If it’s something like using logistic regression to classify individuals, it applies. If you’re talking about something like a tree model, sure, it will be different, but that’s really more about variable selection.
kassawmar says
Can we use VIF for nonlinear variables to check the multicollinearity effect? If so, how? And what are the alternative methods?
F says
If VIF values are less than 5 (some researchers suggest even 10), that implies CMB was not a problem for evaluating the structural model. Hence there is no issue with multicollinearity.
Seren says
Could you use a Chi square test to identify multicollinearity?
For instance, if a chi-square test gave a Cramer's V effect size indicating that the two variables were probably measuring the same concept (redundant), is this evidence of multicollinearity in a regression with those two variables as predictors?
Thanks very much for the stats help!
Karen Grace-Martin says
Hi Seren,
Sure, if both variables were categorical. It’s essentially the same concept as a correlation.
Seren says
Amazing, thank you!
Sahel says
Hello Karen,
The result of a chi-square test between my two independent categorical variables (binary (2) + categorical (3)) showed a p-value less than 0.05, but the Cramer's V is 0.17, which shows a weak association. Can I keep both independent variables in my logistic regression model for a binary dependent variable? Both of them are clinically important and worth publishing.
Thank you
Karen Grace-Martin says
If they’re both clinically important, absolutely.