Multicollinearity can affect any regression model with more than one predictor. It occurs when two or more predictor variables overlap so much in what they measure that their effects are indistinguishable.
When the model tries to estimate their unique effects, it goes wonky (yes, that’s a technical term).
So for example, you may be interested in understanding the separate effects of altitude and temperature on the growth of a certain species of mountain tree.
Altitude and temperature are distinct concepts, but the mean temperature is so correlated with the altitude at which the tree is growing that there is no way to separate out their effects.
But it’s not always easy to tell that the wonkiness in your model comes from multicollinearity.
One popular detection method is based on the bivariate correlation between two predictor variables. If it’s above .8 (or .7 or .9 or some other high number), the rule of thumb says you have multicollinearity.
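If you want to run that quick check, here's a minimal sketch in Python (pandas and simulated data assumed; the variable names and the .8 cutoff are only illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Hypothetical predictors: temperature is almost completely determined by altitude
altitude = rng.normal(2000, 300, n)
temperature = 30 - 0.006 * altitude + rng.normal(0, 0.5, n)
rainfall = rng.normal(800, 100, n)

X = pd.DataFrame({"altitude": altitude, "temperature": temperature, "rainfall": rainfall})

# The rule-of-thumb check: flag any pair of predictors whose |r| exceeds .8
corr = X.corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 2))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > 0.8]
print(corr.round(2))
print("Pairs above .8:", pairs)
```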
And it is certainly true that a high correlation between two predictors is an indicator of multicollinearity. But there are two problems with treating this rule of thumb as a rule.
First, how high that correlation has to be before you start seeing inflated variances depends on the sample size. There is no single good cutoff number.
Second, it’s possible that while no two variables are highly correlated, three or more together are multicollinear. Weird idea, I know. But it happens.
You’ll completely miss the multicollinearity in that situation if you’re just looking at bivariate correlations.
So like a lot of things in statistics, when you’re checking for multicollinearity, you have to check multiple indicators and look for patterns among them. Sometimes just one is all it takes and sometimes you need to see patterns among a few.
Seven more ways to detect multicollinearity
1. Very high standard errors for regression coefficients
When standard errors are orders of magnitude higher than their coefficients, that’s an indicator.
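Here's a minimal simulated sketch of what that looks like (Python with statsmodels assumed; the data are made up so that two overlapping predictors get standard errors far larger than their coefficients):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly a duplicate of x1
y = x1 + x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("coefficients:   ", np.round(fit.params[1:], 2))
print("standard errors:", np.round(fit.bse[1:], 2))   # much larger than the coefficients
```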
2. The overall model is significant, but none of the coefficients are
Remember that a p-value for a coefficient tests whether the unique effect of that predictor on Y is zero. If all predictors overlap in what they measure, there is little unique effect, even if the predictors as a group have an effect on Y.
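A simulated sketch of that pattern, again assuming Python and statsmodels: with two nearly redundant predictors, the overall F-test comes out highly significant while the individual coefficient tests typically do not.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)          # nearly a copy of x1
y = 3 * x1 + 3 * x2 + rng.normal(scale=2, size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("Overall F-test p-value:", fit.f_pvalue)                  # tiny: the predictors matter as a group
print("Coefficient p-values:  ", np.round(fit.pvalues[1:], 3))  # each one can be far from significant
```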
3. Large changes in coefficients when adding predictors
If the predictors are completely independent of each other, their coefficients won’t change at all when you add or remove one. But the more they overlap, the more drastically their coefficients will change.
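To see the idea, here's a hedged sketch in Python (statsmodels, simulated data): the coefficient for x1 shifts noticeably once its overlapping partner x2 enters the model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)   # heavily overlapping predictor
y = 2 * x1 + x2 + rng.normal(size=n)

fit_one = sm.OLS(y, sm.add_constant(x1)).fit()
fit_both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(f"x1 alone:         {fit_one.params[1]:.2f}")
print(f"x1 with x2 added: {fit_both.params[1]:.2f}")   # a large shift hints at overlap
```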
4. Coefficients have signs opposite what you’d expect from theory
Be careful here, as you don't want to disregard an unexpected finding as problematic. Not all effects that run opposite to theory indicate a problem with the model. That said, it could be multicollinearity and warrants a second look at the other indicators.
5. Coefficients on different samples are wildly different
If you have a large enough sample, split the sample in half and run the model separately on each half. Wildly different coefficients in the two models could be a sign of multicollinearity.
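A minimal sketch of that split-half check (Python with statsmodels and simulated data assumed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = x1 + x2 + rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2]))

# Shuffle, split in half, and fit the same model on each half
idx = rng.permutation(n)
half1, half2 = idx[: n // 2], idx[n // 2:]
b1 = sm.OLS(y[half1], X[half1]).fit().params
b2 = sm.OLS(y[half2], X[half2]).fit().params

print("Half 1 slopes:", np.round(b1[1:], 2))
print("Half 2 slopes:", np.round(b2[1:], 2))   # wildly different estimates are a red flag
```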
6. High Variance Inflation Factor (VIF) and Low Tolerance
These two useful statistics are reciprocals of each other, so either a high VIF or a low tolerance indicates multicollinearity. VIF is a direct measure of how much the variance of a coefficient (i.e., its squared standard error) is being inflated by multicollinearity.
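In Python, statsmodels provides variance_inflation_factor; here's a sketch on the same kind of simulated altitude/temperature data used above (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
altitude = rng.normal(2000, 300, n)
temperature = 30 - 0.006 * altitude + rng.normal(0, 0.5, n)   # tracks altitude closely
rainfall = rng.normal(800, 100, n)

X = sm.add_constant(pd.DataFrame(
    {"altitude": altitude, "temperature": temperature, "rainfall": rainfall}))

# VIF for each predictor (skipping the constant); tolerance is simply 1 / VIF
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name:12s} VIF = {vif:8.1f}   tolerance = {1 / vif:.3f}")
```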
7. High Condition Indices
Condition indices are a bit strange. The basic idea is to run a Principal Components Analysis on all the predictors. If they share a lot of information, the variance of the first principal component will be much larger than the variance of the last. The condition index, the square root of the ratio of the largest to the smallest of these variances, will be high if multicollinearity is present.
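Here's a minimal numpy sketch of that calculation using the eigenvalues of the predictor correlation matrix. Note that packages differ in how they scale the predictors (some include the intercept), so the exact values won't match every program's output.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear pair
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Eigenvalues of the predictor correlation matrix, sorted largest to smallest
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

# Condition index: square root of (largest eigenvalue / each eigenvalue)
cond_index = np.sqrt(eigvals[0] / eigvals)
print(np.round(cond_index, 1))   # values above roughly 30 are the usual warning sign
```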
Jon K Peck says
Two other thoughts.
Examining the covariance or correlation matrix of the estimated coefficients is a better multicollinearity measure than the correlation matrix of the regressors, as it directly shows the sensitivity of each betahat to the other variables.
The sensitivity of each betahat to the number of regressors, across all the possible regressions as the number increases, shows the dependencies. In SPSS, the STATS RELIMP extension command shows this nicely.
Gatwech says
Nice explanation.
Thanks!
Yingcong Chen says
Hi Karen,
I have 2 categorical variables as IVs and 3 continuous variables as DVs for MANOVA, and the correlation between DVs was over .9. Could you please give me some guidance about how to solve this multicollinearity problem? (PCA did not work.)
Thanks for your help~
Karen Grace-Martin says
Multicollinearity is really only about IVs. If you have high correlation among multiple DVs, then you’re on the right track using MANOVA.
Guiping liu says
Can't agree more!
Narayan says
Hi Karen,
Your blogs on multicollinearity are very helpful in understanding the concept. This one focuses on regression problems.
However, for a classification problem with mostly categorical variables, many of these rules don't apply.
Do you have a similar set of rules for Classification?
Regards,
Narayan
Karen Grace-Martin says
I’m not sure what you mean by a classification problem. If it’s something like using logistic regression to classify individuals, it applies. If you’re talking about something like a tree model, sure, it will be different, but that’s really more about variable selection.
kassawmar says
Can we use VIF for nonlinear variables to check the multicollinearity effect? If so, how? And what are the alternative methods?
F says
If VIF values are less than 5 (some researchers suggest even 10), that implies CMB was not a problem for evaluating the structural model. Hence there is no issue with multicollinearity.
Seren says
Could you use a Chi square test to identify multicollinearity?
For instance, if a chi-square test gave a Cramer's V effect size indicating that the two variables were probably measuring the same concept (redundant), is this evidence of multicollinearity in a regression with those two variables as predictors?
Thanks very much for the stats help!
Karen Grace-Martin says
Hi Seren,
Sure, if both variables were categorical. It’s essentially the same concept as a correlation.
Seren says
Amazing, thank you!
Sahel says
Hello Karen,
The result of a chi-square test between my two independent categorical variables (binary (2) + categorical (3)) showed a p-value less than 0.05, but the Cramer's V is 0.17, which shows a weak association. Can I keep both independent variables in my logistic regression model for a binary dependent variable? Both of them are clinically important and worth publishing.
Thank you
Karen Grace-Martin says
If they’re both clinically important, absolutely.