Multicollinearity

In statistics, multicollinearity or collinearity is a situation where the predictors in a regression model are linearly dependent.

Perfect multicollinearity refers to a situation where the predictive variables have an exact linear relationship. When there is perfect collinearity, the design matrix has less than full rank, and therefore the moment matrix cannot be inverted. In this situation, the parameter estimates of the regression are not well-defined, as the system of equations has infinitely many solutions.

Imperfect multicollinearity refers to a situation where the predictive variables have a nearly exact linear relationship.

Contrary to popular belief, neither the Gauss–Markov theorem nor the more common maximum likelihood justification for ordinary least squares relies on any kind of correlation structure between dependent predictors[1][2][3] (although perfect collinearity can cause problems with some software).

There is no justification for the practice of removing collinear variables as part of regression analysis,[1][4][5][6][7] and doing so may constitute scientific misconduct. Including collinear variables does not reduce the predictive power or reliability of the model as a whole,[6] and does not reduce the accuracy of coefficient estimates.[1]

High collinearity indicates that it is exceptionally important to include all collinear variables, as excluding any will cause worse coefficient estimates, strong confounding, and downward-biased estimates of standard errors.[2]

To diagnose high collinearity in a dataset, the variance inflation factor can be used to measure how strongly each predictor variable is linearly related to the others.
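
As a minimal sketch of this diagnostic, the variance inflation factor of predictor j is commonly computed as 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors plus an intercept. The function name, the simulated data, and the noise scale are illustrative assumptions, not part of the cited sources.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # regress x_j on the rest + intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup the first two factors should come out large and the third close to 1. The values describe the data only; they are a diagnostic, not a rule for dropping variables.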

Perfect multicollinearity

Figure: A depiction of multicollinearity. In a linear regression, the true parameters are reliably estimated when the predictors $X_1$ and $X_2$ are uncorrelated (black case) but unreliably estimated when $X_1$ and $X_2$ are correlated (red case).

Perfect multicollinearity refers to a situation where the predictors are linearly dependent (one can be written as an exact linear function of the others).[8] Ordinary least squares requires inverting the matrix $X^{\mathsf{T}} X$, where

$$X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}$$

is an $N \times (k+1)$ matrix, where $N$ is the number of observations, $k$ is the number of explanatory variables, and $N \geq k+1$. If there is an exact linear relationship among the independent variables, then at least one of the columns of $X$ is a linear combination of the others, and so the rank of $X$ (and therefore of $X^{\mathsf{T}} X$) is less than $k+1$, and the matrix $X^{\mathsf{T}} X$ will not be invertible.
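
The rank argument can be checked numerically. The sketch below uses hypothetical simulated predictors in which the third is an exact linear function of the first two; the variable names and coefficients are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - x2                 # exact linear function of x1 and x2

# Design matrix with an intercept column: N x (k+1) with k = 3
X = np.column_stack([np.ones(n), x1, x2, x3])
rank_X = np.linalg.matrix_rank(X)  # less than the number of columns

# v lies in the null space of X, since 2*x1 - x2 - x3 = 0,
# so adding any multiple of v to the coefficients leaves the fit unchanged.
v = np.array([0.0, 2.0, -1.0, -1.0])
beta_a = np.array([1.0, 0.5, -0.3, 0.0])
beta_b = beta_a + 5.0 * v
same_fit = np.allclose(X @ beta_a, X @ beta_b)

print(rank_X, X.shape[1], same_fit)
```

Two different coefficient vectors producing identical fitted values is what "infinitely many solutions" means in practice: the data cannot distinguish between them.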

Resolution

Perfect collinearity is typically caused by including redundant variables in a regression. For example, a dataset may include variables for income, expenses, and savings. However, because income is equal to expenses plus savings by definition, it is incorrect to include all three variables in a regression simultaneously. Similarly, including a dummy variable for every category (e.g., summer, autumn, winter, and spring) as well as an intercept term will result in perfect collinearity. This is known as the dummy variable trap.[9]
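
The dummy variable trap can be illustrated with a small sketch. The season labels follow the example in the text; the balanced sample and the choice of "summer" as the omitted baseline category are assumptions.

```python
import numpy as np

levels = ["summer", "autumn", "winter", "spring"]
seasons = np.array(levels * 5)     # 20 hypothetical observations

# One 0/1 indicator column per category
dummies = np.column_stack([(seasons == s).astype(float) for s in levels])
intercept = np.ones(len(seasons))

# Trap: intercept plus all four dummies (the dummies sum to the intercept column)
X_trap = np.column_stack([intercept, dummies])
# Usual encoding: intercept plus dummies for all but one baseline category
X_ok = np.column_stack([intercept, dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap, X_ok.shape[1], rank_ok)
```

Both matrices span the same column space, so the two encodings describe the same model; only the second has full column rank.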

The other common cause of perfect collinearity is attempting to use ordinary least squares when working with very wide datasets (those with more variables than observations). These require more advanced data analysis techniques like Bayesian hierarchical modeling to produce meaningful results.[citation needed]

Numerical issues

Sometimes, the variables are nearly collinear. In this case, the matrix $X^{\mathsf{T}} X$ has an inverse, but it is ill-conditioned. A computer algorithm may or may not be able to compute an approximate inverse; even if it can, the resulting inverse may have large rounding errors.

The standard measure of ill-conditioning in a matrix is the condition number. It indicates whether the inversion of the matrix is numerically unstable with finite-precision numbers, that is, how sensitive the computed inverse is to small changes in the original matrix. The condition number is computed by dividing the maximum singular value of the design matrix by its minimum singular value.[10] In the context of collinear variables, the variance inflation factor is the condition number for a particular coefficient.
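
A minimal sketch of this computation, using the ratio of the largest to the smallest singular value of the design matrix. The simulated predictors and the size of the perturbation are illustrative assumptions.

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest singular value of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s.max() / s.min()

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)               # unrelated to x1
x_near = x1 + 1e-4 * rng.normal(size=n)    # nearly collinear with x1

X_indep = np.column_stack([np.ones(n), x1, x_indep])
X_near = np.column_stack([np.ones(n), x1, x_near])

kappa_indep = condition_number(X_indep)
kappa_near = condition_number(X_near)
print(kappa_indep, kappa_near)
```

The same quantity is returned by `np.linalg.cond(X)` with its default 2-norm, which can serve as a cross-check.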

Solutions

Numerical problems in estimating the regression coefficients can be solved by applying standard techniques from linear algebra to solve the equations more precisely:

  1. Standardizing predictor variables. Working with polynomial terms (e.g. $x_1$, $x_1^2$) or interaction terms (i.e., $x_1 \times x_2$) can cause multicollinearity. This is especially true when the variable in question has a limited range. Standardizing predictor variables will eliminate this special kind of multicollinearity for polynomials of up to 3rd order.[11]
  2. Using an orthogonal representation of the data.[12] Poorly written statistical software will sometimes fail to converge to a correct representation when variables are strongly correlated. However, it is still possible to rewrite the regression to use only uncorrelated variables by performing a change of basis.
    • For polynomial terms in particular, it is possible to rewrite the regression as a function of uncorrelated variables using orthogonal polynomials.
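
Both techniques in the list above can be sketched in a few lines. The example assumes a single hypothetical predictor with a limited range, a quadratic model with made-up coefficients, and a QR decomposition as the change of basis; none of these specifics come from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(10.0, 12.0, size=2000)   # limited range, far from zero

# 1. Standardizing: the raw variable is almost perfectly correlated with its square,
#    while the standardized variable is not.
raw_corr = np.corrcoef(x, x**2)[0, 1]
z = (x - x.mean()) / x.std()
std_corr = np.corrcoef(z, z**2)[0, 1]

# 2. Orthogonal representation: X = Q R with orthonormal columns in Q.
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(scale=0.1, size=x.size)
Q, R = np.linalg.qr(X)                   # reduced QR: Q is n x 3
theta = Q.T @ y                          # least squares in the orthonormal basis
beta_qr = np.linalg.solve(R, theta)      # map back to the original coefficients
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(raw_corr, std_corr)
print(beta_qr, beta_ls)
```

The QR route never forms $X^{\mathsf{T}} X$, which is one reason it tends to be numerically better behaved than solving the normal equations directly.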

Effects on coefficient estimates

In addition to causing numerical problems, imperfect collinearity makes precise estimation of the regression coefficients difficult. In other words, highly correlated predictors lead to imprecise coefficient estimates with large standard errors.
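
The growth in standard errors can be seen from the usual OLS formula, in which the coefficient covariance is $\sigma^2 (X^{\mathsf{T}} X)^{-1}$. The sketch below compares two hypothetical designs that differ only in the correlation between two predictors; the sample size, the correlation of 0.99, and the unit error variance are assumptions.

```python
import numpy as np

def coef_std_errors(X, sigma=1.0):
    """Theoretical OLS standard errors: sqrt of the diagonal of sigma^2 (X^T X)^{-1}."""
    return sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(3)
n = 1000
z1, z2 = rng.normal(size=(2, n))

def design(rho):
    """Intercept plus two predictors with population correlation rho."""
    x1 = z1
    x2 = rho * z1 + np.sqrt(1.0 - rho**2) * z2
    return np.column_stack([np.ones(n), x1, x2])

se_uncorr = coef_std_errors(design(0.0))
se_corr = coef_std_errors(design(0.99))
ratio = se_corr[1] / se_uncorr[1]     # inflation of the standard error on x1
print(se_uncorr, se_corr, ratio)
```

The ratio should be in the neighbourhood of $1/\sqrt{1 - 0.99^2} \approx 7$, the square root of the corresponding variance inflation factor.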

As an example, say that we notice Alice wears her boots whenever it is raining and that there are only puddles when it rains. Then, we cannot tell whether she wears boots to keep the rain from landing on her feet, or to keep her feet dry if she steps in a puddle.

The problem with trying to identify how much each of the two variables matters is that they are confounded with each other: our observations are explained equally well by either variable, so we do not know which one of them causes the observed correlations.

There are two ways to discover this information:

  1. Using prior information or theory. For example, if we notice Alice never steps in puddles, we can reasonably argue puddles are not why she wears boots, as she does not need the boots to avoid puddles.
  2. Collecting more data. If we observe Alice enough times, we will eventually see her on days where there are puddles but not rain (e.g. because the rain stops before she leaves home).

This confounding becomes substantially worse when researchers attempt to ignore or suppress it by excluding these variables from the regression (see § Misuse). Excluding multicollinear variables from regressions will invalidate causal inference and produce worse estimates by removing important confounders.
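
A small simulation can make the cost of exclusion concrete. It assumes a hypothetical data-generating process in which two correlated predictors each have a true coefficient of 1; the correlation, sample size, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
rho = 0.9
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # both true coefficients equal 1

# Full model keeps both collinear predictors
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Reduced model drops x2 because it is "collinear" with x1
X_reduced = np.column_stack([np.ones(n), x1])
beta_reduced, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)

print(beta_full[1], beta_reduced[1])
```

In the reduced model the coefficient on x1 absorbs most of the effect of the omitted x2 and lands near 1 + rho rather than near 1, which is the omitted-variable bias described above.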

Remedies

There are many ways to prevent multicollinearity from affecting results by planning ahead of time. However, these methods all require a researcher to decide on a procedure and analysis before data has been collected (see post hoc analysis and Multicollinearity § Misuse).

Regularized estimators

Many regression methods are naturally "robust" to multicollinearity and generally perform better than ordinary least squares regression, even when variables are independent. Regularized regression techniques such as ridge regression, LASSO, elastic net regression, or spike-and-slab regression are less sensitive to including "useless" predictors, a common cause of collinearity. These techniques can detect and remove these predictors automatically to avoid problems. Bayesian hierarchical models (provided by software like BRMS) can perform such regularization automatically, learning informative priors from the data.
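
As one illustration, ridge regression has the closed form $(X^{\mathsf{T}} X + \lambda I)^{-1} X^{\mathsf{T}} y$, which remains solvable even when $X^{\mathsf{T}} X$ is singular. The sketch below assumes no intercept, a hypothetical pair of identical predictors, and an arbitrary penalty $\lambda = 1$; in practice the penalty would be chosen by a procedure fixed in advance.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1.copy()])          # two perfectly collinear columns
y = 3.0 * x1 + 0.5 * rng.normal(size=n)       # combined true effect of 3

rank_X = np.linalg.matrix_rank(X)             # rank-deficient, so OLS is not unique
beta_ridge = ridge(X, y, lam=1.0)
print(rank_X, beta_ridge)
```

The penalty selects a single solution that splits the combined effect evenly between the two identical columns, with slight shrinkage toward zero.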

Often, problems caused by the use of frequentist estimation are misunderstood or misdiagnosed as being related to multicollinearity.[3] Researchers are often frustrated not by multicollinearity, but by their inability to incorporate relevant prior information in regressions. For example, complaints that coefficients have "wrong signs" or confidence intervals that "include unrealistic values" indicate there is important prior information that is not being incorporated into the model. When this information is available, it should be incorporated into the prior using Bayesian regression techniques.[3]

Stepwise regression (the procedure of excluding "collinear" or "insignificant" variables) is especially vulnerable to multicollinearity, and is one of the few procedures wholly invalidated by it (with any collinearity resulting in heavily biased estimates and invalidated p-values).[2]

Improved experimental design

When conducting experiments where researchers have control over the predictive variables, researchers can often avoid collinearity by choosing an optimal experimental design in consultation with a statistician.
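
One simple design with this property is a full factorial design, used here purely as an illustrative assumption (the text does not name a specific design). With three hypothetical two-level factors coded as −1 and +1, every pair of columns in the design matrix is orthogonal.

```python
import itertools
import numpy as np

# All 2^3 combinations of three two-level factors coded -1 / +1
levels = [-1.0, 1.0]
runs = np.array(list(itertools.product(levels, repeat=3)))   # 8 runs x 3 factors

X = np.column_stack([np.ones(len(runs)), runs])   # add an intercept column
gram = X.T @ X                                    # diagonal when columns are orthogonal
print(gram)
```

Because the Gram matrix is diagonal, the coefficient estimates for the factors are uncorrelated with one another by construction.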

Acceptance

While the above strategies work in some situations, estimates using advanced techniques may still produce large standard errors. In such cases, the correct response to multicollinearity is to "do nothing".[1] The scientific process often involves null or inconclusive results; not every experiment will be "successful" in the sense of decisively confirming the researcher's original hypothesis.

Edward Leamer notes that "The solution to the weak evidence problem is more and better data. Within the confines of the given data set there is nothing that can be done about weak evidence".[3] Leamer notes that "bad" regression results that are often misattributed to multicollinearity instead indicate the researcher has chosen an unrealistic prior probability (generally the flat prior used in OLS).[3]

Damodar Gujarati writes that "we should rightly accept [our data] are sometimes not very informative about parameters of interest".[1] Olivier Blanchard quips that "multicollinearity is God's will, not a problem with OLS";[7] in other words, when working with observational data, researchers cannot "fix" multicollinearity, only accept it.

Misuse

Variance inflation factors are often misused as criteria in stepwise regression (i.e. for variable inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-thumb".[2]

Excluding collinear variables leads to artificially small estimates for standard errors, but does not reduce the true (not estimated) standard errors for regression coefficients.[1] Excluding variables with a high variance inflation factor also invalidates the calculated standard errors and p-values, by turning the results of the regression into a post hoc analysis.[14]

Because collinearity leads to large standard errors and p-values, which can make publishing articles more difficult, some researchers will try to suppress inconvenient data by removing strongly correlated variables from their regression. This procedure falls into the broader categories of p-hacking, data dredging, and post hoc analysis. Dropping (useful) collinear predictors will generally worsen the accuracy of the model and coefficient estimates.

Similarly, trying many different models or estimation procedures (e.g. ordinary least squares, ridge regression, etc.) until finding one that can "deal with" the collinearity creates a forking paths problem. P-values and confidence intervals derived from post hoc analyses are invalidated by ignoring the uncertainty in the model selection procedure.

It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no effect on the outcome; for example, local cheese production should not be used to predict the height of skyscrapers. However, this must be done when first specifying the model, prior to observing any data, and potentially informative variables should always be included.

References

  1. Gujarati, Damodar (2009). "Multicollinearity: what happens if the regressors are correlated?". Basic Econometrics (4th ed.). McGraw-Hill. p. 363. ISBN 9780073375779.
  2. Kalnins, Arturs; Praitis Hill, Kendall (13 December 2023). "The VIF Score. What is it Good For? Absolutely Nothing". Organizational Research Methods. doi:10.1177/10944281231216381. ISSN 1094-4281.
  3. Leamer, Edward E. (1973). "Multicollinearity: A Bayesian Interpretation". The Review of Economics and Statistics. 55 (3): 371–380. doi:10.2307/1927962. ISSN 0034-6535. JSTOR 1927962.
  4. Giles, Dave (15 September 2011). "Econometrics Beat: Dave Giles' Blog: Micronumerosity". Econometrics Beat. Retrieved 3 September 2023.
  5. Goldberger, A.S. (1964). Econometric Theory. New York: Wiley.
  6. Goldberger, A.S. "Chapter 23.3". A Course in Econometrics. Cambridge MA: Harvard University Press.
  7. Blanchard, Olivier Jean (October 1987). "Comment". Journal of Business & Economic Statistics. 5 (4): 449–451. doi:10.1080/07350015.1987.10509611. ISSN 0735-0015.
  8. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2021). An introduction to statistical learning: with applications in R (Second ed.). New York, NY: Springer. p. 115. ISBN 978-1-0716-1418-1. Retrieved 1 November 2024.
  9. Karabiber, Fatih. "Dummy Variable Trap - What is the Dummy Variable Trap?". LearnDataSci (www.learndatasci.com). Retrieved 18 January 2024.
  10. Belsley, David (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley. ISBN 978-0-471-52889-0.
  11. "12.6 - Reducing Structural Multicollinearity | STAT 501". newonlinecourses.science.psu.edu. Retrieved 16 March 2019.
  12. "Computational Tricks with Turing (Non-Centered Parametrization and QR Decomposition)". storopoli.io. Retrieved 3 September 2023.
  13. Gelman, Andrew; Imbens, Guido (3 July 2019). "Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs". Journal of Business & Economic Statistics. 37 (3): 447–456. doi:10.1080/07350015.2017.1366909. ISSN 0735-0015.
  14. Gelman, Andrew; Loken, Eric (14 November 2013). "The garden of forking paths" (PDF). Unpublished – via Columbia.

