Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point $P$ and a distribution $D$, introduced by P. C. Mahalanobis in 1936.[1] The mathematical details of Mahalanobis distance first appeared in the Journal of The Asiatic Society of Bengal in 1933.[2] Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based on measurements (the earliest work related to similarities of skulls is from 1922, and a later work is from 1927).[3][4] R. C. Bose later obtained the sampling distribution of Mahalanobis distance, under the assumption of equal dispersion.[5]

It is a multivariate generalization of the square of the standard score $z = (x - \mu)/\sigma$: how many standard deviations away $P$ is from the mean of $D$. This distance is zero for $P$ at the mean of $D$ and grows as $P$ moves away from the mean along each principal component axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless, scale-invariant, and takes into account the correlations of the data set.

Definition

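Given a probability distribution $Q$ on $\mathbb{R}^N$, with mean $\vec{\mu} = (\mu_1, \mu_2, \ldots, \mu_N)^\mathsf{T}$ and positive-definite covariance matrix $\Sigma$, the Mahalanobis distance of a point $\vec{x} = (x_1, x_2, \ldots, x_N)^\mathsf{T}$ from $Q$ is[6]

$$d_M(\vec{x}, Q) = \sqrt{(\vec{x} - \vec{\mu})^\mathsf{T} \Sigma^{-1} (\vec{x} - \vec{\mu})}.$$

Given two points $\vec{x}$ and $\vec{y}$ in $\mathbb{R}^N$, the Mahalanobis distance between them with respect to $Q$ is

$$d_M(\vec{x}, \vec{y}; Q) = \sqrt{(\vec{x} - \vec{y})^\mathsf{T} \Sigma^{-1} (\vec{x} - \vec{y})},$$

which means that $d_M(\vec{x}, Q) = d_M(\vec{x}, \vec{\mu}; Q)$.

Since $\Sigma$ is positive-definite, so is $\Sigma^{-1}$, thus the square roots are always defined. (A covariance matrix that is only positive semi-definite may be singular, in which case $\Sigma^{-1}$ does not exist.)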
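As a minimal sketch of the definition, restricted to two dimensions so that the matrix inverse can be written out by hand, the following Python snippet evaluates the quadratic form directly. The function name `mahalanobis_2d` and the example covariance are illustrative assumptions, not part of any library.

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance of a 2-D point x from a distribution with
    mean mu and 2x2 covariance matrix cov (hypothetical helper)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive-definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T cov^{-1} (x - mu).
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

mu = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))  # standard deviation 2 along x, 1 along y

d_long = mahalanobis_2d((2.0, 0.0), mu, cov)   # displaced along the wide axis
d_short = mahalanobis_2d((0.0, 2.0), mu, cov)  # displaced along the narrow axis
print(d_long, d_short)
```

Both test points lie at Euclidean distance 2 from the mean, yet the point displaced along the narrow axis is twice as far in Mahalanobis terms, because it is two standard deviations away rather than one.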

We can find useful decompositions of the squared Mahalanobis distance that help to explain some reasons for the outlyingness of multivariate observations and also provide a graphical tool for identifying outliers.[7]

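By the spectral theorem, the inverse covariance matrix $\Sigma^{-1}$ can be decomposed as $\Sigma^{-1} = W^\mathsf{T} W$ for some real $N \times N$ matrix $W$, which gives us the equivalent definition

$$d_M(\vec{x}, \vec{y}; Q) = \lVert W(\vec{x} - \vec{y}) \rVert,$$

where $\lVert \cdot \rVert$ is the Euclidean norm. That is, the Mahalanobis distance is the Euclidean distance after a whitening transformation.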

The existence of $W$ is guaranteed by the spectral theorem, but it is not unique. Different choices have different theoretical and practical advantages.[8]
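One possible choice of whitening matrix, shown here only as a sketch, comes from the Cholesky factorization $\Sigma = L L^\mathsf{T}$: taking $W = L^{-1}$ gives $W^\mathsf{T} W = \Sigma^{-1}$, and $W\vec{v}$ can be obtained by forward substitution without forming any inverse. The helper names `cholesky`, `whiten` and `mahalanobis` below are assumptions made for illustration.

```python
import math

def cholesky(a):
    """Lower-triangular L with a = L L^T, for a symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                diag = a[i][i] - s
                if diag <= 0:
                    raise ValueError("matrix is not positive-definite")
                L[i][j] = math.sqrt(diag)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def whiten(L, v):
    """Solve L z = v by forward substitution, i.e. z = W v with W = L^{-1}."""
    z = []
    for i in range(len(v)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((v[i] - s) / L[i][i])
    return z

def mahalanobis(x, y, cov):
    """Euclidean norm of the whitened difference W (x - y)."""
    z = whiten(cholesky(cov), [xi - yi for xi, yi in zip(x, y)])
    return math.sqrt(sum(zi * zi for zi in z))

cov = [[2.0, 1.0], [1.0, 2.0]]
d = mahalanobis([1.0, 1.0], [0.0, 0.0], cov)
print(d)
```

For this covariance the quadratic form $(\vec{x}-\vec{y})^\mathsf{T}\Sigma^{-1}(\vec{x}-\vec{y})$ evaluates to $2/3$, so the whitened norm should agree with $\sqrt{2/3}$ up to rounding. Other factorizations (for example the symmetric square root of $\Sigma^{-1}$) yield a different $W$ but the same distance.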

In practice, the distribution $Q$ is usually the sample distribution from a set of IID samples from an underlying unknown distribution, so $\vec{\mu}$ is the sample mean, and $\Sigma$ is the covariance matrix of the samples.

When the affine span of the samples is not the entire $\mathbb{R}^N$, the covariance matrix would not be positive-definite, which means the above definition would not work. However, in general, the Mahalanobis distance is preserved under any full-rank affine transformation of the affine span of the samples. So in case the affine span is not the entire $\mathbb{R}^N$, the samples can be first orthogonally projected to $\mathbb{R}^n$, where $n$ is the dimension of the affine span of the samples, then the Mahalanobis distance can be computed as usual.

Intuitive explanation

Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the centroid or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

However, we also need to know if the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely that the test point should not be classified as belonging to the set.

This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be $\frac{\lVert x - \mu \rVert_2}{\sigma}$, which reads: $\frac{\text{test point} - \text{sample mean}}{\text{standard deviation}}$. By plugging this into the normal distribution, we can derive the probability of the test point belonging to the set.

The drawback of the above approach was that we assumed that the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.

Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.

Normal distributions

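For a normal distribution with mean $\vec{\mu}$ and covariance matrix $\Sigma$ in any number of dimensions, the probability density of an observation $\vec{x}$ is uniquely determined by the Mahalanobis distance $d$:

$$\Pr[\vec{x}]\,d\vec{x} = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left(-\frac{(\vec{x} - \vec{\mu})^\mathsf{T} \Sigma^{-1} (\vec{x} - \vec{\mu})}{2}\right) d\vec{x} = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left(-\frac{d^2}{2}\right) d\vec{x}.$$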

Specifically, $d^2$ follows the chi-squared distribution with $n$ degrees of freedom, where $n$ is the number of dimensions of the normal distribution. If the number of dimensions is 2, for example, the probability of a particular calculated $d$ being less than some threshold $t$ is $1 - e^{-t^2/2}$. To determine a threshold to achieve a particular probability, $p$, use $t = \sqrt{-2\ln(1 - p)}$, for 2 dimensions. For number of dimensions other than 2, the cumulative chi-squared distribution should be consulted.
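The two-dimensional case can be sketched in a few lines of Python: the probability that $d < t$ is $1 - e^{-t^2/2}$, and inverting it gives $t = \sqrt{-2\ln(1-p)}$. The function names and the seeded Monte Carlo check are illustrative assumptions; the check uses a standard bivariate normal (identity covariance), for which the Mahalanobis distance from the mean equals the Euclidean norm.

```python
import math
import random

def prob_within(t):
    """P(d < t) for a 2-D normal distribution: chi-squared CDF with 2 dof at t^2."""
    return 1.0 - math.exp(-t * t / 2.0)

def threshold_for(p):
    """Mahalanobis-distance threshold t such that P(d < t) = p, in 2 dimensions."""
    return math.sqrt(-2.0 * math.log(1.0 - p))

t95 = threshold_for(0.95)

# Rough empirical check with a seeded generator (identity covariance assumed).
rng = random.Random(0)
n = 20000
inside = sum(
    1 for _ in range(n)
    if math.hypot(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) < t95
)
fraction = inside / n
print(t95, fraction)
```

The threshold for 95% coverage comes out at roughly 2.45 rather than the familiar one-dimensional 1.96, which illustrates why a cut-off chosen for one dimension should not be reused unchanged in higher dimensions.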

In a normal distribution, the region where the Mahalanobis distance is less than one (i.e. the region inside the ellipsoid at distance one) is exactly the region where the probability distribution is concave.

The Mahalanobis distance is proportional, for a normal distribution, to the square root of the negative log-likelihood (after adding a constant so the minimum is at zero).

Other forms of multivariate location and scatter

Hypothetical two-dimensional example of Mahalanobis distance with three different methods of defining the multivariate location and scatter of the data.

The sample mean and covariance matrix can be quite sensitive to outliers, so other approaches for calculating the multivariate location and scatter of data are also commonly used when calculating the Mahalanobis distance. The Minimum Covariance Determinant approach estimates multivariate location and scatter from a subset of $h$ data points that has the smallest variance-covariance matrix determinant.[9] The Minimum Volume Ellipsoid approach is similar to the Minimum Covariance Determinant approach in that it works with a subset of $h$ data points, but the Minimum Volume Ellipsoid estimates multivariate location and scatter from the ellipsoid of minimal volume that encapsulates the $h$ data points.[10] Each method varies in its definition of the distribution of the data, and therefore produces different Mahalanobis distances. The Minimum Covariance Determinant and Minimum Volume Ellipsoid approaches are more robust to samples that contain outliers, while the sample mean and covariance matrix tend to be more reliable with small and biased data sets.[11]

Relationship to normal random variables

In general, given a normal (Gaussian) random variable $X$ with variance $S = 1$ and mean $\mu = 0$, any other normal random variable $R$ (with mean $\mu_1$ and variance $S_1$) can be defined in terms of $X$ by the equation $R = \mu_1 + \sqrt{S_1}\,X$. Conversely, to recover a normalized random variable from any normal random variable, one can typically solve for $X = (R - \mu_1)/\sqrt{S_1}$. If we square both sides, and take the square-root, we will get an equation for a metric that looks a lot like the Mahalanobis distance:

$$D = \sqrt{X^2} = \sqrt{(R - \mu_1)^2 / S_1} = \sqrt{(R - \mu_1)\, S_1^{-1}\, (R - \mu_1)}.$$

The resulting magnitude is always non-negative and varies with the distance of the data from the mean, attributes that are convenient when trying to define a model for the data.

Relationship to leverage

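Mahalanobis distance is closely related to the leverage statistic, $h$, but has a different scale:

$$D^2 = (N - 1)\left(h - \tfrac{1}{N}\right),$$

where $N$ is the number of observations and $D$ is computed from the sample mean and sample covariance.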
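The relation $D^2 = (N-1)(h - 1/N)$ can be checked numerically in the simplest setting, a regression with an intercept and a single predictor, where the leverage of observation $i$ has the closed form $h_i = 1/N + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The data values below are made up for illustration, and the sketch assumes the sample variance uses the $N-1$ denominator.

```python
# Hypothetical predictor values for a simple linear regression with intercept.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(xs)

mean = sum(xs) / n
ss = sum((x - mean) ** 2 for x in xs)   # sum of squared deviations
var = ss / (n - 1)                      # sample variance (N - 1 denominator)

# Leverage of each observation (diagonal of the hat matrix in this 1-D case).
leverage = [1.0 / n + (x - mean) ** 2 / ss for x in xs]

# Squared Mahalanobis distance of each observation from the sample.
d2 = [(x - mean) ** 2 / var for x in xs]

# Right-hand side of the identity D^2 = (N - 1)(h - 1/N).
rhs = [(n - 1) * (h - 1.0 / n) for h in leverage]

for a, b in zip(d2, rhs):
    print(round(a, 6), round(b, 6))
```

The two printed columns should agree up to rounding, and the leverages sum to 2, the number of fitted parameters (intercept and slope).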

Applications

Mahalanobis distance is widely used in cluster analysis and classification techniques. It is closely related to Hotelling's T-square distribution used for multivariate statistical testing and Fisher's linear discriminant analysis that is used for supervised classification.[12]

In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal.
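The classification rule described above can be sketched for two-dimensional data as follows. The class names, sample points and helper functions are invented for illustration; each class gets its own sample mean and sample covariance, and the test point is assigned to the class with the smallest Mahalanobis distance.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance (N - 1 denominator) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(x, mu, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

def classify(x, classes):
    """Return the label whose fitted distribution is nearest to x."""
    fitted = {label: mean_and_cov(pts) for label, pts in classes.items()}
    return min(fitted, key=lambda label: mahalanobis(x, *fitted[label]))

# Two hypothetical classes: one tightly clustered, one widely spread.
classes = {
    "tight": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (-0.2, -0.1)],
    "wide": [(4.0, 0.0), (7.0, 1.0), (1.0, -1.0), (6.0, -2.0), (2.0, 2.0)],
}

label = classify((1.5, 0.0), classes)
print(label)
```

The test point is closer in Euclidean terms to the centre of the tight class (1.5 versus 2.5), but it lies many standard deviations outside that cluster and less than one standard deviation from the wide class, so the rule assigns it to the wide class.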

Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation. Mahalanobis distance is also used to determine multivariate outliers. Regression techniques can be used to determine if a specific case within a sample population is an outlier via the combination of two or more variable scores. Even for normal distributions, a point can be a multivariate outlier even if it is not a univariate outlier for any variable (consider a probability density concentrated along the line $x_1 = x_2$, for example), making Mahalanobis distance a more sensitive measure than checking dimensions individually.
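A small numerical sketch makes the last point concrete. Assume two standardized variables (zero mean, unit variance) with correlation 0.9, so the density is concentrated near the line $x_1 = x_2$; the correlation value and test points are chosen purely for illustration.

```python
import math

def mahalanobis_unit(x1, x2, rho):
    """Distance from the origin under covariance [[1, rho], [rho, 1]].

    The inverse covariance is [[1, -rho], [-rho, 1]] / (1 - rho^2)."""
    q = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / (1.0 - rho * rho)
    return math.sqrt(q)

rho = 0.9

# Both points are only 1.5 standard deviations from the mean in each coordinate.
d_along = mahalanobis_unit(1.5, 1.5, rho)    # follows the correlation
d_across = mahalanobis_unit(1.5, -1.5, rho)  # contradicts the correlation
print(d_along, d_across)
```

Neither point would be flagged by a per-variable check, but the point that goes against the correlation has a squared Mahalanobis distance of 45 (a distance of about 6.7), while the point along the line stays near 1.5.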

Mahalanobis distance has also been used in ecological niche modelling,[13][14] as the convex elliptical shape of the distances relates well to the concept of the fundamental niche.

Another example of usage is in finance, where Mahalanobis distance has been used to compute an indicator called the "turbulence index",[15] which is a statistical measure of abnormal behaviour in financial markets. An implementation of this indicator as a Web API is available online.[16]

Software implementations

Many programming languages and statistical packages, such as R, Python, etc., include implementations of Mahalanobis distance.

Language/program    Function
Julia               mahalanobis(x, y, Q)
MATLAB              mahal(x, y)
R                   mahalanobis(x, center, cov, inverted = FALSE, ...)
SciPy (Python)      mahalanobis(u, v, VI)

References

  1. "Reprint of: Mahalanobis, P.C. (1936) 'On the Generalised Distance in Statistics'". Sankhya A. 80 (1): 1–7. 2018-12-01. doi:10.1007/s13171-019-00164-5. ISSN 0976-8378.
  2. Journal and Proceedings of the Asiatic Society of Bengal, Vol. XXVI. Asiatic Society of Bengal, Calcutta. 1933.
  3. Mahalanobis, Prasanta Chandra (1922). Anthropological Observations on the Anglo-Indians of Calcutta: Analysis of Male Stature.
  4. Mahalanobis, Prasanta Chandra (1927). "Analysis of race mixture in Bengal". Journal and Proceedings of the Asiatic Society of Bengal. 23: 301–333.
  5. Science and Culture (1935-36) Vol. 1. Indian Science News Association. 1935. pp. 205–206.
  6. De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. L. (2000). "The Mahalanobis distance". Chemometrics and Intelligent Laboratory Systems. 50 (1): 1–18. doi:10.1016/s0169-7439(99)00047-7.
  7. Kim, M. G. (2000). "Multivariate outliers and decompositions of Mahalanobis distance". Communications in Statistics – Theory and Methods. 29 (7): 1511–1526. doi:10.1080/03610920008832559. S2CID 218567835.
  8. Kessy, Agnan; Lewin, Alex; Strimmer, Korbinian (2018-10-02). "Optimal Whitening and Decorrelation". The American Statistician. 72 (4): 309–314. arXiv:1512.00809. doi:10.1080/00031305.2016.1277159. ISSN 0003-1305. S2CID 55075085.
  9. Hubert, Mia; Debruyne, Michiel (2010). "Minimum covariance determinant". WIREs Computational Statistics. 2 (1): 36–43. doi:10.1002/wics.61. ISSN 1939-5108. S2CID 123086172.
  10. Van Aelst, Stefan; Rousseeuw, Peter (2009). "Minimum volume ellipsoid". Wiley Interdisciplinary Reviews: Computational Statistics. 1 (1): 71–82. doi:10.1002/wics.19. ISSN 1939-5108. S2CID 122106661.
  11. Etherington, Thomas R. (2021-05-11). "Mahalanobis distances for ecological niche modelling and outlier detection: implications of sample size, error, and bias for selecting and parameterising a multivariate location and scatter method". PeerJ. 9: e11436. doi:10.7717/peerj.11436. ISSN 2167-8359. PMC 8121071. PMID 34026369.
  12. McLachlan, Geoffrey (4 August 2004). Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons. pp. 13–. ISBN 978-0-471-69115-0.
  13. Etherington, Thomas R. (2019-04-02). "Mahalanobis distances and ecological niche modelling: correcting a chi-squared probability error". PeerJ. 7: e6678. doi:10.7717/peerj.6678. ISSN 2167-8359. PMC 6450376. PMID 30972255.
  14. Farber, Oren; Kadmon, Ronen (2003). "Assessment of alternative approaches for bioclimatic modeling with special emphasis on the Mahalanobis distance". Ecological Modelling. 160 (1–2): 115–130. doi:10.1016/S0304-3800(02)00327-7.
  15. Kritzman, M.; Li, Y. (2010). "Skulls, Financial Turbulence, and Risk Management". Financial Analysts Journal. 66 (5): 30–41. doi:10.2469/faj.v66.n5.3. S2CID 53478656.
  16. "Portfolio Optimizer". portfoliooptimizer.io/. Retrieved 2022-04-23.
