Mixture distribution

In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. The underlying random variables may be random real numbers, or they may be random vectors (each having the same dimension), in which case the mixture distribution is a multivariate distribution.
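
The two-step construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name sample_mixture, the two normal components and the weights 0.3 and 0.7 are hypothetical choices, not part of any standard definition.

```python
import random


def sample_mixture(weights, samplers, rng):
    """Draw one value from a finite mixture (hypothetical helper)."""
    # Step 1: select a component at random using the selection probabilities.
    (index,) = rng.choices(range(len(samplers)), weights=weights, k=1)
    # Step 2: realize the value of the selected random variable.
    return samplers[index](rng)


rng = random.Random(0)  # fixed seed so the run is reproducible
weights = [0.3, 0.7]    # assumed selection probabilities
samplers = [
    lambda r: r.gauss(0.0, 1.0),   # component 1: normal, mean 0, sd 1
    lambda r: r.gauss(10.0, 1.0),  # component 2: normal, mean 10, sd 1
]

draws = [sample_mixture(weights, samplers, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
# Share of draws lying closer to the first component's mean than to the second's.
share_near_first = sum(d < 5.0 for d in draws) / len(draws)
```

With these assumed components the sample mean should lie near 0.3 × 0 + 0.7 × 10 = 7, and roughly 30% of the draws should fall near the first component.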

In cases where each of the underlying random variables is continuous, the outcome variable will also be continuous and its probability density function is sometimes referred to as a mixture density. The cumulative distribution function (and the probability density function if it exists) can be expressed as a convex combination (i.e. a weighted sum, with non-negative weights that sum to 1) of other distribution functions and density functions. The individual distributions that are combined to form the mixture distribution are called the mixture components, and the probabilities (or weights) associated with each component are called the mixture weights. The number of components in a mixture distribution is often restricted to being finite, although in some cases the components may be countably infinite in number. More general cases (i.e. an uncountable set of component distributions), as well as the countable case, are treated under the title of compound distributions.

A distinction needs to be made between a random variable whose distribution function or density is a weighted sum of a set of components (i.e. a mixture distribution) and a random variable whose value is the sum of the values of two or more underlying random variables, in which case the distribution is given by the convolution operator (for independent variables). As an example, the sum of two jointly normally distributed random variables, each with different means, will still have a normal distribution. On the other hand, a mixture density created as a mixture of two normal distributions with different means will have two peaks provided that the two means are far enough apart, showing that this distribution is radically different from a normal distribution.
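
The contrast between mixing and summing can be checked numerically. The sketch below assumes two independent normal variables with means −3 and 3 and unit standard deviation (independence is a special case of joint normality, chosen here for simplicity); the function names are illustrative only.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x):
    """Equal-weight mixture of N(-3, 1) and N(3, 1): a weighted sum of densities."""
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)


def sum_pdf(x):
    """Density of X1 + X2 for independent X1 ~ N(-3, 1) and X2 ~ N(3, 1).

    The sum is normal with mean 0 and variance 1 + 1 = 2.
    """
    return normal_pdf(x, 0.0, math.sqrt(2.0))


# The mixture dips at 0 and peaks near the component means.
mixture_dips_at_zero = mixture_pdf(0.0) < mixture_pdf(3.0)
# The sum has a single peak at 0.
sum_peaks_at_zero = sum_pdf(0.0) > sum_pdf(3.0)
```

Under these assumptions the mixture density has a trough at 0 between two peaks, while the density of the sum has its single peak at 0.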

Mixture distributions arise in many contexts in the literature, most naturally where a statistical population contains two or more subpopulations. They are also sometimes used as a means of representing non-normal distributions. Data analysis concerning statistical models involving mixture distributions is discussed under the title of mixture models, while the present article concentrates on simple probabilistic and statistical properties of mixture distributions and how these relate to properties of the underlying distributions.

Finite and countable mixtures

Density of a mixture of three normal distributions (μ = 5, 10, 15, σ = 2) with equal weights. Each component is shown as a weighted density (each integrating to 1/3)

Given a finite set of probability density functions p_1(x), ..., p_n(x), or corresponding cumulative distribution functions P_1(x), ..., P_n(x), and weights w_1, ..., w_n such that w_i ≥ 0 and Σ w_i = 1, the mixture distribution can be represented by writing either the density, f, or the distribution function, F, as a sum (which in both cases is a convex combination):

F(x) = \sum_{i=1}^{n} w_i P_i(x),

f(x) = \sum_{i=1}^{n} w_i p_i(x).

This type of mixture, being a finite sum, is called a finite mixture, and in applications, an unqualified reference to a "mixture density" usually means a finite mixture. The case of a countably infinite set of components is covered formally by allowing n = ∞.
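
As an illustration, the density and distribution function of a finite mixture can be evaluated directly as convex combinations. The sketch below uses the three normal components from the figure caption (means 5, 10 and 15, standard deviation 2, equal weights); the function names are hypothetical.

```python
import math

MEANS = [5.0, 10.0, 15.0]
SDS = [2.0, 2.0, 2.0]
WEIGHTS = [1 / 3, 1 / 3, 1 / 3]  # non-negative and summing to 1


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mixture_pdf(x):
    """f(x) as the weighted sum of the component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(WEIGHTS, MEANS, SDS))


def mixture_cdf(x):
    """F(x) as the weighted sum of the component distribution functions."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(WEIGHTS, MEANS, SDS))


# Midpoint-rule check that the mixture density integrates to 1 over [-20, 40].
step = 0.01
area = sum(mixture_pdf(-20.0 + (i + 0.5) * step) for i in range(6000)) * step
```

Because the components in this example are placed symmetrically about 10, the mixture distribution function equals 1/2 at x = 10.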

Uncountable mixtures

Where the set of component distributions is uncountable, the result is often called a compound probability distribution. The construction of such distributions has a formal similarity to that of mixture distributions, with either infinite summations or integrals replacing the finite summations used for finite mixtures.

Consider a probability density function p(x;a) for a variable x, parameterized by a. That is, for each value of a in some set A, p(x;a) is a probability density function with respect to x. Given a probability density function w on A (meaning that w is nonnegative and integrates to 1), the function

f(x) = \int_A w(a) \, p(x;a) \, da

is again a probability density function for x. A similar integral can be written for the cumulative distribution function. Note that the formulae here reduce to the case of a finite or infinite mixture if the density w is allowed to be a generalized function representing the "derivative" of the cumulative distribution function of a discrete distribution.
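
A numerical sketch of such an integral is given below. The choices are assumptions made purely for illustration: p(x;a) is taken to be an exponential density with rate a, and w is taken to be a gamma density with shape 3 and rate 2. For this particular pair the integral is known in closed form (a Lomax density), which gives something to compare against. All function names are hypothetical.

```python
import math


def mixing_density(a, shape, rate):
    """Gamma density w(a) on a > 0 (illustrative choice of w)."""
    return rate**shape * a ** (shape - 1) * math.exp(-rate * a) / math.gamma(shape)


def component_density(x, a):
    """Exponential density p(x; a) = a * exp(-a * x) for x >= 0."""
    return a * math.exp(-a * x)


def compound_density(x, shape, rate, upper=40.0, steps=40000):
    """Approximate f(x), the integral of w(a) p(x; a) over a, by the midpoint rule.

    The integral over (0, infinity) is truncated at `upper`, where the
    integrand is negligible for the parameters used here.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        a = (i + 0.5) * h
        total += mixing_density(a, shape, rate) * component_density(x, a)
    return total * h


def lomax_density(x, shape, rate):
    """Closed form of the same integral: shape * rate^shape / (rate + x)^(shape + 1)."""
    return shape * rate**shape / (rate + x) ** (shape + 1)


numeric = compound_density(1.0, 3.0, 2.0)
exact = lomax_density(1.0, 3.0, 2.0)
```

The numerical and closed-form values agree closely, which illustrates that integrating the parameterized density against w yields another density for x.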

Mixtures within a parametric family

The mixture components are often not arbitrary probability distributions, but instead are members of a parametric family (such as normal distributions), with different values for a parameter or parameters. In such cases, assuming that it exists, the density can be written in the form of a sum as:

f(x; a_1, \ldots, a_n) = \sum_{i=1}^{n} w_i \, p(x; a_i)

for one parameter, or

f(x; a_1, \ldots, a_n, b_1, \ldots, b_n) = \sum_{i=1}^{n} w_i \, p(x; a_i, b_i)

for two parameters, and so forth.

Properties

Convexity

A general linear combination of probability density functions is not necessarily a probability density, since it may be negative or it may integrate to something other than 1. However, a convex combination of probability density functions preserves both of these properties (non-negativity and integrating to 1), and thus mixture densities are themselves probability density functions.

Moments

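Let X_1, ..., X_n denote random variables from the n component distributions, and let X denote a random variable from the mixture distribution. Then, for any function H(·) for which E[H(X_i)] exists, and assuming that the component densities p_i(x) exist,

E[H(X)] = \int_{-\infty}^{\infty} H(x) \sum_{i=1}^{n} w_i p_i(x) \, dx = \sum_{i=1}^{n} w_i \int_{-\infty}^{\infty} p_i(x) H(x) \, dx = \sum_{i=1}^{n} w_i \, E[H(X_i)].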

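The jth moment about zero (i.e. choosing H(x) = x^j) is simply a weighted average of the jth moments of the components. Moments about the mean, H(x) = (x − μ)^j, involve a binomial expansion:[1]

E[(X - \mu)^j] = \sum_{i=1}^{n} w_i \, E[(X_i - \mu_i + \mu_i - \mu)^j] = \sum_{i=1}^{n} w_i \sum_{k=0}^{j} \binom{j}{k} (\mu_i - \mu)^{j-k} \, E[(X_i - \mu_i)^k],

where μ_i denotes the mean of the ith component.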

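In the case of a mixture of one-dimensional distributions with weights w_i, means μ_i and variances σ_i^2, the total mean and variance will be:

E[X] = \mu = \sum_{i=1}^{n} w_i \mu_i,

E[(X - \mu)^2] = \sigma^2 = \sum_{i=1}^{n} w_i (\sigma_i^2 + \mu_i^2) - \mu^2.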

These relations highlight the potential of mixture distributions to display non-trivial higher-order moments such as skewness and kurtosis (fat tails) and multi-modality, even in the absence of such features within the components themselves. Marron and Wand (1992) give an illustrative account of the flexibility of this framework.[2]
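
As a worked example, the total mean and variance can be computed for the three-component normal mixture shown in the figure (means 5, 10 and 15, standard deviation 2, equal weights). The sketch below is illustrative only; it also evaluates the equivalent decomposition of the variance into a within-component part and a between-component part.

```python
weights = [1 / 3, 1 / 3, 1 / 3]
means = [5.0, 10.0, 15.0]
variances = [4.0, 4.0, 4.0]  # sd = 2 for every component

# Total mean: weighted average of the component means.
mixture_mean = sum(w * m for w, m in zip(weights, means))

# Total variance: weighted average of (variance + mean^2), minus the squared total mean.
mixture_variance = (
    sum(w * (v + m * m) for w, v, m in zip(weights, variances, means))
    - mixture_mean**2
)

# Equivalent form: average within-component variance plus the
# weighted spread of the component means around the total mean.
within = sum(w * v for w, v in zip(weights, variances))
between = sum(w * (m - mixture_mean) ** 2 for w, m in zip(weights, means))
```

Here the total mean is 10 and the total variance is 62/3 (about 20.67), of which 4 comes from the components themselves and 50/3 from the separation of their means.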

Modes

The question of multimodality is simple for some cases, such as mixtures of exponential distributions: all such mixtures are unimodal.[3] However, for the case of mixtures of normal distributions, it is a complex one. Conditions for the number of modes in a multivariate normal mixture are explored by Ray & Lindsay[4] extending earlier work on univariate[5][6] and multivariate[7] distributions.

Here the problem of evaluating the modes of an n-component mixture in a D-dimensional space is reduced to the identification of critical points (local minima, maxima and saddle points) on a manifold referred to as the ridgeline surface, which is the image of the ridgeline function

x^*(\alpha) = \left[ \sum_{i=1}^{n} \alpha_i \Sigma_i^{-1} \right]^{-1} \left[ \sum_{i=1}^{n} \alpha_i \Sigma_i^{-1} \mu_i \right],

where α belongs to the (n − 1)-dimensional standard simplex

\mathcal{S}_n = \left\{ \alpha \in \mathbb{R}^n : \alpha_i \in [0, 1], \ \sum_{i=1}^{n} \alpha_i = 1 \right\},

and Σ_i ∈ R^{D×D} and μ_i ∈ R^D correspond to the covariance and mean of the ith component. Ray & Lindsay[4] consider the case in which n − 1 < D, showing a one-to-one correspondence between the modes of the mixture and those of the ridge elevation function h(α) = f(x^*(α)), where f is the mixture density; thus one may identify the modes by solving dh(α)/dα = 0 with respect to α and determining the value x^*(α).

Using graphical tools, the potential multi-modality of mixtures with number of components n ∈ {2, 3} is demonstrated; in particular it is shown that the number of modes may exceed n and that the modes may not coincide with the component means. For two components they develop a graphical tool for analysis by instead solving the aforementioned differential with respect to the first mixing weight w_1 (which also determines the second mixing weight through w_2 = 1 − w_1) and expressing the solutions as a function Π(α), α ∈ [0, 1], so that the number and location of modes for a given value of w_1 corresponds to the number of intersections of the graph with the line Π(α) = w_1. This in turn can be related to the number of oscillations of the graph and therefore to solutions of dΠ(α)/dα = 0, leading to an explicit solution for the case of a two-component mixture with Σ_1 = Σ_2 = Σ (sometimes called a homoscedastic mixture) given by

1 - \alpha (1 - \alpha) \, d_M(\mu_1, \mu_2, \Sigma)^2,

where d_M(\mu_1, \mu_2, \Sigma) = \sqrt{(\mu_2 - \mu_1)^{\mathsf{T}} \Sigma^{-1} (\mu_2 - \mu_1)} is the Mahalanobis distance between μ_1 and μ_2.

Since the above is quadratic it follows that in this instance there are at most two modes irrespective of the dimension or the weights.

For normal mixtures with general n > 2 and D > 1, a lower bound for the maximum number of possible modes and, conditionally on the assumption that the maximum number is finite, an upper bound are known. For those combinations of n and D for which the maximum number is known, it matches the lower bound.[8]

Examples

Two normal distributions

Simple examples can be given by a mixture of two normal distributions. (See Multimodal distribution#Mixture of two normal distributions for more details.)

Given an equal (50/50) mixture of two normal distributions with the same standard deviation and different means (homoscedastic), the overall distribution will exhibit low kurtosis relative to a single normal distribution: the means of the subpopulations fall on the shoulders of the overall distribution. If the means are sufficiently separated, namely by more than twice the (common) standard deviation, the mixture is bimodal; otherwise it simply has a single wide peak.[9] The variance of the overall population will also be greater than the variance of either subpopulation (because of the spread between the different means), so the mixture is overdispersed relative to a normal distribution with the subpopulation variance, though not relative to a normal distribution whose variance equals that of the overall population.
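
The separation threshold can be explored with a simple grid search for local maxima of the density. The sketch below is a rough numerical check rather than a proof; the function names, the grid spacing and the two test separations (1.5 and 3 standard deviations) are arbitrary choices made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def equal_mixture_pdf(x, separation, sd=1.0):
    """50/50 mixture of two normals whose means are `separation` apart."""
    half = separation / 2.0
    return 0.5 * normal_pdf(x, -half, sd) + 0.5 * normal_pdf(x, half, sd)


def count_modes(separation, sd=1.0, half_width=8.0, step=0.01):
    """Count local maxima of the mixture density on a symmetric grid."""
    n = round(half_width / step)
    # Grid points (i - n) * step are symmetric about 0 and include 0 exactly.
    values = [
        equal_mixture_pdf((i - n) * step, separation, sd) for i in range(2 * n + 1)
    ]
    modes = 0
    for i in range(1, len(values) - 1):
        # A grid point is a local maximum if it rises from the left
        # and does not rise further to the right.
        if values[i] > values[i - 1] and values[i] >= values[i + 1]:
            modes += 1
    return modes


modes_close = count_modes(1.5)  # means 1.5 standard deviations apart
modes_far = count_modes(3.0)    # means 3 standard deviations apart
```

With the means 1.5 standard deviations apart the search finds a single peak, and with the means 3 standard deviations apart it finds two, in line with the threshold of twice the common standard deviation.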

Alternatively, given two subpopulations with the same mean and different standard deviations, the overall population will exhibit high kurtosis, with a sharper peak and heavier tails (and correspondingly shallower shoulders) than a single distribution.
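
Both kurtosis claims can be checked with the binomial expansion for central moments given in the Moments section. The sketch below assumes normal components, whose central moments up to order four are 1, 0, σ^2, 0 and 3σ^4; the helper names and the two example mixtures are hypothetical.

```python
from math import comb


def normal_central_moment(k, sd):
    """k-th central moment of a normal distribution, for k = 0, ..., 4."""
    return {0: 1.0, 1: 0.0, 2: sd**2, 3: 0.0, 4: 3.0 * sd**4}[k]


def mixture_central_moment(j, weights, means, sds):
    """j-th central moment of a normal mixture via the binomial expansion."""
    mu = sum(w * m for w, m in zip(weights, means))
    total = 0.0
    for w, m, s in zip(weights, means, sds):
        for k in range(j + 1):
            total += w * comb(j, k) * (m - mu) ** (j - k) * normal_central_moment(k, s)
    return total


def kurtosis(weights, means, sds):
    """Fourth central moment divided by the squared variance (3 for a normal)."""
    m2 = mixture_central_moment(2, weights, means, sds)
    m4 = mixture_central_moment(4, weights, means, sds)
    return m4 / m2**2


# Same standard deviation, different means: kurtosis below 3.
kurt_location = kurtosis([0.5, 0.5], [-1.5, 1.5], [1.0, 1.0])
# Same mean, different standard deviations: kurtosis above 3.
kurt_scale = kurtosis([0.5, 0.5], [0.0, 0.0], [1.0, 3.0])
```

For these example mixtures the kurtosis is about 2.04 in the first case and 4.92 in the second, compared with 3 for a single normal distribution.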

A normal and a Cauchy distribution

The following example is adapted from Hampel,[10] who credits John Tukey.

Consider the mixture distribution defined by

F(x) = (1 − 10^{−10}) (standard normal) + 10^{−10} (standard Cauchy).

The mean of i.i.d. observations from F(x) behaves "normally" except for exorbitantly large samples, although the mean of F(x) does not even exist.

Applications

Mixture densities are complicated densities expressible in terms of simpler densities (the mixture components). They are used both because they provide a good model for certain data sets (where different subsets of the data exhibit different characteristics and can best be modeled separately) and because they can be more mathematically tractable, since the individual mixture components can be more easily studied than the overall mixture density.

Mixture densities can be used to model a statistical population with subpopulations, where the mixture components are the densities on the subpopulations, and the weights are the proportions of each subpopulation in the overall population.

Mixture densities can also be used to model experimental error or contamination – one assumes that most of the samples measure the desired phenomenon, with some samples from a different, erroneous distribution.

Parametric statistics that assume no error often fail on such mixture densities – for example, statistics that assume normality often fail disastrously in the presence of even a few outliers – and instead one uses robust statistics.
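
The effect of contamination on a non-robust statistic can be illustrated by simulation. The sketch below assumes a contamination model in which 95% of observations come from a standard normal distribution and 5% from a normal distribution centred at 50; these numbers, the seed and the function name are arbitrary and chosen only for illustration.

```python
import random
import statistics


def contaminated_draw(rng, contamination=0.05):
    """One observation from the assumed contamination mixture."""
    if rng.random() < contamination:
        return rng.gauss(50.0, 1.0)  # erroneous measurement
    return rng.gauss(0.0, 1.0)       # measurement of the desired phenomenon


rng = random.Random(1)  # fixed seed so the run is reproducible
sample = [contaminated_draw(rng) for _ in range(10000)]

sample_mean = statistics.mean(sample)      # pulled towards the outliers
sample_median = statistics.median(sample)  # stays close to 0
```

Under these assumptions the sample mean is dragged to roughly 2.5 by the small contaminating component, while the median, a simple robust statistic, remains close to the centre of the main component.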

In meta-analysis of separate studies, study heterogeneity causes the distribution of results to be a mixture distribution, and leads to overdispersion of results relative to the predicted error. For example, in a statistical survey, the margin of error (determined by sample size) predicts the sampling error and hence the dispersion of results on repeated surveys. The presence of study heterogeneity (studies having different sampling bias) increases the dispersion relative to the margin of error.

See also

Mixture

Hierarchical models

Notes

  1. Frühwirth-Schnatter (2006, Ch. 1.2.4)
  2. Marron, J. S.; Wand, M. P. (1992). "Exact Mean Integrated Squared Error". The Annals of Statistics. 20 (2): 712–736. doi:10.1214/aos/1176348653. http://projecteuclid.org/euclid.aos/1176348653
  3. Frühwirth-Schnatter (2006, Ch. 1)
  4. Ray, R.; Lindsay, B. (2005). "The topography of multivariate normal mixtures". The Annals of Statistics. 33 (5): 2042–2065. arXiv:math/0602238. doi:10.1214/009053605000000417.
  5. Robertson, C. A.; Fryer, J. G. (1969). "Some descriptive properties of normal mixtures". Skand Aktuarietidskr: 137–146.
  6. Behboodian, J. (1970). "On the modes of a mixture of two normal distributions". Technometrics. 12: 131–139. doi:10.2307/1267357. JSTOR 1267357.
  7. Carreira-Perpiñán, M. Á.; Williams, C. (2003). On the modes of a Gaussian mixture. Published as: Lecture Notes in Computer Science 2695. Springer-Verlag. pp. 625–640. doi:10.1007/3-540-44935-3_44. ISSN 0302-9743.
  8. Améndola, C.; Engström, A.; Haase, C. (2020). "Maximum number of modes of Gaussian mixtures". Information and Inference: A Journal of the IMA. 9 (3): 587–600. arXiv:1702.05066. doi:10.1093/imaiai/iaz013.
  9. Schilling, Mark F.; Watkins, Ann E.; Watkins, William (2002). "Is human height bimodal?". The American Statistician. 56 (3): 223–229. doi:10.1198/00031300265.
  10. Hampel, Frank (1998). "Is statistics too difficult?". Canadian Journal of Statistics. 26: 497–513. doi:10.2307/3315772. hdl:20.500.11850/145503.
