Chi-squared test

Chi-squared distribution, showing χ² on the x-axis and p-value (right tail probability) on the y-axis.

A chi-squared test (also chi-square or χ² test) is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variables (the two dimensions of the contingency table) are independent of each other.[1] The test is valid when the test statistic is chi-squared distributed under the null hypothesis, which is the case for Pearson's chi-squared test and its variants. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. For contingency tables with smaller sample sizes, Fisher's exact test is used instead.

In the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a χ² frequency distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.

Test statistics that follow a χ² distribution occur when the observations are independent. There are also χ² tests for testing the null hypothesis of independence of a pair of random variables based on observations of the pairs.

The term chi-squared test often refers to tests for which the distribution of the test statistic approaches the χ² distribution asymptotically, meaning that the sampling distribution of the test statistic (if the null hypothesis is true) approximates a chi-squared distribution more and more closely as sample sizes increase.

History

In the 19th century, statistical analytical methods were mainly applied in biological data analysis, and it was customary for researchers to assume that observations followed a normal distribution. Among them were Sir George Airy and Mansfield Merriman, whose works were criticized by Karl Pearson in his 1900 paper.[2]

At the end of the 19th century, Pearson noticed significant skewness in some biological observations. To model the observations whether they were normal or skewed, Pearson, in a series of articles published from 1893 to 1916,[3][4][5][6] devised the Pearson distribution, a family of continuous probability distributions that includes the normal distribution and many skewed distributions. He proposed a method of statistical analysis that uses the Pearson distribution to model the observations and then performs a test of goodness of fit to determine how well the model fits them.

Pearson's chi-squared test

In 1900, Pearson published a paper[2] on the χ² test, which is considered to be one of the foundations of modern statistics.[7] In this paper, Pearson investigated a test of goodness of fit.

Suppose that n observations in a random sample from a population are classified into k mutually exclusive classes with respective observed numbers of observations $x_i$ (for i = 1, 2, …, k), and a null hypothesis gives the probability $p_i$ that an observation falls into the i-th class. So we have the expected numbers $m_i = np_i$ for all i, where

$$\sum_{i=1}^{k} p_i = 1, \qquad \sum_{i=1}^{k} m_i = n\sum_{i=1}^{k} p_i = \sum_{i=1}^{k} x_i = n.$$

Pearson proposed that, if the null hypothesis is correct, then as n → ∞ the limiting distribution of the quantity given below is the χ² distribution:

$$X^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i} = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - n.$$

Pearson dealt first with the case in which the expected numbers $m_i$ are known and large enough in all cells that every observation $x_i$ may be taken as normally distributed, and reached the result that, in the limit as n becomes large, $X^2$ follows the χ² distribution with k − 1 degrees of freedom.
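As a minimal sketch of the goodness-of-fit calculation described above, the following Python snippet computes Pearson's statistic in two algebraically equivalent ways: as the sum of (x_i − m_i)²/m_i, and as the sum of x_i²/m_i minus n. The die-rolling counts and the function names are hypothetical and chosen only for illustration; 11.07 is the tabulated upper 5% point of the χ² distribution with 5 degrees of freedom.

```python
def pearson_statistic(observed, probabilities):
    """Pearson's X^2 for observed class counts and null-hypothesis class probabilities."""
    n = sum(observed)
    expected = [n * p for p in probabilities]  # m_i = n * p_i
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))


def pearson_statistic_shortcut(observed, probabilities):
    """Equivalent form: the sum of x_i^2 / m_i, minus n."""
    n = sum(observed)
    return sum(x * x / (n * p) for x, p in zip(observed, probabilities)) - n


# Hypothetical data: 60 rolls of a die; the null hypothesis is that the die is fair.
observed = [5, 8, 9, 8, 10, 20]
probabilities = [1 / 6] * 6

x2 = pearson_statistic(observed, probabilities)
x2_shortcut = pearson_statistic_shortcut(observed, probabilities)
degrees_of_freedom = len(observed) - 1  # k - 1

# Tabulated upper 5% point of the chi-squared distribution with 5 degrees of freedom.
CRITICAL_5_PERCENT_DF5 = 11.07
reject_null = x2 > CRITICAL_5_PERCENT_DF5

print(f"X^2 = {x2:.2f} (shortcut form: {x2_shortcut:.2f}), df = {degrees_of_freedom}")
print("reject fairness at the 5% level:", reject_null)
```

With these made-up counts every expected number is 10, so the statistic is (25 + 4 + 1 + 4 + 0 + 100)/10 = 13.4, which exceeds 11.07.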

However, Pearson next considered the case in which the expected numbers depended on parameters that had to be estimated from the sample. With $m_i$ denoting the true expected numbers and $m'_i$ the estimated expected numbers, he suggested that the difference

$$X^2 - X'^2 = \sum_{i=1}^{k} \frac{x_i^2}{m_i} - \sum_{i=1}^{k} \frac{x_i^2}{m'_i}$$

will usually be positive and small enough to be omitted. In conclusion, Pearson argued that if we regarded $X'^2$ as also following the χ² distribution with k − 1 degrees of freedom, the error in this approximation would not affect practical decisions. This conclusion caused some controversy in practical applications and was not settled for 20 years, until Fisher's 1922 and 1924 papers.[8][9]

Other examples of chi-squared tests

One test statistic that follows a chi-squared distribution exactly is the one used to test whether the variance of a normally distributed population has a given value, based on a sample variance. Such tests are uncommon in practice because the true variance of the population is usually unknown. However, there are several statistical tests where the chi-squared distribution is approximately valid:

Fisher's exact test

For an exact test used in place of the 2 × 2 chi-squared test for independence, see Fisher's exact test.

Binomial test

For an exact test used in place of the 2 × 1 chi-squared test for goodness of fit, see binomial test.

Other chi-squared tests

Yates's correction for continuity

Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial frequencies in the table can be approximated by the continuous chi-squared distribution. This assumption is not quite correct and introduces some error.

To reduce the error in approximation, Frank Yates suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the absolute difference between each observed value and its expected value in a 2 × 2 contingency table.[10] This reduces the chi-squared value obtained and thus increases its p-value.
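The effect of the correction can be sketched in a few lines of Python. The 2 × 2 counts below are hypothetical, and the function names are illustrative rather than taken from any library. The p-value uses the fact that, for one degree of freedom, the right-tail probability of the chi-squared distribution equals erfc(√(x/2)).

```python
import math


def chi2_2x2(table, yates=False):
    """Pearson's statistic for a 2 x 2 table of counts, optionally with Yates's correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            deviation = abs(table[i][j] - expected)
            if yates:
                deviation -= 0.5  # Yates: shrink each absolute deviation by 0.5
            statistic += deviation ** 2 / expected
    return statistic


def p_value_df1(statistic):
    """Right-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts, chosen only for illustration.
table = [[12, 5],
         [6, 14]]

plain = chi2_2x2(table)
corrected = chi2_2x2(table, yates=True)
print(f"uncorrected: X^2 = {plain:.3f}, p = {p_value_df1(plain):.4f}")
print(f"Yates:       X^2 = {corrected:.3f}, p = {p_value_df1(corrected):.4f}")
```

For this table the corrected statistic is smaller and its p-value larger, as the text describes. The sketch does not guard against cells where the absolute deviation is already below 0.5; a more careful implementation would need to decide how to treat that case.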

Chi-squared test for variance in a normal population

If a sample of size n is taken from a population having a normal distribution, then there is a result (see distribution of the sample variance) which allows a test to be made of whether the variance of the population has a pre-determined value. For example, a manufacturing process might have been in stable condition for a long period, allowing a value for the variance to be determined essentially without error. Suppose that a variant of the process is being tested, giving rise to a small sample of n product items whose variation is to be tested. The test statistic T in this instance could be set to be the sum of squares about the sample mean, divided by the nominal value for the variance (i.e. the value to be tested as holding). Then T has a chi-squared distribution with n − 1 degrees of freedom. For example, if the sample size is 21, the acceptance region for T with a significance level of 5% is between 9.59 and 34.17.
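The following Python sketch illustrates the statistic T and checks the acceptance region quoted above. It relies on a closed form for the right-tail probability of the chi-squared distribution that holds only when the number of degrees of freedom is even (as it is for n = 21). The five measurements and the nominal variance of 0.04 are hypothetical, and the function names are illustrative.

```python
import math


def variance_test_statistic(sample, nominal_variance):
    """T = (sum of squares about the sample mean) / (variance under test)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / nominal_variance


def chi2_sf_even_df(x, df):
    """Right-tail probability P(chi2_df > x); this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j  # term is now half**j / j!
        total += term
    return math.exp(-half) * total


# Check the acceptance region quoted in the text for n = 21 (20 degrees of freedom):
# each tail beyond the limits should carry about 2.5% probability.
lower_tail = 1.0 - chi2_sf_even_df(9.59, 20)
upper_tail = chi2_sf_even_df(34.17, 20)
print(f"P(T < 9.59) = {lower_tail:.4f}, P(T > 34.17) = {upper_tail:.4f}")

# Hypothetical sample of n = 5 items, tested against a nominal variance of 0.04.
sample = [9.8, 10.4, 10.1, 9.5, 10.2]
t = variance_test_statistic(sample, 0.04)
p_upper = chi2_sf_even_df(t, len(sample) - 1)
print(f"T = {t:.2f} on {len(sample) - 1} degrees of freedom, upper-tail p = {p_upper:.4f}")
```

For the hypothetical sample the squared deviations about the mean of 10.0 sum to 0.50, so T = 0.50/0.04 = 12.5 on 4 degrees of freedom.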

Example chi-squared test for categorical data

Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as:

Occupation       A     B     C     D   Total
White collar    90    60   104    95     349
Blue collar     30    50    51    20     151
No collar       30    40    45    35     150
Total          150   150   200   150     650

Let us use the proportion of the sample living in neighborhood A, 150/650, to estimate what proportion of the whole 1,000,000 live in neighborhood A. Similarly, we take 349/650 to estimate what proportion of the 1,000,000 are white-collar workers. By the assumption of independence under the hypothesis, we should "expect" the number of white-collar workers in neighborhood A to be

$$150 \times \frac{349}{650} \approx 80.54.$$

Then in that "cell" of the table, we have

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(90 - 80.54)^2}{80.54} \approx 1.11.$$

The sum of these quantities over all of the cells is the test statistic; in this case, it is approximately 24.57. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is

$$(\text{number of rows} - 1)(\text{number of columns} - 1) = (3 - 1)(4 - 1) = 6.$$

If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence. Here the chi-squared value of 24.57 is large for 6 degrees of freedom (the corresponding p-value is about 0.0004), so we have strong evidence to reject the null hypothesis (H0). This suggests that a person's neighborhood of residence and occupational classification are associated rather than independent.
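The arithmetic of this example can be reproduced with a short Python sketch. It uses the counts from the table above; the variable names are illustrative. The p-value line uses a closed form for the right-tail probability that applies specifically to 6 degrees of freedom, so it is not a general-purpose routine.

```python
import math

# Observed counts from the table: rows are occupations, columns are neighborhoods A-D.
observed = [
    [90, 60, 104, 95],   # white collar
    [30, 50, 51, 20],    # blue collar
    [30, 40, 45, 35],    # no collar
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total) * (column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over all twelve cells.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Right-tail probability for exactly 6 degrees of freedom:
# P(chi2_6 > x) = exp(-x/2) * (1 + x/2 + (x/2)^2 / 2).
half = statistic / 2.0
p_value = math.exp(-half) * (1.0 + half + half ** 2 / 2.0)

print(f"expected white-collar count in A: {expected[0][0]:.2f}")
print(f"chi-squared = {statistic:.2f}, df = {df}, p = {p_value:.5f}")
```

The same code applies unchanged to a test of homogeneity, since only the sampling scheme differs and not the computation.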

A related issue is a test of homogeneity. Suppose that instead of giving every resident of each of the four neighborhoods an equal chance of inclusion in the sample, we decide in advance how many residents of each neighborhood to include. Then each resident has the same chance of being chosen as do all residents of the same neighborhood, but residents of different neighborhoods would have different probabilities of being chosen if the four sample sizes are not proportional to the populations of the four neighborhoods. In such a case, we would be testing "homogeneity" rather than "independence". The question is whether the proportions of blue-collar, white-collar, and no-collar workers in the four neighborhoods are the same. However, the test is done in the same way.

Applications

In cryptanalysis, the chi-squared test is used to compare the letter distribution of known plaintext with that of a (possibly) decrypted ciphertext. The candidate decryption that gives the lowest value of the test statistic is the one most likely to be correct.[11][12] This method can be generalized for solving modern cryptographic problems.[13]
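As an illustration of the idea, the sketch below scores every possible decryption of a Caesar (shift) cipher by Pearson's statistic against typical English letter frequencies and keeps the lowest-scoring one. The choice of a Caesar cipher, the sample text, the function names, and the frequency table (approximate, commonly quoted percentages) are all assumptions made for this example and are not taken from the cited sources.

```python
import string

# Approximate relative frequencies (percent) of a-z in English text.
# These are commonly quoted figures, included here as an assumption.
ENGLISH_FREQ = [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966, 0.153,
    0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987, 6.327, 9.056,
    2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]
ALPHABET = string.ascii_lowercase


def caesar_shift(text, shift):
    """Shift every lowercase letter by `shift` positions; leave other characters alone."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text
    )


def chi_squared_score(text):
    """Pearson's statistic of the text's letter counts against English expectations."""
    letters = [ch for ch in text if ch in ALPHABET]
    total = len(letters)
    score = 0.0
    for index, letter in enumerate(ALPHABET):
        expected = total * ENGLISH_FREQ[index] / 100.0
        score += (letters.count(letter) - expected) ** 2 / expected
    return score


def crack_caesar(ciphertext):
    """Try all 26 shifts and return the (shift, plaintext) pair with the lowest score."""
    candidates = [(k, caesar_shift(ciphertext, -k)) for k in range(26)]
    return min(candidates, key=lambda pair: chi_squared_score(pair[1]))


# Sample text for the demonstration, encrypted with a shift of 7.
plaintext = (
    "it was the best of times it was the worst of times "
    "it was the age of wisdom it was the age of foolishness"
)
ciphertext = caesar_shift(plaintext, 7)

shift, recovered = crack_caesar(ciphertext)
print("recovered shift:", shift)
print("recovered text: ", recovered)
```

The approach becomes less reliable on very short messages, where letter counts are too small for the expected frequencies to be a useful guide.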

In bioinformatics, the chi-squared test is used to compare the distribution of certain properties of genes (e.g., genomic content, mutation rate, interaction network clustering) belonging to different categories (e.g., disease genes, essential genes, genes on a certain chromosome).[14][15]

References

  1. "Chi-Square - Sociology 3112 - Department of Sociology - The University of Utah". soc.utah.edu. Retrieved 2022-11-12.
  2. Pearson, Karl (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine. Series 5. 50 (302): 157–175. doi:10.1080/14786440009463897.
  3. Pearson, Karl (1893). "Contributions to the mathematical theory of evolution [abstract]". Proceedings of the Royal Society. 54: 329–333. doi:10.1098/rspl.1893.0079. JSTOR 115538.
  4. Pearson, Karl (1895). "Contributions to the mathematical theory of evolution, II: Skew variation in homogeneous material". Philosophical Transactions of the Royal Society. 186: 343–414. Bibcode:1895RSPTA.186..343P. doi:10.1098/rsta.1895.0010. JSTOR 90649.
  5. Pearson, Karl (1901). "Mathematical contributions to the theory of evolution, X: Supplement to a memoir on skew variation". Philosophical Transactions of the Royal Society A. 197 (287–299): 443–459. Bibcode:1901RSPTA.197..443P. doi:10.1098/rsta.1901.0023. JSTOR 90841.
  6. Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation". Philosophical Transactions of the Royal Society A. 216 (538–548): 429–457. Bibcode:1916RSPTA.216..429P. doi:10.1098/rsta.1916.0009. JSTOR 91092.
  7. Cochran, William G. (1952). "The Chi-square Test of Goodness of Fit". The Annals of Mathematical Statistics. 23 (3): 315–345. doi:10.1214/aoms/1177729380. JSTOR 2236678.
  8. Fisher, Ronald A. (1922). "On the Interpretation of χ² from Contingency Tables, and the Calculation of P". Journal of the Royal Statistical Society. 85 (1): 87–94. doi:10.2307/2340521. JSTOR 2340521.
  9. Fisher, Ronald A. (1924). "The Conditions Under Which χ² Measures the Discrepancey Between Observation and Hypothesis". Journal of the Royal Statistical Society. 87 (3): 442–450. JSTOR 2341149.
  10. Yates, Frank (1934). "Contingency table involving small numbers and the χ² test". Supplement to the Journal of the Royal Statistical Society. 1 (2): 217–235. doi:10.2307/2983604. JSTOR 2983604.
  11. "Chi-squared Statistic". Practical Cryptography. Archived from the original on 18 February 2015. Retrieved 18 February 2015.
  12. "Using Chi Squared to Crack Codes". IB Maths Resources. British International School Phuket. 15 June 2014.
  13. Ryabko, B. Ya.; Stognienko, V. S.; Shokin, Yu. I. (2004). "A new test for randomness and its application to some cryptographic problems" (PDF). Journal of Statistical Planning and Inference. 123 (2): 365–376. doi:10.1016/s0378-3758(03)00149-6. Retrieved 18 February 2015.
  14. Feldman, I.; Rzhetsky, A.; Vitkup, D. (2008). "Network properties of genes harboring inherited disease mutations". PNAS. 105 (11): 4323–432. Bibcode:2008PNAS..105.4323F. doi:10.1073/pnas.0701722105. PMC 2393821. PMID 18326631.
  15. "chi-square-tests" (PDF). Archived from the original (PDF) on 29 June 2018. Retrieved 29 June 2018.
