Box plot

Figure 1. Box plot of data from the Michelson experiment

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles.[1] In addition to the box, a box plot can have lines (called whiskers) extending from the box to indicate variability outside the upper and lower quartiles; the plot is therefore also called a box-and-whisker plot or box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset[2] may be plotted as individual points beyond the whiskers. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution[3] (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length). The spacings in each subsection of the box plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.

History

The range-bar method was first introduced by Mary Eleanor Spear in her book "Charting Statistics" in 1952[4] and again in her book "Practical Charting Techniques" in 1969.[5] The box-and-whisker plot was first introduced in 1970 by John Tukey, who later published on the subject in his book "Exploratory Data Analysis" in 1977.[6]

Elements

Figure 2. Box-plot with whiskers from minimum to maximum
Figure 3. Same box-plot with whiskers drawn within the 1.5 IQR value

A boxplot is a standardized way of displaying the dataset based on the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

  • Minimum (Q0 or 0th percentile): the lowest data point in the data set excluding any outliers
  • Maximum (Q4 or 100th percentile): the highest data point in the data set excluding any outliers
  • Median (Q2 or 50th percentile): the middle value in the data set
  • First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), it is the median of the lower half of the dataset.
  • Third quartile (Q3 or 75th percentile): also known as the upper quartile qn(0.75), it is the median of the upper half of the dataset.[7]
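
As a rough illustration, the five values can be computed with the median-of-halves rule described in the list above. The following Python sketch is not taken from the cited sources: the function name five_number_summary and the sample values are made up for illustration, and the minimum and maximum are taken from the raw data before any outlier rule is applied.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # lower half (the middle value is left out when n is odd)
    upper = data[n - half:]    # upper half (the middle value is left out when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

# Hypothetical sample, chosen only to show the call
print(five_number_summary([1, 3, 4, 6, 7, 9, 12, 15]))  # (1, 3.5, 6.5, 10.5, 15)
```

Other quartile conventions exist and can give slightly different values for small samples, so software packages may not agree with this sketch exactly.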

In addition to the minimum and maximum values, another important quantity used to construct a box plot is the interquartile range (IQR), the distance between the upper and lower quartiles: IQR = Q3 − Q1.

A box-plot usually includes two parts, a box and a set of whiskers as shown in Figure 2.

Box

The box is drawn from Q1 to Q3 with a horizontal line drawn inside it to denote the median. Some box plots include an additional character to represent the mean of the data.[8][9]

Whiskers

The whiskers must end at an observed data point, but can be defined in various ways. In the most straightforward method, the boundary of the lower whisker is the minimum value of the data set, and the boundary of the upper whisker is the maximum value of the data set. Because of this variability, it is appropriate to describe the convention that is being used for the whiskers and outliers in the caption of the box-plot.

Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile (Q3), a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile (Q1) and a whisker is drawn down to the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as outliers.[10] The outliers can be plotted on the box-plot as a dot, a small circle, a star, etc. (see example below).
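
The 1.5 IQR rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name tukey_whiskers, the parameter k and the sample values are hypothetical, and the quartiles are passed in by the caller so that any quartile convention can be used.

```python
def tukey_whiskers(values, q1, q3, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) for the k * IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at observed data points that lie inside the fences
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Everything beyond the fences is reported as an outlier
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return min(inside), max(inside), outliers

# Hypothetical sample with one low and one high outlier
sample = [2, 10, 11, 12, 13, 14, 15, 30]
print(tukey_whiskers(sample, q1=10.5, q3=14.5))  # (10, 15, [2, 30])
```

Here the fences lie at 4.5 and 20.5, but the whiskers stop at 10 and 15 because those are the most extreme observed values inside the fences.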

There are other representations in which the whiskers can stand for several other things, such as:

  • One standard deviation above and below the mean of the data set
  • The 9th percentile and the 91st percentile of the data set
  • The 2nd percentile and the 98th percentile of the data set

Rarely, a box plot can be drawn without whiskers. This can be appropriate for sensitive information, to avoid whiskers (and outliers) disclosing actual observed values.[11]

The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be approximately equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.
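
The spacing of the seven marks for normally distributed data can be checked numerically. The sketch below uses NormalDist from Python's standard statistics module; the names PERCENTILES and seven_mark_positions are made up for illustration.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary described above
PERCENTILES = (0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98)

def seven_mark_positions(mu=0.0, sigma=1.0):
    """Locations of the seven marks for a normal distribution N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf(p) for p in PERCENTILES]

marks = seven_mark_positions()
# Distances between neighbouring marks, in units of sigma
gaps = [b - a for a, b in zip(marks, marks[1:])]
print([round(g, 2) for g in gaps])
```

For the standard normal distribution, every gap comes out between roughly 0.67 and 0.71 standard deviations, so the marks are close to evenly spaced, though not exactly.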

Variations

Figure 4. Four box plots, with and without notches and variable width

Since the mathematician John W. Tukey first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed. The two most common are the variable-width box plot and the notched box plot, both shown in Figure 4.

Variable-width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.[12]
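
The square-root convention can be sketched as follows. The function name box_widths, the max_width parameter and the group sizes are hypothetical; the widths are scaled so that the largest group receives the full width.

```python
from math import sqrt

def box_widths(group_sizes, max_width=1.0):
    """Box widths proportional to the square root of each group's size."""
    largest = max(group_sizes)
    return [max_width * sqrt(n) / sqrt(largest) for n in group_sizes]

# Hypothetical group sizes
print(box_widths([25, 100, 400]))  # [0.25, 0.5, 1.0]
```

A group sixteen times larger is therefore drawn only four times wider, which keeps small groups visible next to large ones.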

Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide of the significance of the difference of medians; if the notches of two boxes do not overlap, this will provide evidence of a statistically significant difference between the medians.[12] The height of the notches is proportional to the interquartile range (IQR) of the sample and is inversely proportional to the square root of the size of the sample. However, there is an uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).[12] The width of the notch is arbitrarily chosen to be visually pleasing, and should be consistent amongst all box plots being displayed on the same page.

One convention for obtaining the boundaries of these notches is to use a distance of ±1.58 × IQR / √n around the median, where n is the sample size.[13]
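
A minimal sketch of the notch computation follows, assuming the multiplier 1.58 that the R manual documents for boxplot.stats. The function names notch_limits and notches_overlap and the summary values are hypothetical.

```python
from math import sqrt

def notch_limits(median, q1, q3, n, multiplier=1.58):
    """Return (lower, upper) notch limits: median +/- multiplier * IQR / sqrt(n)."""
    half_width = multiplier * (q3 - q1) / sqrt(n)
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if two (lower, upper) notch intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical summaries of two groups of 25 observations each
group_a = notch_limits(median=70, q1=66, q3=76, n=25)   # about (66.84, 73.16)
group_b = notch_limits(median=78, q1=74, q3=84, n=25)   # about (74.84, 81.16)
print(notches_overlap(group_a, group_b))  # False
```

Non-overlapping notches, as in this made-up case, are read as rough evidence that the two medians differ; the check is a visual guide rather than a formal test.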

Adjusted box plots are intended to describe skew distributions, and they rely on the medcouple statistic of skewness.[14] For a medcouple value of MC, the lengths of the upper and lower whiskers on the box plot are respectively defined to be:

  • 1.5 IQR · e^(3 MC) and 1.5 IQR · e^(−4 MC), if MC ≥ 0
  • 1.5 IQR · e^(4 MC) and 1.5 IQR · e^(−3 MC), if MC < 0

For a symmetrical data distribution, the medcouple will be zero, and this reduces the adjusted box plot to Tukey's box plot with equal whisker lengths of 1.5 IQR for both whiskers.
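
The effect of the medcouple on the whisker lengths can be sketched as below. The exponential factors (3 and −4 for MC ≥ 0, 4 and −3 for MC < 0) are the ones usually attributed to Hubert and Vandervieren and should be checked against the cited paper; the function name adjusted_whisker_lengths is hypothetical, and computing the medcouple itself is outside the scope of this sketch.

```python
from math import exp

def adjusted_whisker_lengths(iqr, mc, k=1.5):
    """Return (upper_length, lower_length): the maximum reach of each whisker."""
    if mc >= 0:
        # Right-skewed data: longer upper whisker, shorter lower whisker
        return k * iqr * exp(3 * mc), k * iqr * exp(-4 * mc)
    # Left-skewed data: the roles of the two factors are mirrored
    return k * iqr * exp(4 * mc), k * iqr * exp(-3 * mc)

print(adjusted_whisker_lengths(iqr=9, mc=0.0))  # (13.5, 13.5)
```

With MC = 0 both lengths equal 1.5 IQR, which matches the Tukey rule; a positive MC stretches the upper whisker and shortens the lower one.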

Other kinds of plots, such as violin plots and bean plots, can show the difference between unimodal and multimodal distributions, which cannot be observed from the classical box plot.[6]

Examples

Example without outliers

Figure 5. Box plot generated for the example without outliers

A series of hourly temperatures were measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

A box plot of the data set can be generated by first calculating five relevant values of this data set: minimum, maximum, median (Q2), first quartile (Q1), and third quartile (Q3).

The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature is 57°F.

The maximum is the largest number of the data set. In this case, the maximum recorded day temperature is 81°F.

The median is the "middle" number of the ordered data set: half of the elements lie at or below it and half lie at or above it. With 24 values, the median is the average of the 12th and 13th ordered values, both of which are 70°F, so the median of this data set is 70°F.

The first quartile value (Q1 or 25th percentile) is the number that marks one quarter of the ordered data set: about 25% of the elements lie below it and about 75% lie above it. It can be determined as the "middle" number between the minimum and the median, that is, the median of the lower 12 values. For the hourly temperatures, the 6th and 7th ordered values are both 66°F, so the first quartile is 66°F.

The third quartile value (Q3 or 75th percentile) is the number that marks three quarters of the ordered data set: about 75% of the elements lie below it and about 25% lie above it. It can be obtained as the "middle" number between the median and the maximum, that is, the median of the upper 12 values. For the hourly temperatures, the 18th and 19th ordered values are both 75°F, so the third quartile is 75°F.

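The interquartile range, or IQR, can be calculated by subtracting the first quartile value (Q1) from the third quartile value (Q3):

IQR = Q3 − Q1

Hence,

IQR = 75°F − 66°F = 9°F

1.5 IQR above the third quartile is:

Q3 + 1.5 IQR = 75°F + 13.5°F = 88.5°F

1.5 IQR below the first quartile is:

Q1 − 1.5 IQR = 66°F − 13.5°F = 52.5°F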

The upper whisker boundary of the box-plot is the largest data value that is within 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5°F and the maximum is 81°F. Therefore, the upper whisker is drawn at the value of the maximum, which is 81°F.

Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5°F and the minimum is 57°F. Therefore, the lower whisker is drawn at the value of the minimum, which is 57°F.
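
The steps of this example can be reproduced with a short Python sketch. It uses the median-of-halves rule for the quartiles, as in the text; the variable names are chosen for illustration only.

```python
from statistics import median

# Hourly temperatures from the example, in °F
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

data = sorted(temps)
half = len(data) // 2
q1 = median(data[:half])               # median of the lower 12 values
q2 = median(data)                      # median of all 24 values
q3 = median(data[len(data) - half:])   # median of the upper 12 values
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr
# Whiskers end at the most extreme observed values inside the fences
upper_whisker = max(v for v in data if v <= upper_fence)
lower_whisker = min(v for v in data if v >= lower_fence)

print(q1, q2, q3, iqr)                 # 66.0 70.0 75.0 9.0
print(lower_fence, upper_fence)        # 52.5 88.5
print(lower_whisker, upper_whisker)    # 57 81
```

Because both fences lie beyond the observed extremes, the whiskers coincide with the minimum and the maximum and no point is flagged as an outlier.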

Example with outliers

Figure 6. Box plot generated for the example with outliers

The previous example contains no outliers. The following example shows how to generate a box plot for a data set with outliers:

The ordered set for the recorded temperatures is (°F): 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.

In this example, only the first and the last number are changed. The median, third quartile, and first quartile remain the same.

In this case, the maximum value in this data set is 89°F, and 1.5 IQR above the third quartile is 88.5°F. The maximum is greater than the third quartile plus 1.5 IQR, so the maximum is an outlier. Therefore, the upper whisker is drawn at the greatest value that does not exceed 88.5°F, which is 79°F.

Similarly, the minimum value in this data set is 52°F, and 1.5 IQR below the first quartile is 52.5°F. The minimum is smaller than the first quartile minus 1.5 IQR, so the minimum is also an outlier. Therefore, the lower whisker is drawn at the smallest value that is not below 52.5°F, which is 57°F.
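
The same calculation can be wrapped in a small function that also reports the outliers. This is an illustrative sketch only: the function name box_plot_stats and the layout of the returned dictionary are hypothetical.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Quartiles (median-of-halves), whisker ends and outliers for the k * IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median(data), "q3": q3,
            "whiskers": (inside[0], inside[-1]), "outliers": outliers}

# Temperatures from the example with outliers, in °F
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

stats = box_plot_stats(temps)
print(stats["q1"], stats["median"], stats["q3"])  # 66.0 70.0 75.0
print(stats["whiskers"], stats["outliers"])       # (57, 79) [52, 89]
```

The quartiles are unchanged from the first example, while 52°F and 89°F fall outside the fences at 52.5°F and 88.5°F and are listed as outliers.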

In the case of large datasets

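For a data set containing a large number of data points, the quartiles can be obtained from a general formula for empirical quantiles.

General equation to compute empirical quantiles

q_n(p) = x_(k) + α (x_(k+1) − x_(k)), with k = ⌊p(n + 1)⌋ and α = p(n + 1) − k

Here x_(k) stands for the general ordering of the data points (i.e. if i < k, then x_(i) ≤ x_(k)).

Using the above example that has 24 data points (n = 24), one can calculate the median, first and third quartile either mathematically or visually.

Median

q_n(0.5) = x_(12) + (0.5 · 25 − 12) (x_(13) − x_(12)) = 70 + 0.5 · (70 − 70) = 70°F

First quartile

q_n(0.25) = x_(6) + (0.25 · 25 − 6) (x_(7) − x_(6)) = 66 + 0.25 · (66 − 66) = 66°F

Third quartile

q_n(0.75) = x_(18) + (0.75 · 25 − 18) (x_(19) − x_(18)) = 75 + 0.75 · (75 − 75) = 75°F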
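
A sketch of an empirical quantile function follows. It assumes the interpolation rule q_n(p) = x_(k) + α (x_(k+1) − x_(k)) with k = ⌊p(n + 1)⌋ and α = p(n + 1) − k, where x_(k) is the k-th smallest value; the function name empirical_quantile and the clamping at the ends of the data are assumptions made for this illustration.

```python
from math import floor

def empirical_quantile(values, p):
    """Empirical quantile q_n(p) by linear interpolation between order statistics."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)
    k = floor(position)          # 1-based index of the lower order statistic
    alpha = position - k         # fractional part used for interpolation
    if k < 1:                    # assumption: clamp to the smallest value
        return data[0]
    if k >= n:                   # assumption: clamp to the largest value
        return data[-1]
    return data[k - 1] + alpha * (data[k] - data[k - 1])

# The 24 hourly temperatures from the example without outliers, in °F
temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

print(empirical_quantile(temps, 0.25),
      empirical_quantile(temps, 0.50),
      empirical_quantile(temps, 0.75))  # 66.0 70.0 75.0
```

For this data set the interpolation rule agrees with the median-of-halves rule used earlier, because the neighbouring order statistics at each quartile are equal.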

Visualization

Figure 7. Box plot and probability density function (pdf) of a normal N(0, σ²) population

Although box plots may seem more primitive than histograms or kernel density estimates, they do have a number of advantages. First, the box plot enables statisticians to do a quick graphical examination of one or more data sets. Box plots also take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data in parallel (see Figure 1 for an example). Lastly, the overall structure of histograms and kernel density estimates can be strongly influenced by the choice of the number and width of bins and by the choice of bandwidth, respectively, whereas a box plot requires no such tuning.

Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0, σ²) distribution and observe their characteristics directly (as shown in Figure 7).
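
The comparison with a normal distribution can also be made numerically. The sketch below uses NormalDist from Python's standard statistics module, with σ = 1 assumed for simplicity, to locate the quartiles and the 1.5 IQR fences and to estimate the share of a normal population that falls beyond the fences.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)   # sigma = 1 is an assumption for this sketch

q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass beyond the two fences
outside = std_normal.cdf(lower_fence) + (1.0 - std_normal.cdf(upper_fence))

print(round(q3, 3), round(iqr, 3), round(upper_fence, 3))
print(f"{outside:.2%} of a normal population lies beyond the fences")
```

Under these assumptions the quartiles sit at about ±0.67σ, the fences at about ±2.7σ, and roughly 0.7% of a normal population would be drawn as outliers.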

Figure 8. Box-plots displaying the skewness of the data set

References

  1. Dutoit, S. H. C. (2012). Graphical exploratory data analysis. Springer. ISBN 978-1-4612-9371-2. OCLC 1019645745.
  2. Grubbs, Frank E. (February 1969). "Procedures for Detecting Outlying Observations in Samples". Technometrics. 11 (1): 1–21. doi:10.1080/00401706.1969.10490657. ISSN 0040-1706.
  3. Boddy, Richard (2009). Statistical Methods in Practice: for Scientists and Technologists. John Wiley & Sons. ISBN 978-0-470-74664-6. OCLC 940679163.
  4. Spear, Mary Eleanor (1952). Charting Statistics. McGraw Hill. p. 166.
  5. Spear, Mary Eleanor (1969). Practical charting techniques. New York: McGraw-Hill. ISBN 0070600104. OCLC 924909765.
  6. Wickham, Hadley; Stryjewski, Lisa. "40 years of boxplots" (PDF). Retrieved December 24, 2020.
  7. Holmes, Alexander; Illowsky, Barbara; Dean, Susan (31 March 2015). "Introductory Business Statistics". OpenStax. Archived from the original on 27 July 2020. Retrieved 29 April 2020.
  8. Frigge, Michael; Hoaglin, David C.; Iglewicz, Boris (February 1989). "Some Implementations of the Boxplot". The American Statistician. 43 (1): 50–54. doi:10.2307/2685173. JSTOR 2685173.
  9. Marmolejo-Ramos, F.; Tian, S. (2010). "The shifting boxplot. A boxplot based on essential summary statistics around the mean". International Journal of Psychological Research. 3 (1): 37–46. doi:10.21500/20112084.823. hdl:10819/6492.
  10. Dekking, F.M. (2005). A Modern Introduction to Probability and Statistics. Springer. pp. 234–238. ISBN 1-85233-896-2.
  11. Derrick, Ben; Green, Elizabeth; Ritchie, Felix; White, Paul (September 2022). "The Risk of Disclosure When Reporting Commonly Used Univariate Statistics". Privacy in Statistical Databases. Lecture Notes in Computer Science. Vol. 13463. pp. 119–129. doi:10.1007/978-3-031-13945-1_9. ISBN 978-3-031-13944-4.
  12. McGill, Robert; Tukey, John W.; Larsen, Wayne A. (February 1978). "Variations of Box Plots". The American Statistician. 32 (1): 12–16. doi:10.2307/2683468. JSTOR 2683468.
  13. "R: Box Plot Statistics". R manual. Retrieved 26 June 2011.
  14. Hubert, M.; Vandervieren, E. (2008). "An adjusted boxplot for skewed distribution". Computational Statistics and Data Analysis. 52 (12): 5186–5201. CiteSeerX 10.1.1.90.9812. doi:10.1016/j.csda.2007.11.008.

Further reading

  • Beeswarm Boxplot - superimposing a frequency-jittered stripchart on top of a box plot
