Theil–Sen estimator

The Theil–Sen estimator of a set of sample points with outliers (black line) compared to the non-robust ordinary least squares line for the same set (blue). The dashed green line represents the ground truth from which the samples were generated.

In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator,[1][2] slope selection,[3][4] the single median method,[5] the Kendall robust line-fit method,[6] and the Kendall–Theil robust line.[7] It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively,[8] and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient.[9]

Theil-Sen regression has several advantages over Ordinary least squares regression. It is insensitive to outliers. It can be used for significance tests even when residuals are not normally distributed.[10] It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power.[11] It has been called "the most popular nonparametric technique for estimating a linear trend".[2] There are fast algorithms for efficiently computing the parameters.

Definition

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points (xi, yi) is the median m of the slopes (yjyi)/(xjxi) determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two data points have the same x coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct x coordinates.[8]

Once the slope m has been determined, one may determine a line from the sample points by setting the y-intercept b to be the median of the values yimxi. The fit line is then the line y = mx + b with coefficients m and b in slope–intercept form.[12] As Sen observed, this choice of slope makes the Kendall tau rank correlation coefficient become approximately zero, when it is used to compare the values xi with their associated residuals yimxib. Intuitively, this suggests that how far the fit line passes above or below a data point is not correlated with whether that point is on the left or right side of the data set. The choice of b does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points.[9]

A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points[13] and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.[11]

Variations

A variation of the Theil–Sen estimator, the repeated median regression of Siegel (1982), determines for each sample point (xi, yi), the median mi of the slopes (yjyi)/(xjxi) of lines through that point, and then determines the overall estimator as the median of these medians. It can tolerate a greater number of outliers than the Theil–Sen estimator, but known algorithms for computing it efficiently are more complicated and less practical.[14]

A different variant pairs up sample points by the rank of their x-coordinates: the point with the smallest coordinate is paired with the first point above the median coordinate, the second-smallest point is paired with the next point above the median, and so on. It then computes the median of the slopes of the lines determined by these pairs of points, gaining speed by examining significantly fewer pairs than the Theil–Sen estimator.[15]

Variations of the Theil–Sen estimator based on weighted medians have also been studied, based on the principle that pairs of samples whose x-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.[16]

For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.[17]

Statistical properties

The Theil–Sen estimator is an unbiased estimator of the true slope in simple linear regression.[18] For many distributions of the response error, this estimator has high asymptotic efficiency relative to least-squares estimation.[19] Estimators with low efficiency require more independent observations to attain the same sample variance of efficient unbiased estimators.

The Theil–Sen estimator is more robust than the least-squares estimator because it is much less sensitive to outliers. It has a breakdown point of

meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data-points without degradation of its accuracy.[12] However, the breakdown point decreases for higher-dimensional generalizations of the method.[20] A higher breakdown point, 50%, holds for a different robust line-fitting algorithm, the repeated median estimator of Siegel.[12]

The Theil–Sen estimator is equivariant under every linear transformation of its response variable, meaning that transforming the data first and then fitting a line, or fitting a line first and then transforming it in the same way, both produce the same result.[21] However, it is not equivariant under affine transformations of both the predictor and response variables.[20]

Algorithms

The median slope of a set of n sample points may be computed exactly by computing all O(n2) lines through pairs of points, and then applying a linear time median finding algorithm. Alternatively, it may be estimated by sampling pairs of points. This problem is equivalent, under projective duality, to the problem of finding the crossing point in an arrangement of lines that has the median x-coordinate among all such crossing points.[22]

The problem of performing slope selection exactly but more efficiently than the brute force quadratic time algorithm has been extensively studied in computational geometry. Several different methods are known for computing the Theil–Sen estimator exactly in O(n log n) time, either deterministically[3] or using randomized algorithms.[4] Siegel's repeated median estimator can also be constructed in the same time bound.[23] In models of computation in which the input coordinates are integers and in which bitwise operations on integers take constant time, the Theil–Sen estimator can be constructed even more quickly, in randomized expected time .[24]

An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.[25]

Implementations

In the R statistics package, both the Theil–Sen estimator and Siegel's repeated median estimator are available through the mblm library.[26] A free standalone Visual Basic application for Theil–Sen estimation, KTRLine, has been made available by the US Geological Survey.[27] The Theil–Sen estimator has also been implemented in Python as part of the SciPy and scikit-learn libraries.[28]

Applications

Theil–Sen estimation has been applied to astronomy due to its ability to handle censored regression models.[29] In biophysics, Fernandes & Leblanc (2005) suggest its use for remote sensing applications such as the estimation of leaf area from reflectance data due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors".[30] For measuring seasonal environmental data such as water quality, a seasonally adjusted variant of the Theil–Sen estimator has been proposed as preferable to least squares estimation due to its high precision in the presence of skewed data.[17] In computer science, the Theil–Sen method has been used to estimate trends in software aging.[31] In meteorology and climatology, it has been used to estimate the long-term trends of wind occurrence and speed.[32]

See also

Notes

  1. ^ Gilbert (1987).
  2. ^ a b El-Shaarawi & Piegorsch (2001).
  3. ^ a b Cole et al. (1989); Katz & Sharir (1993); Brönnimann & Chazelle (1998).
  4. ^ a b Dillencourt, Mount & Netanyahu (1992); Matoušek (1991); Blunck & Vahrenhold (2006).
  5. ^ Massart et al. (1997)
  6. ^ Sokal & Rohlf (1995); Dytham (2011).
  7. ^ Granato (2006)
  8. ^ a b Theil (1950); Sen (1968)
  9. ^ a b Sen (1968); Osborne (2008).
  10. ^ Helsel, Dennis R.; Hirsch, Robert M.; Ryberg, Karen R.; Archfield, Stacey A.; Gilroy, Edward J. (2020). Statistical methods in water resources. Techniques and Methods. Reston, VA: U.S. Geological Survey. p. 484. Retrieved 2020-05-22.
  11. ^ a b Wilcox (2001).
  12. ^ a b c Rousseeuw & Leroy (2003), pp. 67, 164.
  13. ^ For determining confidence intervals, pairs of points must be sampled with replacement; this means that the set of pairs used in this calculation includes pairs in which both points are the same as each other. These pairs are always outside the confidence interval, because they do not determine a well-defined slope value, but using them as part of the calculation causes the confidence interval to be wider than it would be without them.
  14. ^ Logan (2010), Section 8.2.7 Robust regression; Matoušek, Mount & Netanyahu (1998)
  15. ^ De Muth (2006).
  16. ^ Jaeckel (1972); Scholz (1978); Sievers (1978); Birkes & Dodge (1993).
  17. ^ a b Hirsch, Slack & Smith (1982).
  18. ^ Sen (1968), Theorem 5.1, p. 1384; Wang & Yu (2005).
  19. ^ Sen (1968), Section 6; Wilcox (1998).
  20. ^ a b Wilcox (2005).
  21. ^ Sen (1968), p. 1383.
  22. ^ Cole et al. (1989).
  23. ^ Matoušek, Mount & Netanyahu (1998).
  24. ^ Chan & Pătraşcu (2010).
  25. ^ Bagchi et al. (2007).
  26. ^ Logan (2010), p. 237; Vannest, Davis & Parker (2013)
  27. ^ Vannest, Davis & Parker (2013); Granato (2006)
  28. ^ SciPy community (2015); Persson & Martins (2016)
  29. ^ Akritas, Murphy & LaValley (1995).
  30. ^ Fernandes & Leblanc (2005).
  31. ^ Vaidyanathan & Trivedi (2005).
  32. ^ Romanić et al. (2014).

References

Read other articles:

Roman general and politician (236/235–183 BC) For other uses, see Scipio Africanus (disambiguation) and Publius Cornelius Scipio. Scipio AfricanusBust likely of Scipio Africanus (formerly identified as Sulla), originally found near his family tomb[1]Born236 or 235 BCRomeDiedc. 183 BCLiternumNationalityRomanKnown forDefeating HannibalOffice Proconsul (Spain, 216–210 BC) Consul (205 BC) Proconsul (Africa, 204–201 BC) Censor (199 BC) Consul ...

 

مينامياواجي    علم   الإحداثيات 34°17′40″N 134°46′47″E / 34.294444444444°N 134.77983333333°E / 34.294444444444; 134.77983333333  [1] تاريخ التأسيس 11 يناير 2005  تقسيم إداري  البلد اليابان[2][3]  التقسيم الأعلى هيوغو  خصائص جغرافية  المساحة 229.01 كيلومتر مربع (1 أكتوبر 2018...

 

نوركمعلوماتالنوع معطفتعديل - تعديل مصدري - تعديل ويكي بيانات النَوْرَك[1] أو البَرْكَة[2] هو نوعٌ من المعاطف ذات الغطاء، وغالبًا ما تُبَطَّنُ بالفراء أو الفراء الصناعي. النَوْرَك ملبس أساسي في ملابس الإنويت، المصنوعة تقليدياً من جلد الرنة أو الفقمة. تُلبَس عند الصيد

يفتقر محتوى هذه المقالة إلى الاستشهاد بمصادر. فضلاً، ساهم في تطوير هذه المقالة من خلال إضافة مصادر موثوق بها. أي معلومات غير موثقة يمكن التشكيك بها وإزالتها. (ديسمبر 2018) حارة المقالح  - حارة -  تقسيم إداري البلد  اليمن المحافظة محافظة صنعاء المديرية مديرية ضواحي ا

 

The Brown House, 1935 The Brown House (Jerman: Braunes Haus) adalah markas besar dari Partai Sosialis Nasional (Nationalsozialistische Deutsche Arbeiterpartei) di Jerman. Bangunan yang terletak di di jalan 45 Brienner Straße di Munich, Bavaria ini memiliki struktur batu yang besar dan mengesankan dan diberi nama sesuai dengan warna seragam partai. Pada 1930, kantor pusat partai di jalan Schellingstrasse 50 terlalu kecil (dengan jumlah pekerja meningkat dari hanya empat pada 1925 menjadi 50 p...

 

Joseph JohnsonJoseph JohnsonLahir(1738-11-15)15 November 1738LondonMeninggal20 Desember 1809(1809-12-20) (umur 71)LondonPekerjaanpenjual buku, penerbit Joseph Johnson (15 November 1738 – 20 Desember 1809) adalah penjual buku dan penerbit London dari abad ke-18. Johnson dikenal karena menerbitkan karya-karya pemikir radikal seperti Mary Wollstonecraft, William Godwin, Thomas Malthus, Joel Barlow, Priscilla Wakefield, Joseph Priestley, Anna Laetitia Barbauld, dan Gilbert Wakefield. Johns...

Measurement of X-ray absorption of a material as a function of energy This article may be too technical for most readers to understand. Please help improve it to make it understandable to non-experts, without removing the technical details. (June 2019) (Learn how and when to remove this template message) Three regions of XAS data Extended X-ray absorption fine structure (EXAFS), along with X-ray absorption near edge structure (XANES), is a subset of X-ray absorption spectroscopy (XAS). Like o...

 

Nogizaka46Informasi latar belakangAsalNogizaka, Tokyo, JepangGenrePopTahun aktif2011– sekarangLabelSony Music Records (2011- )Artis terkaitSakamichi SeriesSitus webwww.nogizaka46.com Nogizaka46 (乃木坂46, Nogizaka Forty-six) adalah grup idola wanita Jepang yang diproduseri oleh Yasushi Akimoto, yang diciptakan sebagai saingan resmi (公式ライバル, kōshiki raibaru) dari grup AKB48. Mereka adalah grup pertama dari Seri Sakamichi, yang juga mencakup grup saudari Sakurazaka46 (sebelumn...

 

1 Samuel 8Kitab Samuel (Kitab 1 & 2 Samuel) lengkap pada Kodeks Leningrad, dibuat tahun 1008.KitabKitab 1 SamuelKategoriNevi'imBagian Alkitab KristenPerjanjian LamaUrutan dalamKitab Kristen9← pasal 7 pasal 9 → 1 Samuel 8 (atau I Samuel 8, disingkat 1Sam 8) adalah bagian dari Kitab 1 Samuel dalam Alkitab Ibrani dan Perjanjian Lama di Alkitab Kristen. Dalam Alkitab Ibrani termasuk Nabi-nabi Awal atau Nevi'im Rishonim [נביאים ראשונים] dalam bagian Nevi'im (נביאי...

本條目存在以下問題,請協助改善本條目或在討論頁針對議題發表看法。 此條目已列出參考文獻,但文內引註不足,部分內容的來源仍然不明。 (2023年3月13日)请加上合适的文內引註来改善此条目。 此條目或其章節极大或完全地依赖于某个单一的来源。 (2023年3月13日)请协助補充多方面可靠来源以改善这篇条目。致使用者:请搜索一下条目的标题(来源搜索:國立嘉義高級工...

 

This article does not cite any sources. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Kyrie Vivaldi – news · newspapers · books · scholar · JSTOR (October 2014) (Learn how and when to remove this template message) Kyrie, RV 587 The Kyrie in G minor (RV 587) by Antonio Vivaldi is a setting of the Kyrie for two cori (two orchestras, each with respective four-p...

 

1983 filmThe Salt PrinceTheatrical release poster by V. VavrekGermanSoľ nad zlato, Der Salzprinz Directed byMartin HollýWritten byMartin HollýPeter KováčikStarringLibuše ŠafránkováKarol MachataGábor Nagy (actor)Jozef KronerCinematographyDodo ŠimončičEdited byMaximilián RemeňMusic byKarel SvobodaProductioncompaniesSlovenský film Koliba, Omnia film MunchenDistributed bySlovenská požičovňa filmov BratislavaRelease date February 27, 1983 (1983-02-27) Running tim...

1968 studio album by Scott WalkerScott 2Studio album by Scott WalkerReleasedMarch 1968 (1968-03)[1]July 1968 (1968-07) (US)Recorded1967–1968GenreBaroque popLength43:47LabelPhilips, SmashProducerJohn FranzScott Walker chronology Scott(1967) Scott 2(1968) Scott 3(1969) Singles from Scott 2 Jackie b/w The PlagueReleased: 1967 Professional ratingsReview scoresSourceRatingAllMusic[2]Pitchfork Media(9.1/10)[3] Scott 2 is the second solo album by...

 

Racing car constructor ApollonFull nameApollonFounder(s)Loris KesselNoted drivers Loris KesselFormula One World Championship careerFirst entry1977 Italian Grand PrixRaces entered1 (no starts) [1]EnginesCosworthRace victories0Points0Pole positions0Fastest laps0Final entry1977 Italian Grand Prix Apollon was a Formula One racing car constructor from Switzerland. The team participated in one Formula One World Championship Grand Prix but failed to qualify.[2] The team was formed by...

 

Indonesian entrepreneur (born 1981) William TanuwijayaBorn (1981-11-11) 11 November 1981 (age 42)Pematang Siantar, North Sumatra, IndonesiaNationalityIndonesianAlma materBina Nusantara UniversityKnown forCo-Founder, TokopediaSpouseFelicia HW William Tanuwijaya (born 11 November 1981) is an Indonesian entrepreneur. He is the co-founder of Tokopedia, an Indonesian technology company with a leading e-Commerce business. Tanuwijaya represents Indonesia as Young Global Leader at Worl...

For other people named John McCrea, see John McCrea (disambiguation). John Livingstone McCreaBorn(1891-05-29)May 29, 1891Marlette, Michigan, U.S.DiedJanuary 25, 1990(1990-01-25) (aged 98)[1]Needham, Massachusetts, U.S.Education 1915, USNA 1923, Naval War College 1929, LL.B., GWU Law 1934, LL.M., GWU Law SpouseMartha (?–his death)Children Meredith Coyne Annie Sullivan stepson, Philip H. Tobey stepdaughter, Julia C. Tobey Military careerAllegiance United States NavyYears...

 

Sarah Jane Smith Personaje de Doctor Who y The Sarah Jane Adventures Elisabeth Sladen, intérprete de Sarah Jane SmithPrimera aparición The Time Warrior (1973)Última aparición The Hand of Fear (como regular en Doctor Who, 1976)El fin del tiempo (como invitada en Doctor Who, 2010)The Man Who Never Was (en The Sarah Jane Adventures, 2011)Interpretado por Elisabeth SladenDoblador en España Elena de Maeztu (serie clásica)Olga Cano (serie moderna)Información personalEstatus actual MuertaNaci...

 

Raphael Schäfer Datos personalesNacimiento Kędzierzyn-Koźle, Polonia30 de enero de 1979 (44 años)Nacionalidad(es) Altura 1,90 metrosCarrera deportivaDeporte FútbolClub profesionalDebut deportivo 1996(Hannover 96)Posición PorteroRetirada deportiva 2017[editar datos en Wikidata] Raphael Schäfer (Kędzierzyn-Koźle, Polonia, nació el 30 de enero de 1979) es un exfutbolista alemán. Jugaba de portero. Clubes Club País Año Hannover 96 Alemania Alemania 1996-1998 VfB L...

American blues pianist and singer Little Johnny JonesBirth nameJohnnie JonesBorn(1924-11-01)November 1, 1924Jackson, Mississippi, U.S.DiedNovember 19, 1964(1964-11-19) (aged 40)Chicago, IllinoisGenresBluesOccupation(s)MusicianInstrument(s)Vocals, piano, harmonicaYears active1946–1964Musical artist Little Johnny Jones (born Johnnie Jones; November 1, 1924 – November 19, 1964)[1] was an American Chicago blues pianist and singer, best known for his work with Tampa R...

 

Expulsion of a fetus from the pregnant mother's uterus This article is about birth in humans. For birth in non-human mammals and other animals, see Birth. Medical conditionChildbirthOther namesLabour and delivery, partus, giving birth, parturition, birth, confinement[1][2]Mother and newborn baby shown with vernix caseosa coveringSpecialtyObstetrics, midwiferyComplicationsObstructed labour, postpartum bleeding, eclampsia, postpartum infection, birth asphyxia, neonatal hypotherm...

 

Strategi Solo vs Squad di Free Fire: Cara Menang Mudah!