Local case-control sampling

In machine learning, local case-control sampling [1] is an algorithm used to reduce the cost of training a logistic regression classifier. It does so by selecting a small subsample of the original dataset for training. The algorithm assumes the availability of a (possibly unreliable) pilot estimate of the parameters. It then performs a single pass over the entire dataset, using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods such as case-control sampling and weighted case-control sampling.

Imbalanced datasets

In classification, a dataset is a set of $N$ data points $(x_i, y_i)_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{0, 1\}$ is a label. Intuitively, a dataset is imbalanced when certain important statistical patterns are rare. The lack of observations of certain patterns does not always imply their irrelevance. For example, in medical studies of rare diseases, the small number of infected patients (cases) conveys the most valuable information for diagnosis and treatment.

Formally, an imbalanced dataset exhibits one or more of the following properties:

  • Marginal Imbalance. A dataset is marginally imbalanced if one class is rare compared to the other class. In other words, $\mathbb{P}(Y = 1) \approx 0$.
  • Conditional Imbalance. A dataset is conditionally imbalanced when it is easy to predict the correct labels in most cases. For example, if $X \in \{0, 1\}$, the dataset is conditionally imbalanced if $\mathbb{P}(Y = 1 \mid X = 0) \approx 0$ and $\mathbb{P}(Y = 1 \mid X = 1) \approx 1$.

Algorithm outline

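In logistic regression, given the model $\theta = (\alpha, \beta)$, the prediction is made according to $\mathbb{P}(Y = 1 \mid X = x; \theta) = \tilde{p}_{\theta}(x) = \frac{\exp(\alpha + \beta^{T} x)}{1 + \exp(\alpha + \beta^{T} x)}$. The local case-control sampling algorithm assumes the availability of a pilot model $\tilde{\theta} = (\tilde{\alpha}, \tilde{\beta})$. Given the pilot model, the algorithm performs a single pass over the entire dataset to select the subset of samples used to train the logistic regression model. For a sample $(x, y)$, define the acceptance probability as $a(x, y) = |y - \tilde{p}_{\tilde{\theta}}(x)|$. The algorithm proceeds as follows:

  1. Generate independent $z_i \sim \text{Bernoulli}(a(x_i, y_i))$ for $i \in \{1, \ldots, N\}$.
  2. Fit a logistic regression model to the subsample $S = \{(x_i, y_i) : z_i = 1\}$, obtaining the unadjusted estimates $\hat{\theta}_S = (\hat{\alpha}_S, \hat{\beta}_S)$.
  3. The output model is $\hat{\theta} = (\hat{\alpha}, \hat{\beta})$, where $\hat{\alpha} \leftarrow \hat{\alpha}_S + \tilde{\alpha}$ and $\hat{\beta} \leftarrow \hat{\beta}_S + \tilde{\beta}$.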
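
One way to see why the last step adds the pilot parameters back is the short calculation below. It is a sketch under the assumption that the logistic model is correctly specified with true parameters $(\alpha, \beta)$. The notation is chosen here for illustration: $p(x)$ is the true probability that $Y = 1$ given $X = x$, $\tilde{p}(x)$ is the pilot's predicted probability, $z$ is the acceptance indicator, and a sample $(x, y)$ is accepted with probability $|y - \tilde{p}(x)|$.

```latex
\begin{aligned}
\frac{\mathbb{P}(Y = 1 \mid x, z = 1)}{\mathbb{P}(Y = 0 \mid x, z = 1)}
  &= \frac{p(x)\,\bigl(1 - \tilde{p}(x)\bigr)}{\bigl(1 - p(x)\bigr)\,\tilde{p}(x)}, \\
\log \frac{\mathbb{P}(Y = 1 \mid x, z = 1)}{\mathbb{P}(Y = 0 \mid x, z = 1)}
  &= \operatorname{logit} p(x) - \operatorname{logit} \tilde{p}(x)
   = (\alpha - \tilde{\alpha}) + (\beta - \tilde{\beta})^{T} x .
\end{aligned}
```

Under these assumptions, a logistic regression fitted on the accepted samples targets the difference between the true and pilot parameters, so adding the pilot parameters recovers estimates on the original scale.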

The algorithm can be understood as selecting samples that surprise the pilot model. Intuitively, these samples are closer to the decision boundary of the classifier and are thus more informative.
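
The three steps can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes a single scalar feature, uses a hand-written Newton solver (`fit_logistic`) in place of a library fit, and the helper names `sigmoid`, `fit_logistic`, `simulate` and `local_case_control` are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=25, ridge=1e-8):
    """Fit P(y=1|x) = sigmoid(alpha + beta*x) by Newton's method (scalar x)."""
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0   # gradient of the log-likelihood
        h00 = ridge     # entries of the negative Hessian, with a tiny
        h01 = 0.0       # ridge on the diagonal to keep it invertible
        h11 = ridge
        for x, y in zip(xs, ys):
            p = sigmoid(alpha + beta * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H d = g in closed form.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


def simulate(n, alpha, beta, rng):
    """Draw n points from a logistic model with a standard normal feature."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(alpha + beta * x) else 0 for x in xs]
    return xs, ys


def local_case_control(xs, ys, pilot, rng):
    """Return the adjusted estimate (alpha, beta) and the subsample size."""
    alpha_t, beta_t = pilot
    sub_x, sub_y = [], []
    for x, y in zip(xs, ys):
        # Step 1: accept with probability |y - pilot prediction|.
        accept = abs(y - sigmoid(alpha_t + beta_t * x))
        if rng.random() < accept:
            sub_x.append(x)
            sub_y.append(y)
    # Step 2: unadjusted fit on the accepted subsample.
    alpha_s, beta_s = fit_logistic(sub_x, sub_y)
    # Step 3: add the pilot parameters back.
    return (alpha_s + alpha_t, beta_s + beta_t), len(sub_x)
```

For instance, calling `simulate(20000, -3.0, 2.0, rng)` and then `local_case_control(xs, ys, (-2.5, 1.5), rng)` with `rng = random.Random(0)` fits the model on only a fraction of the simulated points; the pilot `(-2.5, 1.5)` is an arbitrary, deliberately imperfect choice for the illustration.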

Obtaining the pilot model

When a pilot model is naturally available, the algorithm can be applied directly to reduce the cost of training. When no natural pilot exists, an estimate obtained from a subsample selected through another sampling technique can be used instead. In the original paper describing the algorithm, the authors propose using weighted case-control sampling with half the assigned sampling budget. For example, if the objective is to use a subsample of size $n$, first estimate a pilot model $\tilde{\theta}$ using $n/2$ samples from weighted case-control sampling, then collect another $n/2$ samples using local case-control sampling.

Larger or smaller sample size

It is possible to control the sample size by multiplying the acceptance probability by a constant $c$. For a larger sample size, pick $c > 1$ and adjust the acceptance probability to $\min(c \, a(x_i, y_i), 1)$. For a smaller sample size, the same strategy applies with $c < 1$. When an exact number of samples is required, a convenient alternative is to uniformly downsample from a larger subsample selected by local case-control sampling.
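
The two ways of controlling the sample size can be sketched as follows. The snippet assumes the acceptance probability is the absolute difference between the label and the pilot's predicted probability, scaled by a constant `c` and capped at 1; the function names `scaled_acceptance` and `downsample_exact` are hypothetical. It only illustrates the selection step and says nothing about how the fitted estimates should be adjusted afterwards.

```python
import random


def scaled_acceptance(y, pilot_prob, c=1.0):
    """Acceptance probability min(c * |y - pilot_prob|, 1)."""
    return min(c * abs(y - pilot_prob), 1.0)


def downsample_exact(selected, size, rng):
    """Uniformly downsample an already selected subsample to at most `size` items."""
    if size >= len(selected):
        return list(selected)
    return rng.sample(selected, size)
```

With `c = 2`, a case that the pilot already predicts with probability 0.9 is accepted with probability of about 0.2 instead of 0.1, while a control with the same pilot prediction is always accepted because the scaled value is capped at 1.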

Properties

The algorithm has the following properties. When the pilot is consistent, the estimates obtained from the local case-control subsample are consistent even under model misspecification. If the model is correct, the algorithm has exactly twice the asymptotic variance of logistic regression on the full dataset. For a larger sample size with $c > 1$, the factor 2 is improved to $1 + \frac{1}{c}$.
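
As a quick arithmetic illustration, taking the variance factor to be $1 + 1/c$, where $c \geq 1$ is the constant multiplying the acceptance probability, a few values of $c$ give:

```latex
1 + \frac{1}{c}: \qquad c = 1 \mapsto 2, \qquad c = 2 \mapsto 1.5, \qquad c = 4 \mapsto 1.25
```

The factor approaches 1, the full-data value, as $c$ grows and more of the dataset is accepted.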

References

  1. Fithian, William; Hastie, Trevor (2014). "Local case-control sampling: Efficient subsampling in imbalanced data sets". The Annals of Statistics. 42 (5): 1693–1724. arXiv:1306.3706. doi:10.1214/14-aos1220. PMC 4258397. PMID 25492979.
