Document classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents falls mainly within information science and computer science. The problems overlap, however, and document classification is therefore also a subject of interdisciplinary research.

The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.

Documents may be classified according to their subjects or according to other attributes (such as document type, author, or year of printing). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.

"Content-based" versus "request-based" classification

Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries that at least 20% of the content of a book should be about the class to which the book is assigned.[1] In automatic classification, the weight could be the number of times given words appear in a document.
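The word-count idea can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `CLASS_TERMS` vocabularies are hypothetical, and the 0.2 threshold only loosely echoes the 20% rule mentioned above, since a share of tokens is not the same as a cataloguer's judgment of how much of a work concerns a topic.

```python
import re
from collections import Counter

# Hypothetical class vocabularies (an assumption for this sketch); in practice
# they might come from a controlled vocabulary or be learned from data.
CLASS_TERMS = {
    "chess": {"chess", "pawn", "bishop", "checkmate"},
    "cooking": {"recipe", "oven", "flour", "bake"},
}


def subject_weights(text):
    """Return, per class, the share of the document's tokens that are class terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in CLASS_TERMS}
    return {
        name: sum(counts[term] for term in terms) / total
        for name, terms in CLASS_TERMS.items()
    }


def classify(text, threshold=0.2):
    """Assign every class whose weight reaches the threshold (multi-label)."""
    weights = subject_weights(text)
    return sorted(name for name, weight in weights.items() if weight >= threshold)


print(classify("The bishop takes the pawn: checkmate"))
```

Because every class above the threshold is returned, a document can be assigned to more than one class, or to none.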

Request-oriented classification (or request-oriented indexing) is classification in which the anticipated requests of users influence how documents are classified. The classifier asks, "Under which descriptors should this entity be found?" and is expected to "think of all the possible queries and decide for which ones the entity at hand is relevant" (Soergel, 1985, p. 230[2]).

Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.

Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions," he writes, "are quite meaningless and only serve to cause confusion" (Lancaster, 2003, p. 21[3]). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986,[4] 2004;[5] Broughton, 2008;[6] Riesthuis & Bliedung, 1991[7]). Labeling a document (say, by assigning it a term from a controlled vocabulary) therefore amounts to assigning that document to the class of documents indexed by that term: all documents indexed or classified as X belong to the same class of documents.
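The equivalence can be made concrete with a small Python sketch that inverts a subject index into classes. The document identifiers and descriptors in `SUBJECT_INDEX` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical subject index (an assumption): document id -> assigned descriptors.
SUBJECT_INDEX = {
    "doc1": ["chess", "literature"],
    "doc2": ["chess"],
    "doc3": ["literature", "austria"],
}


def classes_from_index(index):
    """Invert the index: each descriptor becomes the class of documents carrying it."""
    classes = defaultdict(set)
    for doc_id, terms in index.items():
        for term in terms:
            classes[term].add(doc_id)
    return dict(classes)


print(classes_from_index(SUBJECT_INDEX)["chess"])
```

The same data is read in two directions: per document it is a list of subjects, and per descriptor it is a class of documents.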

Automatic document classification (ADC)

Automatic document classification tasks can be divided into three sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification,[8] where only some of the documents are labeled by the external mechanism. Several software products are available under various license models.[9][10][11][12][13][14]
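As an illustration of the supervised case, the following is a minimal Python sketch of a multinomial naive Bayes classifier with add-one smoothing. The choice of naive Bayes, the tiny `TRAINING` set and the function names are all assumptions made for this example; the labeled pairs stand in for the "external mechanism" that supplies correct classifications.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled examples (an assumption); real systems need far more data.
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count documents per label and words per label from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log prior plus smoothed log likelihood."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(predict(model, "cheap money"))
```

In the unsupervised setting no labels would be available at all, so the documents would instead be grouped by similarity; in the semi-supervised setting only part of `TRAINING` would carry labels.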

Techniques

Automatic document classification techniques include:

Applications

Classification techniques have been applied to

  • spam filtering, which tries to distinguish spam email from legitimate email
  • email routing, forwarding an email sent to a general address to a specific address or mailbox depending on its topic[15]
  • language identification, automatically determining the language of a text
  • genre classification, automatically determining the genre of a text[16]
  • readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
  • sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document
  • health-related classification using social media in public health surveillance[17]
  • article triage, selecting articles that are relevant for manual literature curation, for example as the first step in building manually curated annotation databases in biology[18]

References

  1. ^ Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")
  2. ^ Soergel, Dagobert (1985). Organizing information: Principles of data base and retrieval systems. Orlando, FL: Academic Press.
  3. ^ Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.
  4. ^ Aitchison, J. (1986). "A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure." Journal of Documentation, Vol. 42 No. 3, pp. 160-181.
  5. ^ Aitchison, J. (2004). "Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule." Bliss Classification Bulletin, Vol. 46, pp. 20-26.
  6. ^ Broughton, V. (2008). "A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification (2nd Ed.)." Axiomathes, Vol. 18 No. 2, pp. 193-210.
  7. ^ Riesthuis, G. J. A., & Bliedung, St. (1991). "Thesaurification of the UDC." Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.
  8. ^ Rossi, R. G., Lopes, A. d. A., and Rezende, S. O. (2016). Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. Information Processing & Management, 52(2):217–257.
  9. ^ "An Interactive Automatic Document Classification Prototype" (PDF). Archived from the original (PDF) on 2017-11-15. Retrieved 2017-11-14.
  10. ^ Interactive Automatic Document Classification Prototype Archived April 24, 2015, at the Wayback Machine
  11. ^ Document Classification - Artsyl
  12. ^ ABBYY FineReader Engine 11 for Windows
  13. ^ Classifier - Antidot
  14. ^ "3 Document Classification Methods for Tough Projects". www.bisok.com. Retrieved 2021-08-04.
  15. ^ Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). Message classification in the call center. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158–165, ACL.
  16. ^ Santini, Marina; Rosso, Mark (2008), Testing a Genre-Enabled Application: A Preliminary Assessment (PDF), BCS IRSG Symposium: Future Directions in Information Access, London, UK, pp. 54–63, archived from the original (PDF) on 2019-11-15, retrieved 2011-10-21
  17. ^ X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7. doi:10.1109/SECON.2017.7925400
  18. ^ Krallinger, M; Leitner, F; Rodriguez-Penagos, C; Valencia, A (2008). "Overview of the protein-protein interaction annotation extraction task of BioCreative II". Genome Biology. 9 (Suppl 2): S4. doi:10.1186/gb-2008-9-s2-s4. PMC 2559988. PMID 18834495.
