Text corpus

A text corpus (Vietnamese: ngữ liệu văn bản) is a large, structured collection of texts (usually stored electronically and processed).^[1]^[2]

A corpus may contain texts in a single language (a monolingual corpus) or in several languages (a multilingual corpus). A multilingual corpus may be arranged so that its texts can be compared side by side, in which case it is called a parallel corpus. To make them more useful for linguistic research, corpora are often annotated. One example is part-of-speech tagging (POS tagging), in which words are labeled as nouns, verbs, adjectives, and other word classes.
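As a rough illustration of what part-of-speech annotation produces, the sketch below tags a two-sentence corpus by dictionary lookup. The lexicon `TOY_LEXICON`, the helper names `tag_sentence` and `tag_corpus`, the tag labels, and the `UNK` placeholder are all assumptions made for this example; real taggers are statistical models trained on annotated corpora rather than lookup tables.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping lowercased words to POS labels.
# A real tagger would learn these associations from an annotated corpus.
TOY_LEXICON: Dict[str, str] = {
    "the": "DET",
    "cat": "NOUN",
    "dog": "NOUN",
    "sees": "VERB",
    "small": "ADJ",
}


def tag_sentence(
    sentence: str, lexicon: Dict[str, str] = TOY_LEXICON
) -> List[Tuple[str, str]]:
    """Split on whitespace and pair each token with its label.

    Tokens missing from the lexicon get the placeholder label "UNK".
    """
    return [(tok, lexicon.get(tok.lower(), "UNK")) for tok in sentence.split()]


def tag_corpus(corpus: List[str]) -> List[List[Tuple[str, str]]]:
    """Annotate every sentence in a (monolingual) corpus."""
    return [tag_sentence(sentence) for sentence in corpus]


# A tiny monolingual corpus of two sentences.
corpus = ["The small cat sees the dog", "The dog sleeps"]
tagged = tag_corpus(corpus)

for sentence in tagged:
    print(" ".join(f"{word}/{tag}" for word, tag in sentence))
```

The annotated corpus is kept as a list of sentences, each a list of (word, tag) pairs, so the original tokens remain recoverable alongside their labels.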

References

^ What is a corpus? What is corpus linguistics?^{[dead link]}, Technische Universität Chemnitz.
^ Language Corpora, The University of Queensland.

See also

Concordance (publishing)
Corpus linguistics
Linguistic Data Consortium
Natural language processing
Natural Language Toolkit
Parallel text
Search engine: can access the "web corpus".
Speech corpus
Translation memory
Treebank
Zipf's law

External links

ACL SIGLEX Resource Links: Text Corpora (archived 2013-08-13 at the Wayback Machine)
Developing Linguistic Corpora: a Guide to Good Practice
Free samples (not free), web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
Intercorp Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.
Sketch Engine: Open corpora with free access
TS Corpus - A Turkish Corpus freely available for academic research.
Turkish National Corpus - A general-purpose corpus for contemporary Turkish (archived 2015-04-02 at the Wayback Machine)
Corpus of Political Speeches, publicly accessible with speeches from United States, Hong Kong, Taiwan, and China, provided by Hong Kong Baptist University Library
Russian National Corpus (archived 2019-04-14 at the Wayback Machine)

