Damerau–Levenshtein distance

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein[1][2][3]) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).[4][2]

In his seminal paper,[5] Damerau stated that in an investigation of spelling errors for an information-retrieval system, more than 80% were a result of a single error of one of the four types. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.[6]

Definition

To express the Damerau–Levenshtein distance between two strings and , a function is defined, whose value is a distance between an -symbol prefix (initial substring) of string and a -symbol prefix of .

The restricted distance function is defined recursively as:[7]: A:11  where is the indicator function equal to 0 when and equal to 1 otherwise.

Each recursive call matches one of the cases covered by the Damerau–Levenshtein distance:

  • corresponds to a deletion (from a to b),
  • corresponds to an insertion (from a to b),
  • corresponds to a match or mismatch, depending on whether the respective symbols are the same,
  • corresponds to a transposition between two successive symbols.

The Damerau–Levenshtein distance between a and b is then given by the function value for full strings: , where denotes the length of string a, and is the length of b.

Algorithm

Presented here are two algorithms: the first,[8] simpler one, computes what is known as the optimal string alignment distance or restricted edit distance,[7] while the second one[9] computes the Damerau–Levenshtein distance with adjacent transpositions. Adding transpositions adds significant complexity. The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second one presents no such restriction.

Take for example the edit distance between CA and ABC. The Damerau–Levenshtein distance LD(CAABC) = 2 because CAACABC, but the optimal string alignment distance OSA(CAABC) = 3 because if the operation CAAC is used, it is not possible to use ACABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore the shortest sequence of operations is CAAABABC. Note that for the optimal string alignment distance, the triangle inequality does not hold: OSA(CAAC) + OSA(ACABC) < OSA(CAABC), and so it is not a true metric.

Optimal string alignment distance

Optimal string alignment distance can be computed using a straightforward extension of the Wagner–Fischer dynamic programming algorithm that computes Levenshtein distance. In pseudocode:

algorithm OSA-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer
    
    let d[0..length(a), 0..length(b)] be a 2-d array of integers, dimensions length(a)+1, length(b)+1
    // note that d is zero-indexed, while a and b are one-indexed.
    
    for i := 0 to length(a) inclusive do
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[0, j] := j
    
    for i := 1 to length(a) inclusive do
        for j := 1 to length(b) inclusive do
            if a[i] = b[j] then
                cost := 0
            else
                cost := 1
            d[i, j] := minimum(d[i-1, j] + 1,     // deletion
                               d[i, j-1] + 1,     // insertion
                               d[i-1, j-1] + cost)  // substitution
            if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then
                d[i, j] := minimum(d[i, j],
                                   d[i-2, j-2] + 1)  // transposition
    return d[length(a), length(b)]

The difference from the algorithm for Levenshtein distance is the addition of one recurrence:

if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then
    d[i, j] := minimum(d[i, j],
                       d[i-2, j-2] + 1)  // transposition

Distance with adjacent transpositions

The following algorithm computes the true Damerau–Levenshtein distance with adjacent transpositions; this algorithm requires as an additional parameter the size of the alphabet Σ, so that all entries of the arrays are in [0, |Σ|):[7]: A:93 

algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer
    
    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0
    
    let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at −1, while a, b and da are one-indexed.
    
    maxdist := length(a) + length(b)
    d[−1, −1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, −1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[−1, j] := maxdist
        d[0, j] := j
    
    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i−1, j−1] + cost,  //substitution
                               d[i,   j−1] + 1,     //insertion
                               d[i−1, j  ] + 1,     //deletion
                               d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition
        da[a[i]] := i
    return d[length(a), length(b)]

To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance, note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. (This holds as long as the cost of a transposition, , is at least the average of the cost of an insertion and deletion, i.e., .[9]) Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: , where M and N are string lengths. Using the ideas of Lowrance and Wagner,[9] this naive algorithm can be improved to be in the worst case, which is what the above pseudocode does.

It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of[1] for an example of such an adaptation.

Applications

Damerau–Levenshtein distance plays an important role in natural language processing. In natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. Oommen and Loke[8] even mitigated the limitation of the restricted edit distance by introducing generalized transpositions. Nevertheless, one must remember that the restricted edit distance usually does not satisfy the triangle inequality, and thus cannot be used with metric trees.

DNA

Since DNA frequently undergoes insertions, deletions, substitutions, and transpositions, and each of these operations occurs on approximately the same timescale, the Damerau–Levenshtein distance is an appropriate metric of the variation between two strands of DNA.[6] More common in DNA, protein, and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman–Wunsch algorithm or Smith–Waterman algorithm.[citation needed]

Fraud detection

The algorithm can be used with any set of words, like vendor names. Since entry is manual by nature, there is a risk of entering a false vendor. A fraudster employee may enter one real vendor such as "Rich Heir Estate Services" versus a false vendor "Rich Hier State Services". The fraudster would then create a false bank account and have the company route checks to the real vendor and false vendor. The Damerau–Levenshtein algorithm will detect the transposed and dropped letter and bring attention of the items to a fraud examiner.

Export control

The U.S. Government uses the Damerau–Levenshtein distance with its Consolidated Screening List API.[10]

See also

  • Ispell – Suggests corrections that are based on a Damerau–Levenshtein distance of 1
  • Typosquatting – Form of cybersquatting which relies on mistakes when inputting a website address

References

  1. ^ Brill, Eric; Moore, Robert C. (2000). An Improved Error Model for Noisy Channel Spelling Correction (PDF). Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. pp. 286–293. doi:10.3115/1075218.1075255. Archived from the original (PDF) on 2012-12-21.
  2. ^ a b Bard, Gregory V. (2007), "Spelling-error tolerant, order-independent pass-phrases via the Damerau–Levenshtein string-edit distance metric", Proceedings of the Fifth Australasian Symposium on ACSW Frontiers : 2007, Ballarat, Australia, January 30 - February 2, 2007, Conferences in Research and Practice in Information Technology, vol. 68, Darlinghurst, Australia: Australian Computer Society, Inc., pp. 117–124, ISBN 978-1-920682-49-1.
  3. ^ Li; et al. (2006). Exploring distributional similarity based models for query spelling correction (PDF). Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. pp. 1025–1032. doi:10.3115/1220175.1220304. Archived from the original (PDF) on 2010-04-01.
  4. ^ Levenshtein, Vladimir I. (February 1966), "Binary codes capable of correcting deletions, insertions, and reversals", Soviet Physics Doklady, 10 (8): 707–710, Bibcode:1966SPhD...10..707L
  5. ^ Damerau, Fred J. (March 1964), "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7 (3): 171–176, doi:10.1145/363958.363994, S2CID 7713345
  6. ^ a b Majorek, Karolina A.; Dunin-Horkawicz, Stanisław; et al. (2013), "The RNase H-like superfamily: new members, comparative structural analysis and evolutionary classification", Nucleic Acids Research, 42 (7): 4160–4179, doi:10.1093/nar/gkt1414, PMC 3985635, PMID 24464998
  7. ^ a b c Boytsov, Leonid (May 2011). "Indexing methods for approximate dictionary searching". Journal of Experimental Algorithmics. 16: 1. doi:10.1145/1963190.1963191. S2CID 15635688.
  8. ^ a b Oommen, B. J.; Loke, R. K. S. (1997). "Pattern recognition of strings with substitutions, insertions, deletions and generalized transpositions". Pattern Recognition. 30 (5): 789–800. Bibcode:1997PatRe..30..789O. CiteSeerX 10.1.1.50.1459. doi:10.1016/S0031-3203(96)00101-X.
  9. ^ a b c Lowrance, Roy; Wagner, Robert A. (April 1975), "An Extension of the String-to-String Correction Problem", J ACM, 22 (2): 177–183, doi:10.1145/321879.321880, S2CID 18892193
  10. ^ "Consolidated Screening List API". Trade.gov Developer Portal. International Trade Administration, U.S. Department of Commerce. Archived from the original on 2019-10-30.

Further reading

Read other articles:

Election for the governorship of the U.S. state of Oklahoma See also: 2022 United States gubernatorial elections 2022 Oklahoma gubernatorial election ← 2018 November 8, 2022 2026 → Turnout50.2%   Nominee Kevin Stitt Joy Hofmeister Party Republican Democratic Popular vote 639,484 481,904 Percentage 55.4% 41.8% Precinct results County results Congressional district results State Senate district resultsStitt:      40–50%   ...

 

Bibliography of the history of the Caucasus CaucasusTopography of the CaucasusCoordinates42°15′40″N 44°07′16″E / 42.26111°N 44.12111°E / 42.26111; 44.12111Countries[1][2] Armenia Azerbaijan Georgia Russia Related areas  Iran  Turkey Partially recognized or unrecognized countries Abkhazia Artsakh[note 1] South OssetiaAutonomous republics and federal regions Abkhazia(since 2008, in exile)  Adjara...

 

Tyrone Mings Mings bermain untuk Aston Villa pada 2021Informasi pribadiNama lengkap Tyrone Deon Mings[1]Tanggal lahir 13 Maret 1993 (umur 30)Tempat lahir Bath, InggrisTinggi 6 ft 5 in (1,96 m)[2]Posisi bermain Bek tengahInformasi klubKlub saat ini Aston VillaNomor 40Karier junior2001–2009 Southampton2009–2011 Bristol RoversKarier senior*Tahun Tim Tampil (Gol)2011–2012 Yate Town 2012 Chippenham Town 10 (0)2012–2015 Ipswich Town 57 (1)2015–2019 AFC ...

هذه المقالة يتيمة إذ تصل إليها مقالات أخرى قليلة جدًا. فضلًا، ساعد بإضافة وصلة إليها في مقالات متعلقة بها. (نوفمبر 2021) كلية إدارة الأعمال بجامعة كولورادو دنفر   معلومات التأسيس 1985  الموقع الجغرافي إحداثيات 39°44′51″N 104°59′54″W / 39.74755°N 104.99833°W / 39.74755; -104.99833  ال

 

Kolonel CZI (Purn.)Adrianus SooaiAdrianus Sooai, menjelang pensiun tahun 1996.Ketua Dewan Perwakilan Rakyat Daerah Provinsi Nusa Tenggara Timur ke-9Masa jabatan1989–1992PresidenSoehartoGubernurdr. Hendrikus FernandezWakilFelix Jos Pullu, SHPendahuludr. Hendrikus FernandezPenggantiJan Jos BhotaBupati Sumba Timur ke-4Masa jabatan1984–1989PresidenSoehartoGubernurBrigjen TNI (Purn.) dr. Aloysius Benedictus Mboi, M.P.H.Pendahuludr. Lapoe MoekoePenggantiT. P. Munthe Informasi pribadiLah...

 

Pheidippides Patung Pheidippides di Jalan Marathon.Lahirsekitar 530 SMAthenaMeninggalsekitar 490 SMAthena Pheidippides (bahasa Yunani: Φειδιππίδης, terkadang disebut Phidippides, oleh Herodotos dan Plutarkhos,[1] atau Philippides), pahlawan Yunani kuno, adalah tokoh utama dalah kisah yang menjadi inspirasi bagi perlombaan olahraga modern, maraton. Kisah tradisionalnya menceritakan bahwa Pheidippides (530 SM–490 SM), sorang pembawa pesan atau hemerodrome Athena[2&#...

Chemical compound HydroxyflutamideClinical dataOther names2-Hydroxyflutamide; HF; OHF; Flutamide-hydroxide; SCH-16423; Hydroxyniphtholide; Hydroxyniftolide; α,α,α-Trifluoro-2-methyl-4'-nitro-m-lactotoluidideDrug classNonsteroidal antiandrogenIdentifiers IUPAC name 2-hydroxy-2-methyl-N-[4-nitro-3-(trifluoromethyl)phenyl]propanamide CAS Number52806-53-8PubChem CID91649IUPHAR/BPS2862ChemSpider82752UNII31D90UKP5YKEGGC14204ChEBICHEBI:43064ChEMBLChEMBL491CompTox Dashboard (EPA)DTXSID8033562 ECHA...

 

العلاقات الوسط أفريقية القبرصية جمهورية أفريقيا الوسطى قبرص   جمهورية أفريقيا الوسطى   قبرص تعديل مصدري - تعديل   العلاقات الوسط أفريقية القبرصية هي العلاقات الثنائية التي تجمع بين جمهورية أفريقيا الوسطى وقبرص.[1][2][3][4][5] مقارنة بين البلدين

 

Map all coordinates using: OpenStreetMap Download coordinates as: KML GPX (all coordinates) GPX (primary coordinates) GPX (secondary coordinates) This is a list of association football stadiums in Scotland, ranked in descending order of capacity. The minimum required capacity is 1,000. Current stadiums This list is incomplete; you can help by adding missing items. (October 2016) No. Image Stadium[notes 1] Capacity City/town Team League[notes 2] Opened Surface Coordinates Notes...

Riversul   Município do Brasil   Símbolos Bandeira Brasão de armas Hino Lema Dominus pacis super nos regineA Rainha da Paz nos rege Gentílico riversulense Localização Localização de Riversul em São PauloLocalização de Riversul em São Paulo RiversulLocalização de Riversul no Brasil Mapa de Riversul Coordenadas 23° 49' 40 S 49° 25' 44 O País Brasil Unidade federativa São Paulo Municípios limítrofes SP: Itararé, Itaberá, Itaporanga, PR: S...

 

Bahrain tạiThế vận hộiMã IOCBRNNOCỦy ban Olympic BahrainHuy chương Vàng Bạc Đồng Tổng số 2 1 0 3 Tham dự Mùa hè1984198819921996200020042008201220162020 Bahrain đã tham gia 9 Thế vận hội Mùa hè và chưa từng tham gia Thế vận hội Mùa đông. Tất cả huy chương Thế vận hội của Bahrain đều được mang về bởi các vận động viên (VĐV) chạy đường dài gốc Phi nhập tịch. Tấm huy chương đầu tiên của nư...

 

马克-安德烈亚·许斯勒Marc-Andrea Hüsler國家/地區 瑞士居住地 瑞士吕施利孔出生 (1996-06-24) 1996年6月24日(27歲) 瑞士苏黎世身高1.96米(6英尺5英寸)轉職業年2016年持拍左手持拍(双手反拍)職業獎金$657,646單打成績職業戰績15–16 (ATP巡回赛)冠軍頭銜1 (ATP)最高排名No. 85 (2022-08-29)現今排名No. 95 (2022-09-26)大滿貫單打成績澳網Q2 (2021)法網Q2 (2021)溫網1R (2022)美網1R (2022)雙打...

Triatlo nosJogos Olímpicos de Verão da Juventude de 2014 masculino   feminino   Estafeta mista   O Triatlo feminino nos Jogos Olímpicos de Verão de 2014 foi disputado no dia 17 de Agosto no Lago Xuanwu em Nanquim, China.[1] 32 atletas de 32 países participaram do evento.[2][3] A vencedora da prova foi a australiana Brittany Dutton com um tempo de 59:56, batendo a medalhada de Prata Stephanie Jenks (EUA) por 37 segundos. A francesa Emilie Morier foi medalha de Bronze.[3] Es...

 

село Новий Замай-Юрт Новый Замай-Юрт Країна  Росія Суб'єкт Російської Федерації Чечня Муніципальний район Ножай-Юртовський район Поселення Мескетинське сільське поселення Код ЗКАТУ: 96225825004 Код ЗКТМО: 96625425116 Основні дані Населення ▲ 137 Поштовий індекс 366221 Географічні ...

 

赫拉多·马蒂诺Gerardo Martino 2017年執教阿特蘭大聯的马蒂诺個人信息全名 Gerardo Daniel Martino出生日期 (1962-11-20) 1962年11月20日(61歲)出生地點 阿根廷罗薩里奥位置 進攻中場俱乐部信息現在所屬 邁阿密國際 (領隊)青年隊 紐維爾舊生職業俱乐部*年份 球隊 出场 (进球)1980–1990 紐維爾舊生 392 (35)1991 特內里費 15 (1)1991–1994 紐維爾舊生 81 (2)1994–1995 拉努斯 30 (3)1995 紐維爾舊生 15 (...

Defunct airline of Australia and Taiwan (1990—1996); former Qantas subsidiary Australia Asia Airlines IATA ICAO Callsign IM AAU AUSTASIA Founded1990 (1990)Commenced operations1990 (1990)Ceased operations1996 (1996)Focus citiesSydneyTaipei–TaoyuanFrequent-flyer programQantas Frequent FlyerFleet size3Parent companyQantasHeadquartersBotany Bay, Sydney, New South Wales, AustraliaKey peopleJames Strong (CEO) Australia Asia Airlines Boeing 747SP at Perth Airport in the mid-1990s....

 

American actress Gloria HopeHope, c. 1920BornOlive Frances(1901-11-09)November 9, 1901Pittsburgh, Pennsylvania, U.S.DiedOctober 29, 1976(1976-10-29) (aged 74)Pasadena, California, U.S.OccupationActressYears active1917–1926Spouses Lloyd Hughes ​ ​(m. 1921; died 1958)​ Joe Bishow ​ ​(m. 1927; ann. 1928)​ Children2 Gloria Hope (born Olive Frances, November 9, 1901 – October 29, 1976)...

 

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Confessions of an Indian Teenager – news · newspapers · books · scholar · JSTOR (January 2020) (Learn how and when to remove this template message) Indian TV series or programme Confessions of an Indian TeenagerGenreTeen dramaWritten byRahul PatelCountry o...

Conan novelette by Robert E. Howard The Slithering ShadowShort story by Robert E. HowardOriginal titleXuthal of the DuskCountryUnited StatesLanguageEnglishGenre(s)FantasyPublicationPublished inWeird TalesPublication typePulp magazinePublisherRural Publishing CorporationPublication dateSeptember 1933ChronologySeriesConan the Cimmerian  Black Colossus   The Pool of the Black One The Slithering Shadow is one of the original short stories starring the fictional sword and sorcery he...

 

Estonian management scientist Ruth Alas Ruth Alas (5 August 1960 in Türi – 23 January 2018 in Tallinn) was an Estonian management scientist. She was the head of the Department of Management of the Estonian Business School until her death.[1][2] Alas wrote more than 100 articles and 23 textbooks in topics relating to management and business.[3] Education Türi Secondary School (1978)[4] Tallinn Polytechnical Institute (currently Tallinn Technical University),...

 

Strategi Solo vs Squad di Free Fire: Cara Menang Mudah!