Edit distance

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Different definitions of an edit distance use different sets of like operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.[1]

Types of edit distance

Different types of edit distance allow different sets of string operations. For instance:

Algorithm Operations Allowed
Insertions Deletions Substitutions Transposition
Levenshtein Distance
Longest Common Subsequence (LCS)
Hamming Distance
Damerau–Levenshtein Distance
Jaro distance

Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

Formal definition and properties

Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:[2]

Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
Deletion of a single symbol changes uxv to uv (x→ε).
Substitution of a single symbol x for a symbol yx changes uxv to uyv (xy).

In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(xy) with the operations.[2]

Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv.[3][4] For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice versa.[4]

Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost.[1]: 37  Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.[1] Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

  1. kitten → sitten (substitute "s" for "k")
  2. sitten → sittin (substitute "i" for "e")
  3. sittin → sitting (insert "g" at the end)

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:

  1. kitten → itten (delete "k" at 0)
  2. itten → sitten (insert "s" at 0)
  3. sitten → sittn (delete "e" at 4)
  4. sittn → sittin (insert "i" at 4)
  5. sittin → sitting (insert "g" at 6)

for a total cost/distance of 5 operations.

Properties

Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met:[1]: 37 

  • Every edit operation has positive cost;
  • for every operation, there is an inverse operation with equal cost.

With these properties, the metric axioms are satisfied as follows:

d(a, b) = 0 if and only if a=b, since each string can be trivially transformed to itself using exactly zero operations.
d(a, b) > 0 when ab, since this would require at least one operation at non-zero cost.
d(a, b) = d(b, a) by equality of the cost of each operation and its inverse.
Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c).[5]

Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.[1]

Other useful properties of unit-cost edit distances include:

  • LCS distance is bounded above by the sum of lengths of a pair of strings.[1]: 37 
  • LCS distance is an upper bound on Levenshtein distance.
  • For strings of the same length, Hamming distance is an upper bound on Levenshtein distance.[1]

Regardless of cost/weights, the following property holds of all edit distances:

  • When a and b share a common prefix, this prefix has no effect on the distance. Formally, when a = uv and b = uw, then d(a, b) = d(v, w).[4] This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time.

Computation

The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964.[6]

Common algorithm

Using Levenshtein's original operations, the (nonsymmetric) edit distance from to is given by , defined by the recurrence[2]

This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.[3]

The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer,[7] although it has a history of multiple invention.[2][3] After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at .

This algorithm has a time complexity of Θ(mn) where m and n are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations.[3] A linear-space solution to this problem is offered by Hirschberg's algorithm.[8]: 634  A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.[9]

Improved algorithms

Improving on the Wagner–Fisher algorithm described above, Ukkonen describes several variants,[10] one of which takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s×min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s2) or O(s), depending on whether the edit sequence needs to be read off.[3]

Further improvements by Landau, Myers, and Schmidt [1] give an O(s2 + max(m,n)) time algorithm.[11]

For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is of Masek and Paterson[12] having worst case runtime of O(nm/logn).

Applications

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems.

  • Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
  • Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string p, called the pattern, and a constant k; it then builds a deterministic finite state automaton that finds, in an arbitrary string s, a substring whose edit distance to p is at most k[13] (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance.
  • Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.[4]

Language edit distance

A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings. More formally, for any language L and string x over an alphabet Σ, the language edit distance d(L, x) is given by[14], where is the string edit distance. When the language L is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance.[15] For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance.[16]

Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.[14][17]

See also

References

  1. ^ a b c d e f g Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching" (PDF). ACM Computing Surveys. 33 (1): 31–88. CiteSeerX 10.1.1.452.6317. doi:10.1145/375360.375365. S2CID 207551224. Retrieved 19 March 2015.
  2. ^ a b c d Daniel Jurafsky; James H. Martin. Speech and Language Processing. Pearson Education International. pp. 107–111.
  3. ^ a b c d e Esko Ukkonen (1983). On approximate string matching. Foundations of Computation Theory. Springer. pp. 487–495. doi:10.1007/3-540-12689-9_129.
  4. ^ a b c d Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast string correction with Levenshtein automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. CiteSeerX 10.1.1.16.652. doi:10.1007/s10032-002-0082-8. S2CID 207046453.
  5. ^ Lei Chen; Raymond Ng (2004). On the marriage of Lp-norms and edit distance (PDF). Proc. 30th Int'l Conf. on Very Large Databases (VLDB). Vol. 30. doi:10.1016/b978-012088469-8.50070-x.
  6. ^ Kukich, Karen (1992). "Techniques for Automatically Correcting Words in Text" (PDF). ACM Computing Surveys. 24 (4): 377–439. doi:10.1145/146370.146380. S2CID 5431215. Archived from the original (PDF) on 2016-09-27. Retrieved 2017-11-09.
  7. ^ R. Wagner; M. Fischer (1974). "The string-to-string correction problem". J. ACM. 21: 168–178. doi:10.1145/321796.321811. S2CID 13381535.
  8. ^ Skiena, Steven (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. Bibcode:2008adm..book.....S. ISBN 978-1-849-96720-4.
  9. ^ Chowdhury, Rezaul; Le, Hai-Son; Ramachandran, Vijaya (July 2010). "Cache-oblivious dynamic programming for bioinformatics". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 7 (3): 495–510. doi:10.1109/TCBB.2008.94. PMID 20671320. S2CID 2532039.
  10. ^ Ukkonen, Esko (1985). "Algorithms for approximate string matching" (PDF). Information and Control. 64 (1–3): 100–118. doi:10.1016/S0019-9958(85)80046-2.
  11. ^ Landau; Myers; Schmidt (1998). "Incremental String Comparison". SIAM Journal on Computing. 27 (2): 557–582. CiteSeerX 10.1.1.38.1766. doi:10.1137/S0097539794264810.
  12. ^ Masek, William J.; Paterson, Michael S. (February 1980). "A faster algorithm computing string edit distances". Journal of Computer and System Sciences. 20 (1): 18–31. doi:10.1016/0022-0000(80)90002-1. hdl:1721.1/148933. ISSN 0022-0000.
  13. ^ Esko Ukkonen (1985). "Finding approximate patterns in strings". J. Algorithms. 6: 132–137. doi:10.1016/0196-6774(85)90023-9.
  14. ^ a b Bringmann, Karl; Grandoni, Fabrizio; Saha, Barna; Williams, Virginia Vassilevska (2016). "Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding via Fast Bounded-Difference Min-Plus Product" (PDF). 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). pp. 375–384. arXiv:1707.05095. doi:10.1109/focs.2016.48. ISBN 978-1-5090-3933-3. S2CID 17064578.
  15. ^ Aho, A.; Peterson, T. (1972-12-01). "A Minimum Distance Error-Correcting Parser for Context-Free Languages". SIAM Journal on Computing. 1 (4): 305–312. doi:10.1137/0201022. ISSN 0097-5397.
  16. ^ Wagner, Robert A. (1974). "Order-n correction for regular languages". Communications of the ACM. 17 (5): 265–268. doi:10.1145/360980.360995. S2CID 11063282.
  17. ^ Saha, B. (2014-10-01). The Dyck Language Edit Distance Problem in Near-Linear Time. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. pp. 611–620. doi:10.1109/FOCS.2014.71. ISBN 978-1-4799-6517-5. S2CID 14806359.

Read other articles:

كيمياء كهربائيةصنف فرعي من كيمياء فيزيائية جزء من كيمياء فيزيائية يمتهنه electrochemist (en) التاريخ history of electrochemistry (en) تعديل - تعديل مصدري - تعديل ويكي بيانات خلية كهروكيميائية وتتكون من لوحين معدنيين مختلفين في كهرل ، وقد أضيف إليهما لوح ثالث (في الوسط) كلوح مرجعي لإجراء قياسات. الك

 

This article duplicates the scope of other articles, specifically Troupes de la Marine. Please discuss this issue and help introduce a summary style to the article. (October 2020) Compagnies franches de la marineDuring the celebration of the 400th Anniversary of the city of QuebecActive 1622: Compagnie ordinaire de la mer 1666: Régiment de la marine 1673-1761: Compagnies franches de la marine Country Kingdom of FranceAllegianceMarine RoyaleBranch French NavyTypeNaval InfantryRoleSe...

 

Drei silberne byzantinische Weihebrotschalen (Dumbarton Oaks Collection) Der Diskos (δίσκος dískos, deutsch ‚Scheibe, Teller, Schüssel‘) ist ein liturgisches Gefäß, das für die orthodoxe Eucharistiefeier (Göttliche Liturgie) verwendet wird. Er wird auch als Weihebrotschale bezeichnet und entspricht der westkirchlichen Patene. Auf den Diskos wird bei der Gabenbereitung (Proskomidie) das eucharistische Brot gelegt; darüber wird der Asteriskos, ein sternförmiger Bügel,...

Чемпіонат Європи з легкої атлетики в приміщенні 1996Загальна інформаціяМісто СтокгольмКраїни-учасниці 44Кількість атлетів 463Розігрується медалей 26 комплектівВідкриття 8 березняЗакриття 10 березняАрена Глобен← 1994 Париж 1998 Валенсія → Чемпіонат Європи з легкої атлетик

 

Otkrytie Arena Estadio de máxima categoría de la UEFA LocalizaciónPaís  RusiaLocalidad  MoscúCoordenadas 55°49′04″N 37°26′25″E / 55.8178, 37.4403Detalles generalesSuperficie CéspedDimensiones 105 x 68 mCapacidad 45.360 espectadoresPropietario F. C. Spartak Moscú[1]​ConstrucciónCoste 14 mil millones de rublos (430 millones $)[2]​Apertura 5 de septiembre de 2014 (9 años)Equipo diseñadorArquitecto AECOM[3]​Equipo local FC ...

 

French mathematician Anna CartanMarthe and Anna Cartan. Sitting: Madeleine, Marie Curie and Eugénie Cotton.Born(1878-05-15)15 May 1878Died1923 (aged 44–45)Occupation(s)Mathematician, teacher and author Anna Cartan (15 May 1878 – 1923) was a French mathematician, teacher and textbook author who was a student of Marie Curie and Jules Tannery. Early years Cartan was the youngest child born to Anne Florentine Cottaz (1841–1927) and Joseph Antoine Cartan (1837–1917), who was ...

Australian writer (1871–1951) Percival Serle (1871-1951) taken c1949. Author of Dictionary of Australian Biography Percival Serle (18 July 1871 – 16 December 1951) was an Australian biographer and bibliographer. Early life Serle was born in Elsternwick, Victoria to English parents who had migrated as children[1] and for many years worked in a life assurance office before in November 1910 becoming chief clerk and accountant at the University of Melbourne. He married artist Dora Bea...

 

Artikel ini membahas mengenai bangunan, struktur, infrastruktur, atau kawasan terencana yang sedang dibangun atau akan segera selesai. Informasi di halaman ini bisa berubah setiap saat (tidak jarang perubahan yang besar) seiring dengan penyelesaiannya.Stasiun Cicalengka B23C23 Stasiun Cicalengka, setelah dipasang papan nama baru versi 2020.LokasiJalan Stasiun Cicalengka No. 1Panenjoan, Cicalengka, Bandung, Jawa Barat 40395IndonesiaKetinggian+689 mOperatorKAI CommuterLetak dari pangkalkm 182+2...

 

Artikel biografi ini ditulis menyerupai resume atau daftar riwayat hidup (Curriculum Vitae). Tolong bantu perbaiki agar netral dan ensiklopedis. dr. H.Mundjirin ESSp.OGBupati Semarang ke-39Masa jabatan17 Februari 2016[1] – 17 Februari 2021GubernurGanjar PranowoWakilNgesti NugrahaPendahuluSujarwanto Dwiatmoko (Pj.)PenggantiNgesti NugrahaMasa jabatan2010 – 2015GubernurBibit WaluyoGanjar PranowoWakilWarnadiPendahuluSiti Ambar FathonahPenggantiSujarwanto Dwiatmok...

Shopping mall in Pathum Wan, BangkokAmarin Plazaอัมรินทร์พลาซ่าAmarin Plaza in 2021Location496–502 Phloen Chit Rd, Lumphini, Pathum Wan, Bangkok, 10330Coordinates13°44′37″N 100°32′29″E / 13.74361°N 100.54139°E / 13.74361; 100.54139Opening date1985No. of stores and services300+No. of floors5Websitewww.amarinplaza.com Amarin Plaza (Thai: อัมรินทร์พลาซ่า), re-branded since 2023 as Gaysorn Amarin (...

 

Friedrich August Lehner (1824–1895) Friedrich August Lehner (* 10. Oktober 1824 in Geislingen; † 3. Juni 1895 in Stuttgart) war ein deutscher Kunsthistoriker und Museumsdirektor. Inhaltsverzeichnis 1 Leben 2 Schriften 3 Literatur 4 Einzelnachweise Leben Friedrich August Lehner wurde als Sohn eines aus Oeffingen stammenden Forstbeamten in Geislingen geboren. Für den geistlichen Stand bestimmt, besuchte er die Lateinschule in Balingen, dann das Gymnasium in Rottweil und studierte in Tübin...

 

السيد  عبد المطلب الأمين معلومات شخصية الميلاد سنة 1915  دمشق  الوفاة سنة 1974 (58–59 سنة)  بيروت  سبب الوفاة تشمع الكبد  مكان الدفن شقرا  مواطنة الدولة العثمانية (1915–1920) دولة دمشق (1920–1925) الدولة السورية (1925–1930) الجمهورية السورية الأولى (1930–1950) الجمهورية السور...

1993 compilation album by Various ArtistsThe Beavis and Butt-Head ExperienceCompilation album by Various ArtistsReleasedNovember 23, 1993Recorded1993GenreAlternative rock, hard rock, heavy metal, hip-hop, funk rock, grunge, comedyLength59:27LabelGeffenProducerVariousBeavis and Butt-Head chronology The Beavis and Butt-Head Experience(1993) Beavis and Butt-Head Do America: Original Motion Picture Soundtrack(1996) Professional ratingsReview scoresSourceRatingAllMusic[1]Calgary He...

 

Recreational voyages along the Turkish Riviera This article contains content that is written like an advertisement. Please help improve it by removing promotional content and inappropriate external links, and by adding encyclopedic content written from a neutral point of view. (May 2021) (Learn how and when to remove this template message) Traditional two-masted gulet schooner visiting a cove in Gökova as part of the Blue Voyage The Isle of Kekova is among the popular destinations of the Blu...

 

Bahasa Kurmanji Kurmancî, Kurdiya Jorîn Kurmanji Dituturkan di  Iran  Irak  Suriah  Turki Penutur15 juta di Turki (2009)[1]Mungkin 5 juta tempat lain, termasuk 2,8 juta di Irak (2004), 940.000 di Suriah (1993), dan 350.000 di Iran (1988)[1]Rumpun bahasaIndo-Eropa Indo-IranIranIran BaratIran Barat LautKurdiKurmanji DialekToriki Botani Bazidi Bakrani Hakkari Badini Shengali Judikani Jiwanshiri Alburzi Qochani Birjendi Rihayi Sistem penulisanLatin ...

Franciszek Fesser Franciszek Fesser (born 16 August 1885 in Rogi, died 23 October 1956 in Piotrowice (Katowice), was a Polish miner, union activist, Upper Silesian politician, insurgent, and deputy of the Silesian Parliament from 1930 to 1939. Biography During his youth Fesser worked as a miner in the Cleopas mine. In 1918 he joined the Polish Military Organisation in his home district of Opole, where he took part in the 1921 plebiscite in Upper Silesia and the Silesian Uprisings against Germ...

 

This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages) This article is an orphan, as no other articles link to it. Please introduce links to this page from related articles; try the Find link tool for suggestions. (January 2021) This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged ...

 

1970 soundtrack album by various artistsZabriskie PointSoundtrack album by various artistsReleasedMarch 1970GenreRock[1]Length36:22LanguageEnglishLabelMGM (original issue)Rhino (early reissues)Sony Classical/WaterTower Music (current reissues)Pink Floyd soundtracks chronology More(1969) Zabriskie Point(1970) Obscured by Clouds(1972) Professional ratingsReview scoresSourceRatingAllMusic[1]Christgau's Record GuideB−[2] Zabriskie Point is a soundtrack album to t...

Lists of South Korean films by year ← 2005 2006 2007 → Korean Animation Full list . . Pre-1948 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 •00000•00...

 

ABC Daytime Propietario The Walt Disney CompanyOperado por Walt Disney TelevisionPaís  Estados UnidosInicio de transmisiones 1960Sitio web abc.go.com/daytime[editar datos en Wikidata] ABC Daytime (simplificado como ABC-D o ABCD) es un bloque de programación de la cadena de televisión estadounidense American Broadcasting Company que abarca seriales, concursos, y talk shows. Horario 7:00 a. m. – 9:00 a. m. Good Morning America 11:00 a. m. – 12:00 p. m. ...

 

Strategi Solo vs Squad di Free Fire: Cara Menang Mudah!