Longest common substring

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

Examples

The strings "BADANAT" and "CANADAS" share the maximal-length substrings "ADA" and "ANA".

The picture shows two strings where the problem has multiple solutions. Although the substring occurrences always overlap, it is impossible to obtain a longer common substring by "uniting" them.

The strings "ABABC", "BABCA" and "ABCBA" have only one longest common substring, viz. "ABC" of length 3. Other common substrings are "A", "AB", "B", "BA", "BC" and "C".

  ABABC
    |||
   BABCA
    |||
    ABCBA

Problem definition

Given two strings, of length and of length , find a longest string which is substring of both and .

A generalization is the k-common substring problem. Given the set of strings , where and . Find for each , a longest string which occurs as substring of at least strings.

Algorithms

One can find the lengths and starting positions of the longest common substrings of and in time with the help of a generalized suffix tree. A faster algorithm can be achieved in the word RAM model of computation if the size of the input alphabet is in . In particular, this algorithm runs in time using space.[1] Solving the problem by dynamic programming costs . The solutions to the generalized problem take space and time with dynamic programming and take time with a generalized suffix tree.

Suffix tree

Generalized suffix tree for the strings "ABAB", "BABA" and "ABBA", numbered 0, 1 and 2.

The longest common substrings of a set of strings can be found by building a generalized suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it. The figure on the right is the suffix tree for the strings "ABAB", "BABA" and "ABBA", padded with unique string terminators, to become "ABAB$0", "BABA$1" and "ABBA$2". The nodes representing "A", "B", "AB" and "BA" all have descendant leaves from all of the strings, numbered 0, 1 and 2.

Building the suffix tree takes time (if the size of the alphabet is constant). If the tree is traversed from the bottom up with a bit vector telling which strings are seen below each node, the k-common substring problem can be solved in time. If the suffix tree is prepared for constant time lowest common ancestor retrieval, it can be solved in time.[2]

Dynamic programming

The following pseudocode finds the set of longest common substrings between two strings with dynamic programming:

function LongestCommonSubstring(S[1..r], T[1..n])
    L := array(1..r, 1..n)
    z := 0
    ret := {}

    for i := 1..r
        for j := 1..n
            if S[i] = T[j]
                if i = 1 or j = 1
                    L[i, j] := 1
                else
                    L[i, j] := L[i − 1, j − 1] + 1
                if L[i, j] > z
                    z := L[i, j]
                    ret := {S[i − z + 1..i]}
                else if L[i, j] = z
                    ret := ret ∪ {S[i − z + 1..i]}
            else
                L[i, j] := 0
    return ret

This algorithm runs in time. The array L stores the length of the longest common suffix of the prefixes S[1..i] and T[1..j] which end at position i and j, respectively. The variable z is used to hold the length of the longest common substring found so far. The set ret is used to hold the set of strings which are of length z. The set ret can be saved efficiently by just storing the index i, which is the last character of the longest common substring (of size z) instead of S[i-z+1..i]. Thus all the longest common substrings would be, for each i in ret, S[(ret[i]-z)..(ret[i])].

The following tricks can be used to reduce the memory usage of an implementation:

  • Keep only the last and current row of the DP table to save memory ( instead of )
    • The last and current row can be stored on the same 1D array by traversing the inner loop backwards
  • Store only non-zero values in the rows. This can be done using hash-tables instead of arrays. This is useful for large alphabets.

See also

References

  1. ^ Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub (Aug 2021). Mutzel, Petra; Pagh, Rasmus; Herman, Grzegorz (eds.). Faster Algorithms for Longest Common Substring. European Symposium on Algorithms. Leibniz International Proceedings in Informatics (LIPIcs). Vol. 204. Schloss Dagstuhl. doi:10.4230/LIPIcs.ESA.2021.30. Here: Theorem 1, p.30:2.
  2. ^ Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.

Read other articles:

Canadian local newspaper Brampton GuardianTypeWeekly newspaperOwner(s)Metroland Media Group (Torstar)PublisherDana RobbinsFoundedAugust 13, 1964Political alignmentConservative (as of 2011)[1][2][3]LanguageEnglishHeadquarters3145 Wolfedale Rd.Mississauga, OntarioCirculation345,000Sister newspapersSouth Asian FocusISSN0841-6958OCLC number19105776 Websitewww.bramptonguardian.com The Brampton Guardian was a locally distributed, free, weekly community newspaper in Brampton,...

 

Settima riunione degli scienziati italianiNicola Santangelo, presidente Partecipanti1613 Apertura20 settembre 1845 Chiusura5 ottobre 1845 Stato Due Sicilie LocalitàNapoli VI Riunione VIII Riunione La Settima riunione degli scienziati italiani fu un incontro dei principali studiosi provenienti dai diversi Stati della penisola italiana svoltosi a Napoli nel 1845. Indice 1 Aspetti storici 2 Sezioni 2.1 Agronomia e tecnologia 2.2 Chimica 2.3 Zoologia, anatomia comparata e fisiologia 2.4 Med...

 

Coordenadas: 42° 43' N 12° 50' E Scheggino    Comuna   Localização SchegginoLocalização de Scheggino na Itália Coordenadas 42° 43' N 12° 50' E Região Úmbria Província Perugia Características geográficas Área total 35 km² População total 467 hab. Densidade 13,3 hab./km² Altitude 282 m Outros dados Comunas limítrofes Ferentillo (TR), Monteleone di Spoleto, Sant'Anatolia di Narco, Spoleto Código ISTAT 054047 Código cad...

هذه المقالة يتيمة إذ تصل إليها مقالات أخرى قليلة جدًا. فضلًا، ساعد بإضافة وصلة إليها في مقالات متعلقة بها. (نوفمبر 2013) نشأ علم النفس المجتمعي في إطار علم النفس الاجتماعي. ويؤكد هذا العلم على القوة الشاملة للبيئات الاجتماعية والمؤسسية والثقافية، بالإضافة إلى تشديده على دراس...

 

Untuk bekas akademi Konfusius, lihat Sungkyunkwan. Untuk stasiun kereta bawah tanah menuju Kampus Ilmu Pengetahuan Alam, lihat Stasiun Universitas Sungkyunkwan. Untuk universitas di Korea Utara, lihat Universitas Koryo Songgyungwan. Universitas Sungkyunkwan成均館大學校성균관대학교[1]Moto仁義禮智 인의예지Moto dalam bahasa InggrisHumanity, Righteousness, Propriety, Wisdom[2]JenisSwastaDidirikan1398AfiliasiKonfusianismePresidenChung Kyu-sangStaf akademik5...

 

ألعاب بارالمبية شتوية 1998 ناغانو، اليابان الدول المشاركة 32 الرياضيون المشاركون 571 المسابقات 4، في 122 رياضة انطلاق الألعاب 5 مارس المفتتح الرسمي ناروهيتو الملعب إم ويف الاختتام 14 مارس الموقع الرسمي الموقع الرسمي  تعديل مصدري - تعديل   الألعاب البارالمبية الشتوية 1998 هي ال...

Airport in Budapest, Hungary This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Budapest Ferenc Liszt International Airport – news · newspapers · books · scholar · JSTOR (September 2016) (Learn how and when to remove this template message) Budapest Ferenc Liszt International AirportBudapest Liszt Ferenc Nemzetk...

 

Sân bay Van Ferit Melen Mã IATAVAN Mã ICAOLTCI Thông tin chungKiểu sân baycôngCơ quan quản lýCục quan quản lý sân bay chính phủ Thổ Nhĩ KỳVị tríVanĐộ cao5,474 ft / 1,668 mTọa độ38°27′0″B 43°19′0″Đ / 38,45°B 43,31667°Đ / 38.45000; 43.31667Đường băng Hướng Chiều dài Bề mặt m ft 2.740 8.990 nhựa đường Sân bay Van Ferit Melen (IATA: VAN, ICAO: LTCI) là một sân bay ở Van, thành phố v...

 

Liga Inggris Divisi PertamaMusim1991–92JuaraLeeds United (ke-3, Liga Inggris)DegradasiLuton Town Notts CountyWest Ham UnitedLiga Champions UEFA 1992–93Leeds UnitedJuara Piala FA Piala Winners UEFALiverpool (gelar ke-5)Liga Eropa UEFAManchester United Sheffield WednesdayJumlah pertandingan462Pencetak golterbanyakIan Wright (Crystal Palace / Arsenal), 29 [1]Kemenangan kandangterbesarArsenal – Sheffield Wednesday 7–1 (15 Feb 1992)Kemenangan tandangterbesarSheffield Wednesday – ...

This article needs to be updated. Please help update this article to reflect recent events or newly available information. (July 2017) Part of a series on theUN Security Councilresolutions Permanent members China France RussiaUnited Kingdom United States Non-permanent members Lists of resolutions Resolutions 1 to 1000           (1946–1995) 001 to 0100 (1946–1953) 101 to 0200 (1953–1965) 201 to 0300 (1965–1971) 301 to 0400 (1971–1976)...

 

Cet article dresse une liste des espaces verts de Paris, en France. Statistiques Plan des espaces verts de Paris. En mai 2008, Paris comptait 450 espaces verts, répartis ainsi : 2 bois, totalisant 1 841 ha, soit 17,5 % de la superficie totale de la commune. Le bois de Vincennes comprend en outre l'arboretum de l'école du Breuil, le jardin tropical de Paris et le parc floral de Paris, et le bois de Boulogne le parc de Bagatelle, le jardin d'acclimatation, le jardin des serres ...

 

 Nota: Não confundir com Caio Semprônio Tuditano, pretor da Hispânia Citerior morto na Revolta Ibérica de 197-195 a.C.. Caio Semprônio Tuditano Cônsul da República Romana Consulado 129 a.C. Caio Semprônio Tuditano (em latim: Caius Sempronius Tuditanus) foi um político da gente Semprônia da República Romana eleito cônsul em 129 a.C. com Mânio Aquílio. Família Tuditano era membro da família plebeia dos Semprônios. Seu pai tinha o mesmo nome e era um senador que participou ...

德國國防軍陸軍第526後備師存在時期1944年9月至1945年5月國家或地區 納粹德國部門陸軍種類步兵規模師 德國國防軍陸軍第526後備師(德語:526. Reserve-Division)是納粹德國國防軍陸軍的一個步兵師。該師於1944年9月改编自526號亞琛師。 該師改編後,一直在西線作戰。 該師最後於1945年5月解散。 參考文獻 Mitcham, Samuel W. German Order of Battle. Volume Two: 291st-999th Infantry Divisions, Named I...

 

«Александр Обухов» Александр Обухов «Александр Обухов» у причала в бухте Золотой Рог  СССР Класс и тип судна рыбоконсервная плавучая база класса «Андрей Захаров» Порт приписки Владивосток Номер ИМО 5400396 Организация Дальморепродукт Изготовитель Адмиралтейский заво...

 

Suburb of the Thessaloniki Urban Area, Greece Place in Macedonia, GreeceAmpelokipoi ΑμπελόκηποιAmpelokipoiLocation within the regional unit Coordinates: 40°39′N 22°55′E / 40.650°N 22.917°E / 40.650; 22.917CountryGreeceGeographic region MacedoniaAdministrative regionCentral MacedoniaRegional unitThessalonikiMunicipalityAmpelokipoi-Menemeni • Municipal unit1.803 km2 (0.696 sq mi)Elevation5 m (16 ft)Population&#...

中央军委训练管理部军事体育训练中心 主要领导 主任  大校 政治委员 曹保民 大校 机构概况 上级机构 中央军事委员会训练管理部 机构类型 中央军委训练管理部直属单位 联络方式 总部  实际地址 北京市 机构沿革 成立时间 2017年 对应机构 中央军委训练管理部军事体育训练中心,隶属中央军委训练管理部,负责军事体育训练工作。 沿革 八一体工大队 中国人民解放...

 

Seven the Hard WayÁlbum de Pat BenatarPublicación Noviembre de 1985Grabación 1985Género(s) RockDuración 37:05Discográfica Chrysalis RecordsProductor(es) Neil Giraldo, Mike Chapman & William WittmanCalificaciones profesionales Allmusic enlace Rolling Stone enlace Cronología de Pat Benatar Tropico(1984) Seven the Hard Way Wide Awake in Dreamland(1988) Sencillos de Seven the Hard Way «Invincible»Publicado: 1985 «Sex as a Weapon»Publicado: 1985 «Le Bel Age»Publicado: 1986 [e...

 

2019 COSAFA CupTournament detailsHost countrySouth AfricaDates25 May–8 June[1]Teams13 (from 2 sub-confederations)Venue(s)3 (in 3 host cities)Final positionsChampions Zambia (5th title)Runners-up BotswanaThird place ZimbabweFourth place LesothoTournament statisticsMatches played20Goals scored48 (2.4 per match)Top scorer(s) Gabadinho Mhango Gerald Phiri Jr. Ashley Nazira(3 goals each)← 2018 2020 2021 → International football comp...

German resistance member This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Albrecht Mertz von Quirnheim – news · newspapers · books · scholar · JSTOR (July 2016) (Learn how and when to remove this template message) Albrecht Mertz von QuirnheimAlbrecht Mertz von QuirnheimBorn(1905-03-25)25 March 1905Munich, Kin...

 

Yakage 矢掛町Kota kecil BenderaLokasi Yakage di Prefektur OkayamaNegara JepangWilayahChūgokuPrefektur OkayamaDistrikOdaLuas • Total90,6 km2 (350 sq mi)Populasi (Oktober 1, 2015) • Total14.201 • Kepadatan156,7/km2 (4,060/sq mi)Zona waktuUTC+9 (JST)Kode pos714-1297Simbol • PohonPinus• BungaPrunus serrulata• BurungHorornis diphoneAlamat3018 Yakage, Yakage-chō, Oda-gun, Okayama-kenSitus webSitus web resmi Yakag...

 

Strategi Solo vs Squad di Free Fire: Cara Menang Mudah!