Techioz Blog

文字列内の同一部分を検出するにはどうすればよいですか?

概要

デコードアルゴリズムで必要な質問を小さな質問に分割しようとします。これはパート I です。

質問:

例1:

s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"

the identical parts are "2010", "-", "1" and "visitor"

例 2:

s1 = "Welcome, John!"
s2 = "Welcome, Peter!"

the identical parts are "Welcome," and "!"

例 3: (「!」の例を明確にするため)

s1 = "Welcome, Sam!"
s2 = "Welcome, Tom!"

the identical parts are "Welcome," and "m!"

Python と Ruby を推奨します。ありがとう

解決策

編集: #1 を含むすべての例で動作するようにこの例を更新しました。

def scan(s1, s2):
    # Find the longest match where s1 starts with s2
    # Returns None if no matches
    l = len(s1)
    while 1:
        if not l:
            return None
        elif s1[:l] == s2[:l]:
            return s1[:l]
        else:
            l -= 1

def contains(s1, s2):
    D = {} # Remove duplicates using a dict
    L1 = s1.split(' ')
    L2 = s2.split(' ')

    # Don't add results which have already 
    # been processed to satisfy example #1!
    DProcessed = {}

    for x in L1:
        yy = 0
        for y in L2:
            if yy in DProcessed:
                yy += 1
                continue

            # Scan from the start to the end of the words
            a = scan(x, y)
            if a: 
                DProcessed[yy] = None
                D[a] = None
                break

            # Scan from the end to the start of the words
            a = scan(x[::-1], y[::-1])
            if a: 
                DProcessed[yy] = None
                D[a[::-1]] = None
                break
            yy += 1

    return list(D.keys())

print contains("12 November 2010 - 1 visitor",
               "6 July 2010 - 100 visitors")
print contains("Welcome, John!",
               "Welcome, Peter!")
print contains("Welcome, Sam!",
               "Welcome, Tom!")

どの出力:

['1', 'visitor', '-', '2010']
['Welcome,', '!']
['Welcome,', 'm!']