Introduction to Algorithms: Solutions to Chapter 32


32 String Matching

32.1-2

Suppose that all characters in the pattern P are different. Show how to accelerate NAIVE-STRING-MATCHER to run in time O(n) on an n-character text T.

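Naive-Search(T, P)
n = T.length
m = P.length
s = 0
while s ≤ n - m
     j = 0
     while j < m and T[s+j+1] == P[j+1]
            j = j + 1
     if j == m
            print "Pattern occurs with shift" s
     s = s + max(j, 1)

Because the characters of P are all distinct, a mismatch after j matched characters rules out every shift from s + 1 to s + j - 1, so the search can resume at shift s + j. Each character of T is therefore compared at most twice (once as a mismatch and once as the first character of the next window), and the running time is O(n).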
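As a rough illustration of the skipping idea, here is a minimal C sketch. The function name `match_distinct` and the comparison counter are hypothetical additions for this note, and the code assumes (without checking) that every character of the pattern is distinct.

```c
#include <stddef.h>
#include <string.h>

/* Counts occurrences of p in t, ASSUMING all characters of p are distinct.
 * *comparisons receives the number of character comparisons performed. */
size_t match_distinct(const char *t, const char *p, size_t *comparisons)
{
    size_t n = strlen(t), m = strlen(p);
    size_t s = 0, found = 0;

    *comparisons = 0;
    if (m == 0 || m > n)
        return 0;
    while (s + m <= n) {
        size_t j = 0;
        while (j < m) {
            (*comparisons)++;
            if (t[s + j] != p[j])
                break;
            j++;
        }
        if (j == m)
            found++;
        /* The j matched text characters equal p[0..j-1]; since these are
         * distinct, none of them after the first equals p[0], so the next
         * candidate shift is s + j (or s + 1 when nothing matched). */
        s += (j > 0) ? j : 1;
    }
    return found;
}
```

Calling `match_distinct("abcabdabc", "abc", &c)` should report two occurrences while keeping the comparison count below 2n.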

32.1-3

Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Σ_d = {0, 1, ..., d-1}, where d ≥ 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is

(n - m + 1) (1 - d^(-m)) / (1 - d^(-1)) ≤ 2(n - m + 1)

over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

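For a fixed shift, the probability Pr(i) that the implicit loop in line 4 performs exactly i comparisons is:

  • Pr(i) = (1/d^(i-1)) (1 - 1/d), if i < m
  • Pr(m) = (1/d^(m-1)) (1 - 1/d) + 1/d^m, if i = m

So the expected number of comparisons made by line 4 in one iteration of the for loop is:

  1(1 - 1/d) + 2(1/d)(1 - 1/d) + 3(1/d^2)(1 - 1/d) + ... + (m-1)(1/d^(m-2))(1 - 1/d) + m(1/d^(m-1))(1 - 1/d) + m(1/d^m)
= 1 - 1/d + 2/d - 2/d^2 + 3/d^2 - 3/d^3 + ... + m/d^(m-1) - m/d^m + m/d^m
= 1 + 1/d + 1/d^2 + ... + 1/d^(m-1)
= (1 - 1/d^m) / (1 - 1/d)
≤ 2

The last step holds because d ≥ 2 gives 1 - 1/d ≥ 1/2.

Hence the expected total number of comparisons made by line 4 is (n - m + 1) (1 - 1/d^m) / (1 - 1/d) ≤ 2(n - m + 1).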
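The telescoping sum can be checked numerically. The sketch below is only a sanity check under the same independence assumption; the function names `expected_per_shift` and `closed_form` are made up for this illustration.

```c
/* Expected comparisons for one shift, summed term by term:
 * sum over i of i * Pr(exactly i comparisons). */
double expected_per_shift(int d, int m)
{
    double sum = 0.0;
    double prefix = 1.0;  /* probability that the first i-1 characters match */

    for (int i = 1; i < m; i++) {
        sum += i * prefix * (1.0 - 1.0 / d);  /* mismatch at character i */
        prefix /= d;
    }
    /* i = m: either a mismatch at the last character or a full match */
    sum += m * (prefix * (1.0 - 1.0 / d) + prefix / d);
    return sum;
}

/* Closed form (1 - d^(-m)) / (1 - d^(-1)). */
double closed_form(int d, int m)
{
    double inv_pow = 1.0;  /* becomes d^(-m) */

    for (int i = 0; i < m; i++)
        inv_pow /= d;
    return (1.0 - inv_pow) / (1.0 - 1.0 / d);
}
```

For d = 2 and m = 5 both functions give 1.9375, which is below the bound of 2.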

32.1-4

Suppose we allow the pattern P to contain occurrences of a gap character ◊ that can match an arbitrary string of characters (even one of zero length). For example, the pattern ab◊ba◊c occurs in the text cabccbacbacab as

c [ab] cc [ba] cba [c] ab

and as

c [ab] ccbac [ba] [c] ab

(brackets mark the literal pieces of the pattern; the text between them is matched by the gap characters). Note that the gap character may occur an arbitrary number of times in the pattern but not at all in the text. Give a polynomial-time algorithm to determine whether such a pattern P occurs in a given text T, and analyze the running time of your algorithm.



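The exercise only asks whether pattern P occurs in the text, which simplifies the problem considerably. We can treat the gap characters in P as separators and split P into gap-free substrings {P1, P2, ...}, then search for these substrings in T one after another, making sure that each P(i+1) is found after the end of Pi. The pseudocode is as follows: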

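Gap-Naive-Search(T, P)
n = T.length
m = P.length
i = 1      // next unread position in P
j = 1      // next unused position in T
while true
   // skip the gap characters in front of the next substring
   while i ≤ m and P[i] == gap
       i = i + 1
   if i > m
       return true
   // find the next gap-free substring P[i..i+k-1]
   k = 0
   while i + k ≤ m and P[i+k] != gap
       k = k + 1
   s = Naive-Search(T[j..n], P[i..i+k-1])
   if s == -1
       return false
   i = i + k
   j = j + s + k      // continue after the leftmost match

Naive-Search(T, P)     // returns the smallest valid shift, or -1
n = T.length
m = P.length
for s = 0 to n - m
     j = 0
     while j < m and T[s+j+1] == P[j+1]
            j = j + 1
     if j == m
            return s
return -1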

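Running time: over the whole execution, the two while loops nested in the outer loop together scan the pattern P once, so they cost O(m) in total. Each call Naive-Search(T[j..n], P[i..i+k-1]) on a piece of length k_i makes at most (n - j - k_i + 2) k_i comparisons, so all the calls together cost at most Σ (n - j - k_i + 2) k_i ≤ n Σ k_i ≤ nm, which is O(mn). The total running time is therefore O(m) + O(mn) = O(mn).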
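The greedy, left-to-right search for the gap-free pieces can be sketched in C roughly as follows. This is an illustrative sketch, not a tuned implementation: the names `gap_match` and `naive_search` are hypothetical, and the gap character is passed in as a parameter because the text does not fix an ASCII representation for it.

```c
#include <stddef.h>
#include <string.h>

/* Index of the first occurrence of p[0..k-1] in t[0..n-1], or -1. */
static long naive_search(const char *t, size_t n, const char *p, size_t k)
{
    for (size_t s = 0; s + k <= n; s++)
        if (memcmp(t + s, p, k) == 0)
            return (long)s;
    return -1;
}

/* Returns 1 if pattern p, in which `gap` matches any string, occurs in t. */
int gap_match(const char *t, const char *p, char gap)
{
    size_t n = strlen(t), m = strlen(p);
    size_t i = 0;  /* next unread position in p */
    size_t j = 0;  /* next unused position in t */

    for (;;) {
        while (i < m && p[i] == gap)  /* skip gap characters */
            i++;
        if (i == m)
            return 1;                 /* every piece has been placed */
        size_t k = 0;                 /* length of the next gap-free piece */
        while (i + k < m && p[i + k] != gap)
            k++;
        long s = naive_search(t + j, n - j, p + i, k);
        if (s < 0)
            return 0;
        i += k;
        j += (size_t)s + k;           /* continue after the leftmost match */
    }
}
```

With `'*'` standing in for the gap character, `gap_match("cabccbacbacab", "ab*ba*c", '*')` returns 1. Taking the leftmost match of each piece is safe because it leaves the longest possible remainder of the text for the later pieces.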

32.2-1

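Working modulo q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 3141592653589793 when looking for the pattern P = 26?

The following program computes the answer: 1 valid hit and 3 spurious hits (the windows 15, 59 and 92, which like 26 are congruent to 4 modulo 11).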

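#include <stdio.h>

int main(void) {
    const char T[] = "3141592653589793";  /* the text */
    int n = 16;                           /* text length */
    int q = 11;                           /* modulus */
    int d = 10 % q;                       /* radix, reduced modulo q */
    int p = (2 * d + 6) % q;              /* hash of the pattern "26" */
    int valid = 0;
    int spurious = 0;
    /* a window of length 2 starts at each shift s = 0 .. n - 2 */
    for (int s = 0; s + 1 < n; s++) {
        int t = ((T[s] - '0') * d + (T[s + 1] - '0')) % q;
        if (t == p) {
            if (T[s] == '2' && T[s + 1] == '6')
                valid++;
            else
                spurious++;
        }
    }
    printf("valid hit:%d,spurious hit:%d\n", valid, spurious);
    return 0;
}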

32.2-2

How would you extend the Rabin-Karp method to the problem of searching a text string for an occurrence of any one of a given set of k patterns? Start by assuming that all k patterns have the same length. Then generalize your solution to allow the patterns to have different lengths.

Matching k patterns of the same length

Rabin-Karp-Search(T[1..n], P[1..k][1..m], d)
q = a prime larger than m
c = d^(m-1) mod q     // run a loop multiplying by d mod q
for j = 1 to k
    fp[j] = 0
ft = 0
for i = 1 to m        // preprocessing
    ft = (d*ft + T[i]) mod q
    for j = 1 to k
       fp[j] = (d*fp[j] + P[j][i]) mod q
for s = 0 to n - m    // matching
    for j = 1 to k
       if fp[j] == ft  // run a loop to compare strings
           if P[j][1..m] == T[s+1..s+m]
               print "Pattern P[j] occurs with shift" s
    if s < n - m
       ft = ((ft - T[s+1]*c)*d + T[s+m+1]) mod q

Worst-case running time: O(k) + O(mk) + O(km(n-m+1)) = O(km(n-m+1))
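For the equal-length case, a compact C sketch might look like this. The radix 256, the modulus 8191 (a prime), the cap of 16 patterns and the function name `rk_multi_count` are all illustrative assumptions, not part of the original solution.

```c
#include <stddef.h>
#include <string.h>

/* Counts (shift, pattern) pairs such that pats[j] occurs in t at that shift.
 * ASSUMES all k patterns have the same length as pats[0] and k <= 16. */
int rk_multi_count(const char *t, const char *pats[], int k)
{
    enum { D = 256, Q = 8191, MAXK = 16 };  /* illustrative radix and modulus */
    int n = (int)strlen(t);
    int m = (int)strlen(pats[0]);
    long fp[MAXK];
    long ft = 0, c = 1;
    int hits = 0;

    if (k > MAXK || m == 0 || m > n)
        return 0;
    for (int i = 1; i < m; i++)
        c = (c * D) % Q;                    /* c = D^(m-1) mod Q */
    for (int j = 0; j < k; j++)
        fp[j] = 0;
    for (int i = 0; i < m; i++) {           /* preprocessing */
        ft = (D * ft + (unsigned char)t[i]) % Q;
        for (int j = 0; j < k; j++)
            fp[j] = (D * fp[j] + (unsigned char)pats[j][i]) % Q;
    }
    for (int s = 0; s <= n - m; s++) {      /* matching */
        for (int j = 0; j < k; j++)
            if (fp[j] == ft && memcmp(pats[j], t + s, (size_t)m) == 0)
                hits++;
        if (s < n - m) {
            /* drop t[s], append t[s+m]; adding Q keeps the value non-negative */
            long lead = ((unsigned char)t[s] * c) % Q;
            ft = ((ft - lead + Q) * D + (unsigned char)t[s + m]) % Q;
        }
    }
    return hits;
}
```

Keeping the k pattern hashes in a hash table instead of scanning them linearly would reduce the per-shift cost from O(k) to expected O(1); the linear scan is kept here to mirror the pseudocode.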

Matching k patterns of different lengths

Rabin-Karp-Search(T[1..n], P[1..k], d)
q = a prime larger than every pattern length
for i = 1 to k
    m[i] = P[i].length
    c[i] = d^(m[i]-1) mod q   // run a loop multiplying by d mod q
    fp[i] = 0
    ft[i] = 0
for i = 1 to k                // preprocessing
    for j = 1 to m[i]
       ft[i] = (d*ft[i] + T[j]) mod q
       fp[i] = (d*fp[i] + P[i][j]) mod q
for s = 0 to n - min{m[1..k]} // matching
    for j = 1 to k
       if s ≤ n - m[j] and fp[j] == ft[j]   // run a loop to compare strings
           if P[j][1..m[j]] == T[s+1..s+m[j]]
               print "Pattern P[j] occurs with shift" s
    for i = 1 to k
       if s < n - m[i]
           ft[i] = ((ft[i] - T[s+1]*c[i])*d + T[s+m[i]+1]) mod q

Worst-case running time, with M = max{m[1..k]} and m_min = min{m[1..k]}: O(k) + O(kM) + O(kM(n - m_min + 1)) = O(kM(n - m_min + 1))

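32.2-3 (solved with a hash function)

Show how to extend the Rabin-Karp method to handle the problem of looking for a given m * m pattern in an n * n array of characters. (The pattern may be shifted vertically and horizontally, but it may not be rotated.)

A hash function can be used here to represent the whole matrix.

The overall idea is to split an m*m block into m rows and to hash each row the way Rabin-Karp-Search does. For a fixed vertical shift, the m row hashes are updated incrementally as the window slides horizontally; for each new vertical shift they are recomputed from scratch.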

Rabin-Karp-Search(T[1..n][1..n], P[1..m][1..m], d)
q = a prime larger than m
c = d^(m-1) mod q      // run a loop multiplying by d mod q
for i = 1 to m         // row hashes of the pattern, computed once
   fp[i] = 0
   for j = 1 to m
      fp[i] = (d*fp[i] + P[i][j]) mod q
for s1 = 0 to n - m    // vertical shift
   for i = 1 to m      // preprocessing: hash the first m columns of rows s1+1..s1+m
      ft[i] = 0
      for j = 1 to m
         ft[i] = (d*ft[i] + T[s1+i][j]) mod q
   for s2 = 0 to n - m // horizontal shift
       if fp[1..m] == ft[1..m]   // run a loop to compare the m row hashes
          if P[1..m][1..m] == T[s1+1..s1+m][s2+1..s2+m]
             print "Pattern occurs with shift" s1 s2
       if s2 < n - m
          for i = 1 to m
             ft[i] = ((ft[i] - T[s1+i][s2+1]*c)*d + T[s1+i][s2+m+1]) mod q

Worst-case running time: O(m^2) + (n-m+1) * { O(m^2) + (n-m+1) * (O(m) + O(m^2) + O(m)) } = O(m^2 (n-m+1)^2)

32.2-4 (not completed)

32.3-1

Construct the string-matching automaton for the pattern P = aabab and illustrate its operation on the text string T = aaababaabaababaab.

the transition function

q a b
0 1 0
1 2 0
2 2 3
3 4 0
4 2 5
5 1 0

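operation (the automaton starts in state 0; each state is written under the character just read)

  a a a b a b a a b a a b a b a a b
0 1 2 2 3 4 5 1 2 3 4 2 3 4 5 1 2 3

The accepting state 5 is reached after reading characters 6 and 14, so the pattern occurs with shifts 1 and 9.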
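The trace can be reproduced with a few lines of C. In this sketch the table is copied from the transition function above; the names `DELTA` and `run_automaton` are hypothetical, and the code assumes the text contains only the characters a and b.

```c
#include <string.h>

/* Transition table for P = aabab: column 0 is 'a', column 1 is 'b'. */
static const int DELTA[6][2] = {
    {1, 0}, {2, 0}, {2, 3}, {4, 0}, {2, 5}, {1, 0}
};

/* Runs the automaton over t and stores the shifts at which the accepting
 * state 5 is reached. Returns the number of shifts stored. */
int run_automaton(const char *t, int shifts[], int max_shifts)
{
    int q = 0, found = 0;
    int n = (int)strlen(t);

    for (int i = 0; i < n; i++) {
        q = DELTA[q][t[i] == 'b'];
        if (q == 5 && found < max_shifts)
            shifts[found++] = i + 1 - 5;  /* i + 1 characters read so far */
    }
    return found;
}
```

On T = aaababaabaababaab this stores the shifts 1 and 9.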

32.3-2

Draw a state-transition diagram for a string-matching automaton for the pattern ababbabbababbababbabb over the alphabet ∑ = {a,b}.

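The following code computes the transition function, from which the diagram can be drawn:

#include <stdio.h>

int main(void) {
    /* P and Q are padded with a leading space so that indices start at 1 */
    const char P[] = " ababbabbababbababbabb";
    const char Q[] = " ab";          /* the alphabet */
    int A[22][3];                    /* A[q][p] = delta(q, Q[p]) */
    int m = 21;                      /* pattern length */
    int n = 2;                       /* alphabet size */
    for (int q = 0; q <= m; q++) {
        for (int p = 1; p <= n; p++) {
            int sum = 0;
            /* try k = min(m, q + 1) down to 1 */
            int k = ((m + 1) < (q + 2)) ? m + 1 : q + 2;
            do {
                k--;
                /* compare P[1..k] with P[q-k+2..q] followed by Q[p] */
                int i = 1;
                while (i <= k - 1 && P[i] == P[q - k + i + 1])
                    i++;
                if (i == k && P[k] == Q[p]) {
                    sum = k;
                    break;
                }
            } while (k > 1);
            A[q][p] = sum;
        }
    }
    for (int q = 0; q <= m; q++)
        printf("%2d: a -> %2d, b -> %2d\n", q, A[q][1], A[q][2]);
    return 0;
}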

32.3-3

We call a pattern P nonoverlappable if P_k ⊐ P_q (P_k is a suffix of P_q) implies k = 0 or k = q. Describe the state-transition diagram of the string-matching automaton for a nonoverlappable pattern.

The state-transition diagram looks like a straight line, with all other edges going back to either the initial vertex (if the character read is not the first letter of the pattern) or the second vertex (if it is the first letter of the pattern). If an edge were to go back to any later state, that would mean that some suffix of what we had constructed so far (which was a prefix of P) was a prefix of the copy of P that we are next trying to find.

32.3-4

Given two patterns P and P', describe how to construct a finite automaton that determines all occurrences of either pattern. Try to minimize the number of states in your automaton.

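At first I did not really understand what this exercise asks. If it only required finding all occurrences of one of the two patterns, the finite automaton for that single pattern would already solve it, and the exercise would be pointless.

Here is a solution found online:

We can construct the automaton as follows: Let P_k be the longest prefix which both P and P' have in common. Create the automaton F for P_k as usual. Add an arrow labeled P[k+1] from state k to a chain of states k+1, k+2, ..., |P|, and draw the appropriate arrows, so that δ(q, a) = σ(P_q a). Next, add an arrow labeled P'[k+1] from state k to a chain of states (k+1)', (k+2)', ..., |P'|'. Draw the appropriate arrows so that δ(q, a) = σ(P'_q a).

If state k moves to state k+1 on character a (matching P) and to state (k+1)' on character b (matching P'), then the target that the automaton for P alone would give state k on b is lost, so occurrences of P can be missed; the same holds for P'. For example, with P = abaab, P' = ababa and T = ababaab, this automaton finds [ababa]ab but misses ab[abaab].

This shows the characteristic behavior of this automaton: if the text contains several substrings that match P or P' and these substrings overlap, only one of them is recognized; if it contains exactly one substring that matches P or P', that substring is always recognized. This is probably what the exercise intends.

Some solutions also try to merge the suffixes of P and P', which is hard to get right. For the suffix states, the transitions taken on a successful match coincide, but the transitions taken on a failed match need not. The details matter here: if a failure can lead back into the part where P and P' differ, which of the two branches should it return to? Either choice can miss an occurrence of one pattern. (In this case the miss is not caused by overlap; it may well be a standalone occurrence, which is unacceptable.)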

32.3-5

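Given a pattern P containing gap characters (see Exercise 32.1-4), show how to build a finite automaton that can find an occurrence of P in a text T in O(n) matching time, where n = |T|.

This uses the same idea as in 32.1-4 (see the approach described there). Building the finite automaton is even simpler: build a string-matching automaton for each gap-free piece of P, then chain these automata in their original order.

The chaining works as follows: for two consecutive pieces A and B, the accepting state of A is merged with the start state of B, so that after A has been found the automaton behaves like the automaton for B and never returns to the states of A. The accepting state of the last piece is the accepting state of the whole automaton.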

32.4-1

Compute the prefix function π for the pattern ababbabbabbababbabb.

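q     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
π[q]  0 0 1 2 0 1 2 0 1 2  0  1  2  3  4  5  6  7  8

Three approaches are possible here: apply the definition of π directly, trace the algorithm by hand, or write the algorithm as a program and run it.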

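#include <stdio.h>

int main(void) {
    /* padded with a leading space so that indices start at 1 */
    const char P[] = " ababbabbabbababbabb";
    int A[20];                 /* A[q] = pi[q] */
    int m = 19;                /* pattern length */
    int k = 0;
    A[1] = 0;
    for (int q = 2; q <= m; q++) {
        while (k > 0 && P[k + 1] != P[q])
            k = A[k];
        if (P[k + 1] == P[q])
            k++;
        A[q] = k;
    }
    for (int q = 1; q <= m; q++)
        printf("%d ", A[q]);
    printf("\n");
    return 0;
}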

32.4-2

Give an upper bound on the size of π*(q) as a function of q. Give an example to show that your bound is tight.

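The upper bound is |π*(q)| ≤ q: the elements π(q) > π(π(q)) > ... > 0 of π*(q) are distinct integers in {0, 1, ..., q-1}.

The bound is tight for P = aaaaaaaaaa...a of length m. Here π(q) = q - 1 for every q, so
π*(q) = {0, 1, 2, ..., q-1}, and in particular π*(m) = {0, 1, 2, 3, ..., m-1}.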

32.4-3

Explain how to determine the occurrences of pattern P in the text T by examining the π function for the string PT (the string of length m+n that is the concatenation of P and T).

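{q - 2m | m ∈ π*(q) and q ≥ 2m}, where m = P.length and q ranges over the positions of PT. P is a suffix of (PT)_q exactly when m ∈ π*(q); the occurrence then ends at position q - m of T, which corresponds to shift q - 2m.

The condition has to be m ∈ π*(q), not π(q) ≥ m or π(q) = m. Neither alternative yields exactly the occurrences of P in T, for example:

  • For π(q) = m: P = aa, T = aaaaaaa. Here π(q) = q - 1, so the test succeeds only at q = 3 and every occurrence in T is missed.
  • For π(q) ≥ m: P = ab, T = ababababab. Here π(q) = q - 2 ≥ m for every q ≥ 4, so the test also accepts the odd positions q, where (PT)_q ends in a and P is not a suffix.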
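A direct C sketch of the m ∈ π*(q) test on the concatenation PT is given below. The function name `occurrences_via_pt` and the output-array interface are assumptions made for this illustration; the membership test simply walks down the π chain, so this sketch does not claim linear running time.

```c
#include <stdlib.h>
#include <string.h>

/* Stores up to max_shifts shifts at which p occurs in t, found by examining
 * the prefix function of the concatenation PT. Returns the number stored,
 * or -1 if memory allocation fails. */
int occurrences_via_pt(const char *p, const char *t, int shifts[], int max_shifts)
{
    int m = (int)strlen(p), n = (int)strlen(t);
    int len = m + n, found = 0;

    if (m == 0)
        return 0;
    char *pt = malloc((size_t)len + 2);
    int *pi = malloc(((size_t)len + 1) * sizeof *pi);
    if (pt == NULL || pi == NULL) {
        free(pt);
        free(pi);
        return -1;
    }
    pt[0] = ' ';                          /* padding: indices start at 1 */
    memcpy(pt + 1, p, (size_t)m);
    memcpy(pt + 1 + m, t, (size_t)n);
    pt[len + 1] = '\0';

    pi[1] = 0;
    for (int q = 2, k = 0; q <= len; q++) {   /* prefix function of PT */
        while (k > 0 && pt[k + 1] != pt[q])
            k = pi[k];
        if (pt[k + 1] == pt[q])
            k++;
        pi[q] = k;
    }
    for (int q = 2 * m; q <= len; q++) {
        int k = pi[q];
        while (k > m)                     /* walk down pi*(q) towards m */
            k = pi[k];
        if (k == m && found < max_shifts)
            shifts[found++] = q - 2 * m;
    }
    free(pt);
    free(pi);
    return found;
}
```

On the two examples above it reports the shifts 0 to 5 for P = aa, T = aaaaaaa and the shifts 0, 2, 4, 6, 8 for P = ab, T = ababababab.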

32.4-4

Use an aggregate analysis to show that the running time of KMP-MATCHER is O(n).

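The KMP algorithm:

KMP-Matcher(T, P)
1. n = T.length
2. m = P.length
3. π = COMPUTE-PREFIX-FUNCTION(P)
4. q = 0
5. for i = 1 to n   /* scan the text from left to right */
6.      while q > 0 and P[q+1] != T[i]
7.              q = π[q]
8.      if P[q+1] == T[i]
9.              q = q + 1
10.     if q == m
11.             print "Pattern occurs with shift" i - m
12.             q = π[q]

  • Before the for loop of line 5 starts, q = 0. Inside the loop, only the if statement of lines 8-9 can increase q, and it increases q by 1. So q grows at most once per iteration, at most n times in total, and never exceeds n.
  • From COMPUTE-PREFIX-FUNCTION(P), whether one looks at the code (the running-time proof) or at its meaning (the correctness proof), π(q) < q. Hence every execution of line 7 inside the while loop strictly decreases q, and the assignment in line 12 decreases q as well.
  • q is never negative. It follows that q can decrease at most n times in total, so over all iterations of the for loop of lines 5-12 the body of the while loop of lines 6-7 executes at most n times.

The running time of the matching loop is therefore O(n) + O(n) = O(n).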
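The aggregate bound can be observed directly by counting how often the while-loop body runs. The following C sketch is an illustration only; the function name `kmp_count` and the step counter are hypothetical instrumentation added here, and C strings are 0-indexed, so P[q+1] in the pseudocode corresponds to p[q].

```c
#include <stdlib.h>
#include <string.h>

/* Returns the number of occurrences of p in t, or -1 on allocation failure.
 * *while_steps receives how many times the while-loop body executed. */
int kmp_count(const char *t, const char *p, long *while_steps)
{
    int n = (int)strlen(t), m = (int)strlen(p), found = 0;

    *while_steps = 0;
    if (m == 0 || m > n)
        return 0;
    int *pi = malloc(((size_t)m + 1) * sizeof *pi);
    if (pi == NULL)
        return -1;
    pi[0] = 0;
    pi[1] = 0;
    for (int q = 2, k = 0; q <= m; q++) {  /* prefix function, pi[1..m] */
        while (k > 0 && p[k] != p[q - 1])
            k = pi[k];
        if (p[k] == p[q - 1])
            k++;
        pi[q] = k;
    }
    for (int i = 0, q = 0; i < n; i++) {   /* scan the text left to right */
        while (q > 0 && p[q] != t[i]) {
            q = pi[q];
            (*while_steps)++;              /* each step strictly decreases q */
        }
        if (p[q] == t[i])
            q++;
        if (q == m) {
            found++;
            q = pi[q];
        }
    }
    free(pi);
    return found;
}
```

For any input the counter stays at or below n, in line with the argument that q can only decrease as often as it has increased.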

32.4-5 (not completed)

Use a potential function to show that the running time of KMP-MATCHER is Θ(n).

32.4-6

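Show how to improve KMP-MATCHER by replacing the occurrence of π in line 7 (but not line 12) by π', where π' is defined recursively for q = 1, 2, ..., m-1 by the equation

π'[q] = 0          if π[q] = 0,
π'[q] = π'[π[q]]   if π[q] ≠ 0 and P[π[q]+1] = P[q+1],
π'[q] = π[q]       if π[q] ≠ 0 and P[π[q]+1] ≠ P[q+1].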

Explain why the modified algorithm is correct, and explain in what sense this change constitutes an improvement.

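π' replaces π only in line 7. Correctness follows if, once the while loop of lines 6-7 has finished, q has the same value before and after the modification.

The while loop of lines 6-7 does the following: when P[q+1] != T[i], it looks for the largest k such that k < q, P_k is a suffix of P_q, and P[k+1] == T[i]. In implementation terms, it walks through π*[q] in decreasing order looking for a k with P[k+1] == T[i].

π' differs from π mainly in that it looks for the largest k such that k < q, P_k is a suffix of P_q, and P[k+1] != P[q+1]. So π'*[q] is a subset of π*[q] from which the values with P[k+1] == P[q+1] have been removed.

In other words, once π is replaced by π', the walk through π'*[q] skips the values of k with P[k+1] == P[q+1]. Since P[q+1] != T[i] is already known, these are exactly the values that cannot succeed. The change therefore does not affect the result of the algorithm; it is an optimization that avoids comparisons that are bound to fail.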
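As a small illustration, π' can be derived from π in one pass. This C sketch assumes the three-case recursive definition from CLRS (0 when π[q] = 0; π'[π[q]] when P[π[q]+1] = P[q+1]; π[q] otherwise); the function names `compute_prefix` and `compute_pi_prime` are made up for this note. Arrays are indexed 1..m, while the pattern string itself is 0-indexed.

```c
/* Fills pi[1..m] with the prefix function of p (a string of length m). */
void compute_prefix(const char *p, int m, int pi[])
{
    pi[1] = 0;
    for (int q = 2, k = 0; q <= m; q++) {
        while (k > 0 && p[k] != p[q - 1])
            k = pi[k];
        if (p[k] == p[q - 1])
            k++;
        pi[q] = k;
    }
}

/* Fills pi2[1..m-1] with pi' computed from pi[1..m]. */
void compute_pi_prime(const char *p, int m, const int pi[], int pi2[])
{
    for (int q = 1; q < m; q++) {
        int k = pi[q];
        if (k == 0)
            pi2[q] = 0;
        else if (p[k] == p[q])   /* P[pi[q]+1] == P[q+1] in 1-indexed terms */
            pi2[q] = pi2[k];     /* k < q, so pi2[k] is already known */
        else
            pi2[q] = k;
    }
}
```

For P = aaaa every π'[q] is 0, whereas π is 0, 1, 2, 3: after a mismatch the modified matcher falls back to the start at once instead of retrying the same character three times.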

32.4-7 (not completed)

Give a linear-time algorithm to determine whether a text T is a cyclic rotation of another string T'. For example, arc and car are cyclic rotations of each other.

32.4-8

Give an O(m|Σ|)-time algorithm for computing the transition function δ for the string-matching automaton corresponding to a given pattern P. (Hint: Prove that δ(q,a) = δ(π(q),a) if q = m or P[q+1] != a.)

1. π = COMPUTE-PREFIX-FUNCTION(P)
2. for a ∈ Σ do
3.     δ(0,a) = 0
4. end for
5. δ(0,P[1]) = 1
6. for a ∈ Σ do
7.     for q = 1 to m do
8.           if q == m or P[q+1] != a then
9.                δ(q,a) = δ(π[q],a)
10.           else
11.                δ(q,a) = q + 1
12.           end if
13.     end for
14. end for

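Lines 2-5 assign δ(0,a) for every a ∈ Σ. These assignments are clearly correct and need no further explanation.

The loops of lines 6 and 7 run over the alphabet and over the states. Because π[q] < q, the value δ(π[q],a) used in line 9 has always been computed already. The test in line 8 is discussed case by case:

  • When the test is false, that is, P[q+1] == a, we set δ(q,a) = q + 1, which is easy to see.

  • When the test is true because q == m, see the proof of δ(m,a) = δ(π(m),a) given below (it comes from the supplementary notes on the KMP algorithm).

  • When the test is true with q != m and P[q+1] != a, the same proof of δ(m,a) = δ(π(m),a) applies once m is replaced by q. The proof never uses the fact that P_m = P, so it also holds in the general case. My own reading of this case follows.

Theorem. Let δ be the transition function of the string-matching automaton for the pattern P, defined as above, with P.length = m. Then δ(m,a) = δ(π(m),a).

Proof.

  • By definition, δ(m,a) = σ(P_m a) and δ(π(m),a) = σ(P_π(m) a).
  • Since π(q) = max{k : k < q and P_k is a suffix of P_q}, P_π(m) is a suffix of P_m, so P_π(m) a is a suffix of P_m a. Since σ(x) = max{k : P_k is a suffix of x}, σ of a string is never smaller than σ of any of its suffixes. Hence σ(P_m a) ≥ σ(P_π(m) a).
  • For the converse, let r = σ(P_m a). If r = 0, then σ(P_π(m) a) ≥ r holds trivially. Otherwise P_r is a suffix of P_m a, so P[r] = a and P_(r-1) is a suffix of P_m with r - 1 < m. Since π(m) = max{k : k < m and P_k is a suffix of P_m}, we get π(m) ≥ r - 1. Both P_(r-1) and P_π(m) are suffixes of P_m and r - 1 ≤ π(m), so P_(r-1) is a suffix of P_π(m), and therefore P_r = P_(r-1) a is a suffix of P_π(m) a. That is, r ∈ {k : P_k is a suffix of P_π(m) a}, so σ(P_π(m) a) ≥ r = σ(P_m a).
  • From σ(P_π(m) a) ≥ σ(P_m a) and σ(P_m a) ≥ σ(P_π(m) a) we conclude δ(m,a) = δ(π(m),a).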
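The O(m|Σ|) construction can be sketched in C as follows. The two-letter alphabet {a, b}, the length limit `MAX_M` and the function name `build_delta` are assumptions chosen to keep the illustration short; the pattern string is 0-indexed, so P[q+1] corresponds to p[q].

```c
#include <string.h>

enum { SIGMA = 2, MAX_M = 64 };  /* alphabet {a, b}; illustrative limits */

/* Fills delta[0..m][0..SIGMA-1] for pattern p over {a, b}.
 * Returns m, or -1 if p is empty or longer than MAX_M. */
int build_delta(const char *p, int delta[][SIGMA])
{
    int m = (int)strlen(p);
    int pi[MAX_M + 1];

    if (m == 0 || m > MAX_M)
        return -1;
    pi[1] = 0;
    for (int q = 2, k = 0; q <= m; q++) {  /* prefix function, pi[1..m] */
        while (k > 0 && p[k] != p[q - 1])
            k = pi[k];
        if (p[k] == p[q - 1])
            k++;
        pi[q] = k;
    }
    for (int a = 0; a < SIGMA; a++) {
        delta[0][a] = (p[0] == 'a' + a) ? 1 : 0;
        for (int q = 1; q <= m; q++) {
            if (q == m || p[q] != 'a' + a)
                delta[q][a] = delta[pi[q]][a];  /* pi[q] < q: already filled */
            else
                delta[q][a] = q + 1;
        }
    }
    return m;
}
```

For P = aabab this reproduces the transition table of Exercise 32.3-1, and each table entry is filled in constant time.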

