Problem

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Original problem: https://oj.leetcode.com/problems/repeated-dna-sequences/

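Straightforward method (TLE)

Algorithm analysis

Direct string matching: build a next array that stores, for each character in the string, the position of its next occurrence; during traversal, matching starts from the positions given by the next array.

For simplicity, consider strings of length 4.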

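case 1:

src A C G T A C G T

next [4] [5] [6] [7] [-1] [-1] [-1] [-1]

To match the string ACGT, it is enough to compare the 3 characters after position 0 with the 3 characters after next[0].

case 2:

src A C G T A A C G T

next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]

The first A has several later occurrences, so all of them must be tried: if the match at next[0] fails, next[next[0]] must be tried as well.

case 3:

src A A A A A A

next [1] [2] [3] [4] [5] [-1]

With repeats, once the match at next[0] succeeds, next[next[0]] can be set to -1: the length-4 string starting at next[0] has already been matched and need not be matched again. This only reduces duplicate results rather than eliminating them, so a set is still needed to store the successful matches and deduplicate them. Note that cutting the chain this way can also hide a later match for a different string that starts with the same letter (for example X Y X Y, where X and Y share a first letter), so simply stopping at the first match and leaving next unchanged is safer.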
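
As a minimal sketch of how the next arrays in these cases can be computed, the helper below (the name buildNext is hypothetical, not from the original) uses std::string::find and stores -1 where a character has no later occurrence, matching the notation of the tables.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: next[i] is the index of the next occurrence of s[i]
// after position i, or -1 if s[i] does not occur again.
std::vector<int> buildNext(const std::string& s) {
    std::vector<int> next(s.size(), -1);
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Search for the same character strictly to the right of i.
        std::size_t j = s.find(s[i], i + 1);
        if (j != std::string::npos) {
            next[i] = static_cast<int>(j);
        }
    }
    return next;
}
```

For "ACGTACGT" this returns 4 5 6 7 -1 -1 -1 -1, and for "AAAAAA" it returns 1 2 3 4 5 -1.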

Time complexity

Building the next array is bounded by O(n^2), and so is the traversal; the total time complexity is O(n^2).

Implementation

#include <cstddef>
#include <set>
#include <string>
#include <vector>

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    const std::size_t kLen = 10;
    std::vector<std::string> rel;

    // A string of at most 10 characters cannot contain a repeated 10-letter substring.
    if (s.length() <= kLen) {
        return rel;
    }

    // next[pos] is the index of the next occurrence of s[pos] after pos,
    // or std::string::npos if there is none. A local vector replaces the
    // raw member pointer, which was deleted uninitialized on early return.
    std::vector<std::size_t> next(s.length());
    for (std::size_t pos = 0; pos < s.length(); ++pos) {
        next[pos] = s.find(s[pos], pos + 1);
    }

    // The set removes duplicate results.
    std::set<std::string> tmpRel;

    for (std::size_t pos = 0; pos + kLen <= s.length(); ++pos) {
        // Follow the chain of later positions that start with the same letter.
        for (std::size_t nextPos = next[pos];
             nextPos != std::string::npos && nextPos + kLen <= s.length();
             nextPos = next[nextPos]) {
            if (s.compare(pos, kLen, s, nextPos, kLen) == 0) {
                tmpRel.insert(s.substr(pos, kLen));
                // Stop at the first later occurrence; next is left unchanged
                // so other start positions still see the full chain.
                break;
            }
        }
    }

    rel.assign(tmpRel.begin(), tmpRel.end());
    return rel;
}


Hash table plus bit manipulation method

(Suggested by the problem's Show Tags; runtime 10 ms!)

Algorithm analysis

First, encode A, C, G and T in binary:

A -> 00

C -> 01

G -> 10

T -> 11

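With this encoding, every 10-character string corresponds to a number, and a 10-character string takes 20 bits. An int is typically 4 bytes (32 bits), so it can hold the code of a 10-character string. For example: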

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
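
The encoding above can be sketched as follows. The helper names encodeBase and encodeWindow are hypothetical, and the sketch assumes the input contains only the letters A, C, G and T and is at least 10 characters long.

```cpp
#include <string>

// Hypothetical helper: 2-bit code for one nucleotide
// (A = 00, C = 01, G = 10, T = 11). Assumes c is one of "ACGT".
unsigned encodeBase(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Hypothetical helper: packs the first 10 characters of s into the low
// 20 bits of an unsigned int, first character in the highest bits.
// Assumes s.size() >= 10.
unsigned encodeWindow(const std::string& s) {
    unsigned value = 0;
    for (int i = 0; i < 10; ++i) {
        value = (value << 2) | encodeBase(s[i]);
    }
    return value;
}
```

With these helpers, encodeWindow("ACGTACGTAC") is 0x1B1B1 (binary 00011011000110110001) and encodeWindow("AAAAAAAAAA") is 0.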

A 20-bit binary number has at most 2^20 combinations, so the hash table needs 2^20 entries, i.e. 1024 * 1024. The hash table is therefore declared as bool hashTable[1024 * 1024];
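
A bool array of this size usually occupies 1 MiB, since sizeof(bool) is typically 1. As an alternative that is not part of the original solution, the sketch below keeps one flag per code in a std::vector<bool>, which implementations typically pack into bits (about 128 KiB here). The class name SeenTable and its method are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Number of distinct 20-bit codes.
const std::size_t kTableSize = static_cast<std::size_t>(1) << 20;

// Hypothetical wrapper: one flag per possible 20-bit code.
class SeenTable {
public:
    SeenTable() : bits_(kTableSize, false) {}

    // Marks code as seen and returns whether it had been seen before.
    // Assumes code < kTableSize.
    bool testAndSet(unsigned code) {
        bool wasSeen = bits_[code];
        bits_[code] = true;
        return wasSeen;
    }

private:
    std::vector<bool> bits_;
};
```

The first call to testAndSet for a given code returns false, and every later call for the same code returns true.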

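Traversing the string

Each time the window moves one character to the right, the int value of the substring is shifted left by 2 bits, its lowest 2 bits are set to the code of the new character, and finally the 2 bits above the 20-bit window (bits 20 and 21) are cleared. For example:

src CAAAAAAAAAC

subStr CAAAAAAAAA

int 01000000000000000000

subStr AAAAAAAAAC

int 00000000000000000001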
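
The window update described above can be sketched as a single function. The name slide is hypothetical, and the mask keeps the low 20 bits, which has the same effect as clearing bits 20 and 21 after the shift.

```cpp
// Hypothetical helper: slide the 10-letter window one character to the right.
// window holds the current 20-bit value; code is the 2-bit value of the
// incoming letter (A = 0, C = 1, G = 2, T = 3).
unsigned slide(unsigned window, unsigned code) {
    const unsigned kMask = (1u << 20) - 1;  // keep only the low 20 bits
    return ((window << 2) | code) & kMask;
}
```

Starting from CAAAAAAAAA (0x40000) and sliding in a C (code 1) gives 1, the value of AAAAAAAAAC.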

Time complexity

Traversing the string is O(n) and each hash table lookup is O(1), so the total time complexity is O(n).

Implementation

#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// One flag per possible 20-bit code: true once that 10-letter string has been seen.
bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    if (s.length() <= 10) {
        return rel;
    }

    // Map each letter (as an offset from 'A') to its 2-bit code.
    unsigned char convert[26] = {0};
    convert['A' - 'A'] = 0;  // 00
    convert['C' - 'A'] = 1;  // 01
    convert['G' - 'A'] = 2;  // 10
    convert['T' - 'A'] = 3;  // 11

    // Reset the table, since the global array is shared across calls.
    std::memset(hashMap, 0, sizeof(hashMap));

    // Encode the first 10 letters.
    int hashValue = 0;
    for (int pos = 0; pos < 10; ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
    }
    hashMap[hashValue] = true;

    // Codes already added to the result, so each sequence is reported once.
    std::unordered_set<int> strHashValue;

    // Slide the window one letter at a time.
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
        hashValue &= ~(0x300000);  // clear bits 20 and 21

        if (hashMap[hashValue]) {
            if (strHashValue.find(hashValue) == strHashValue.end()) {
                rel.push_back(s.substr(pos - 9, 10));
                strHashValue.insert(hashValue);
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}
