Problem
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
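Original problem link: https://oj.leetcode.com/problems/repeated-dna-sequences/
Straightforward method (TLE, Time Limit Exceeded)
Algorithm analysis
Direct string matching: build a next array that stores, for each letter of the string, the position where the same letter next occurs; during the traversal, candidate matches for a position are taken from the next array.
To keep things simple, consider strings of length 4.
Case 1: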
src A C G T A C G T
next [4] [5] [6] [7] [-1] [-1] [-1] [-1]
Matching the string ACGT then only requires comparing the 3 letters after position 0 with the 3 letters after next[0].
Case 2:
src A C G T A A C G T
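next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]
The letter A has several successors, so every one of them must be tried: when the match at next[0] fails, the match at next[next[0]] must be tried as well.
Case 3: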
src A A A A A A
next [1] [2] [3] [4] [5] [-1]
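Repetition: when the match at next[0] succeeds, next[next[0]] can be set to -1, because the length-4 string starting at next[0] has already been matched and need not be matched again. This only reduces duplicate matches and does not eliminate them, so a set is still needed to store the matched results and deduplicate them. Note that overwriting next also cuts the chain for other start positions whose chain passes through that position: in ACAGACAG with length-2 strings, matching AC at 0 and 4 sets next[4] to -1, so AG at 2 can no longer reach AG at 6. Keeping a separate "already matched" flag per position avoids this.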
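The three cases above can be reproduced with a short sketch. The helper names buildNext and laterMatches are hypothetical (they do not appear in the original code), the substring length is passed as a parameter so that the length-4 examples can be tried, and std::string::npos plays the role of -1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: next[i] is the index of the next occurrence of s[i]
// after position i, or std::string::npos (written as -1 in the tables).
std::vector<std::size_t> buildNext(const std::string& s) {
    std::vector<std::size_t> next(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        next[i] = s.find(s[i], i + 1);
    }
    return next;
}

// Hypothetical helper: follows the chain next[pos], next[next[pos]], ...
// and collects every later start position whose len-letter substring
// equals the one starting at pos. Precondition: pos < s.size().
std::vector<std::size_t> laterMatches(const std::string& s,
                                      const std::vector<std::size_t>& next,
                                      std::size_t pos, std::size_t len) {
    std::vector<std::size_t> hits;
    for (std::size_t p = next[pos];
         p != std::string::npos && p + len <= s.size();
         p = next[p]) {
        if (s.compare(pos, len, s, p, len) == 0) {
            hits.push_back(p);
        }
    }
    return hits;
}
```

Calling laterMatches(s, buildNext(s), 0, 4) on the case 2 string walks past the first A successor, which does not match, and reports the second one. The chain is left untouched here; deduplication is left to the caller.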
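Time complexity
Building the next array takes at most O(n^2) and the traversal takes O(n^2) in the worst case, so the total time complexity is O(n^2).
Code implementation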

#include <cstddef>
#include <set>
#include <string>
#include <vector>

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;

    // A repeat needs two different start positions, so the length must exceed 10.
    if (s.length() <= 10) {
        return rel;
    }

    const std::size_t n = s.length();

    // next[pos] = index of the next occurrence of s[pos] after pos, or npos.
    // A local vector replaces the raw member pointer, which leaked on repeated
    // calls and was deleted uninitialized when the function returned early.
    std::vector<std::size_t> next(n);
    for (std::size_t pos = 0; pos < n; ++pos) {
        next[pos] = s.find(s[pos], pos + 1);
    }

    // done[p]: the 10-letter substring starting at p is already recorded.
    // A flag is used instead of overwriting next[p], because cutting the chain
    // can hide a later occurrence of a different substring with the same first letter.
    std::vector<bool> done(n, false);

    // The set removes any remaining duplicates.
    std::set<std::string> tmpRel;

    for (std::size_t pos = 0; pos + 10 <= n; ++pos) {
        if (done[pos]) {
            continue;
        }
        // Try every later occurrence of s[pos] that still has 10 letters left.
        for (std::size_t nextPos = next[pos];
             nextPos != std::string::npos && nextPos + 10 <= n;
             nextPos = next[nextPos]) {
            if (s.compare(pos, 10, s, nextPos, 10) == 0) {
                tmpRel.insert(s.substr(pos, 10));
                done[nextPos] = true;
            }
        }
    }

    rel.assign(tmpRel.begin(), tmpRel.end());
    return rel;
}
Hash table plus bit manipulation method
(See the problem's Show Tags; runtime 10 ms!)
Algorithm analysis
First, encode A, C, G and T in binary:
A -> 00
C -> 01
G -> 10
T -> 11
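With this encoding, every 10-letter string corresponds to one number, and a 10-letter string takes 20 bits. An int usually has 4 bytes (32 bits), so one int can hold a 10-letter string. For example: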
ACGTACGTAC -> 00011011000110110001
AAAAAAAAAA -> 00000000000000000000
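The encoding can be sketched as a small function. The names code and encode are hypothetical, and mapping any letter other than C, G or T to 0 is an assumption of this sketch, since the problem only uses A, C, G and T.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code per nucleotide (A=00, C=01, G=10, T=11).
std::uint32_t code(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A' (and, by assumption, anything else)
    }
}

// Packs the letters of s into one integer, first letter in the highest bits.
// Intended for strings of at most 16 letters so the result fits in 32 bits.
std::uint32_t encode(const std::string& s) {
    std::uint32_t value = 0;
    for (char c : s) {
        value = (value << 2) | code(c);
    }
    return value;
}
```

For the first example, encode("ACGTACGTAC") yields the bit pattern 00011011000110110001, which is 0x1B1B1 in hexadecimal.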
A 20-bit binary number has at most 2^20 values, so the hash table has 2^20 entries, i.e. 1024 * 1024. It is declared as bool hashMap[1024 * 1024];
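As an illustration of this direct-address table, here is a minimal sketch that swaps the global bool array for a std::vector<bool>, which implementations typically pack to one bit per entry. The class name SeenTable and its testAndSet method are hypothetical, not part of the original solution.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical wrapper: one slot per possible 20-bit code.
class SeenTable {
public:
    SeenTable() : seen_(1u << 20, false) {}  // 2^20 = 1024 * 1024 slots

    // Marks code as seen and reports whether it had been seen before.
    // Precondition: code < 2^20.
    bool testAndSet(std::uint32_t code) {
        const bool before = seen_[code];
        seen_[code] = true;
        return before;
    }

private:
    std::vector<bool> seen_;
};
```

Because the table lives inside an object rather than in a global, each call to the solution could create a fresh one and would not need a memset to reset it.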
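Traversal design
Each time the window moves right by one letter, the int value of the substring is shifted left by 2 bits, its lowest 2 bits are set to the code of the new letter, and finally the 2 bits that overflowed above bit 19 (bits 20 and 21) are cleared. For example: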
src CAAAAAAAAAC
subStr CAAAAAAAAA
int 01000000000000000000
subStr AAAAAAAAAC
int 00000000000000000001
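The window update can be sketched as a single function. The name roll is hypothetical, and the sketch assumes the current value already fits in 20 bits and that the incoming letter has been converted to its 2-bit code (A=0, C=1, G=2, T=3).

```cpp
#include <cstdint>

// Hypothetical helper: slides a 10-letter window one letter to the right.
// value holds the current window in its low 20 bits.
std::uint32_t roll(std::uint32_t value, std::uint32_t newCode) {
    value <<= 2;        // make room; the outgoing letter moves to bits 20-21
    value |= newCode;   // the lowest 2 bits become the incoming letter
    value &= 0xFFFFFu;  // keep the low 20 bits, dropping the outgoing letter
    return value;
}
```

Starting from the code of CAAAAAAAAA (0x40000) and rolling in C (code 1) gives 1, which is the code of AAAAAAAAAC. Masking with 0xFFFFF has the same effect as clearing bits 20 and 21 when the input fits in 20 bits.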
Time complexity
Traversing the string is O(n) and each hash table operation is O(1), so the total time complexity is O(n).
Code implementation
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// One flag per possible 20-bit code: true once that 10-letter string was seen.
bool hashMap[1024 * 1024];

class Solution {
public:
    std::vector<std::string> findRepeatedDnaSequences(std::string s);
};

std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
    std::vector<std::string> rel;
    if (s.length() <= 10) {
        return rel;
    }

    // Map each letter to its 2-bit code; unused entries stay 0.
    // The input is assumed to contain only the uppercase letters A, C, G, T.
    unsigned char convert[26] = {0};
    convert['A' - 'A'] = 0;  // 00
    convert['C' - 'A'] = 1;  // 01
    convert['G' - 'A'] = 2;  // 10
    convert['T' - 'A'] = 3;  // 11

    // Reset the table, since it is a global shared across calls.
    std::memset(hashMap, 0, sizeof(hashMap));

    // Encode the first 10-letter window.
    unsigned int hashValue = 0;
    for (std::size_t pos = 0; pos < 10; ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
    }
    hashMap[hashValue] = true;

    // Codes already reported, so each repeated string is returned once.
    std::unordered_set<unsigned int> strHashValue;

    // Slide the window one letter at a time; the window ends at index pos.
    for (std::size_t pos = 10; pos < s.length(); ++pos) {
        hashValue <<= 2;
        hashValue |= convert[s[pos] - 'A'];
        hashValue &= ~0x300000u;  // clear bits 20-21 left by the outgoing letter

        if (hashMap[hashValue]) {
            // insert().second is true only the first time this code is reported.
            if (strHashValue.insert(hashValue).second) {
                rel.push_back(s.substr(pos - 9, 10));
            }
        } else {
            hashMap[hashValue] = true;
        }
    }

    return rel;
}
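Either solution can be cross-checked against a plain counting approach. The sketch below is not one of the two methods above: it is a hypothetical reference implementation (the name repeatedBruteForce is made up here) that counts every 10-letter substring in a std::map and returns those seen more than once, in sorted order.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical reference implementation for testing purposes.
std::vector<std::string> repeatedBruteForce(const std::string& s) {
    // Count each 10-letter substring; the map keeps keys sorted.
    std::map<std::string, int> counts;
    for (std::size_t pos = 0; pos + 10 <= s.size(); ++pos) {
        ++counts[s.substr(pos, 10)];
    }

    // Keep only the substrings that occur more than once.
    std::vector<std::string> result;
    for (const auto& entry : counts) {
        if (entry.second > 1) {
            result.push_back(entry.first);
        }
    }
    return result;
}
```

Since the hash table method returns results in order of discovery, sorting its output before comparing it with this reference is a reasonable way to test it on random A/C/G/T strings.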