[Java Character Sequences] Pattern

Reposted article, 2018-07-23 23:58. Tags: Java / character sequences / source code

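Introduction

Pattern is the compiled representation of a regular expression and a powerful tool for working with character sequences.

A compiled Pattern is a tree of nodes (the branching corresponds to '|' in the expression); in the common case it is simply a linked list. The basic element of the tree (or list) is Node, and Node has many subclasses, each implementing a different kind of match.

Example 1

We start with the simplest possible example and follow it into the source.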

    public static void example() {
        String regex = "EXAMPLE";
        String text = "HERE IS A SIMPLE EXAMPLE";
        Pattern pattern = Pattern.compile(regex, Pattern.LITERAL);
        Matcher matcher = pattern.matcher(text);
        matcher.find();
    }

This example searches the text for a substring.
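
The example above discards the result of find(). A minimal runnable variant that reports where the match was found might look like the following; the class and method names (LiteralFindDemo, firstMatch) are illustrative, not part of the original.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralFindDemo {
    // Returns {start, end} of the first literal occurrence of regex in text, or null if absent.
    static int[] firstMatch(String regex, String text) {
        Pattern pattern = Pattern.compile(regex, Pattern.LITERAL);
        Matcher matcher = pattern.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start(), matcher.end() };
    }

    public static void main(String[] args) {
        int[] span = firstMatch("EXAMPLE", "HERE IS A SIMPLE EXAMPLE");
        System.out.println(span[0] + ".." + span[1]); // prints 17..24
    }
}
```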

Pattern.compile(String regex)

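    public static Pattern compile(String regex) {
        return new Pattern(regex, 0);
    }

This method returns a Pattern object by calling the constructor. The two-argument overload used in the example, compile(String regex, int flags), does the same but passes the given flags instead of 0.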

Constructor

    private Pattern(String p, int f) {
        pattern = p;
        flags = f;

        if ((flags & UNICODE_CHARACTER_CLASS) != 0)
            flags |= UNICODE_CASE;

        capturingGroupCount = 1;
        localCount = 0;

        if (pattern.length() > 0) {
            compile();
        } else {
            root = new Start(lastAccept);
            matchRoot = lastAccept;
        }
    }

The constructor in turn calls compile().

compile()

    private void compile() {
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize(); // canonical normalization
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Store the pattern's code points in an int array;
        // the 2 extra slots mark the end of the pattern
        temp = new int[patternLength + 2];

        hasSupplementary = false;
        int c, count = 0;
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) { // supplementary character or unpaired surrogate
                hasSupplementary = true;
            }
            temp[count++] = c; // store in the array
        }

        patternLength = count; // now the number of code points

        if (!has(LITERAL))
            RemoveQEQuoting(); // handle \Q...\E quoting

        buffer = new int[32]; // allocate temporary objects
        groupNodes = new GroupHead[10]; // groups
        namedGroups = null;

        if (has(LITERAL)) { // plain text; Example 1 takes this branch
            matchRoot = newSlice(temp, patternLength, hasSupplementary); // Slice node
            matchRoot.next = lastAccept;
        } else {
            matchRoot = expr(lastAccept); // recursive-descent parse of the expression
            if (patternLength != cursor) { // error handling
                if (peek() == ')') {
                    throw error("Unmatched closing ')'");
                } else {
                    throw error("Unexpected internal error");
                }
            }
        }

        // For a plain-text Slice, try to replace it with a BnM node
        // (Boyer-Moore, an efficient substring search algorithm)
        if (matchRoot instanceof Slice) {
            root = BnM.optimize(matchRoot);
            if (root == matchRoot) {
                // Start and LastNode (lastAccept) are the generic head and tail nodes
                root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
            }
        } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
            // Begin and First are also node types; not covered here
            root = matchRoot;
        } else {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
        // Clean up
        temp = null;
        buffer = null;
        groupNodes = null;
        patternLength = 0;
        compiled = true;
    }

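First, the expression is normalized.
The code points of the characters are staged in an int array. A code point is the number assigned to each character in a character set, starting from 0; ASCII and Unicode are the common character sets.
A node of the appropriate type is returned.
On root versus matchRoot: root is used to search starting from any position in the given text, while matchRoot is used to match the whole input (from start to end).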
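
The code-point staging step can be tried out in isolation. The sketch below mirrors the loop in compile(); the class CodePointDemo is illustrative, and hasSupplementary is only an approximation of Pattern's private isSupplementary check, written here on the assumption that it tests for supplementary code points and surrogates.

```java
import java.util.Arrays;

class CodePointDemo {
    // Decodes a String into its code points, like the loop in compile().
    static int[] toCodePoints(String s) {
        int[] temp = new int[s.length()];
        int c, count = 0;
        for (int x = 0; x < s.length(); x += Character.charCount(c)) {
            c = s.codePointAt(x);
            temp[count++] = c;
        }
        return Arrays.copyOf(temp, count);
    }

    // Assumed equivalent of the private isSupplementary check, applied to every code point.
    static boolean hasSupplementary(String s) {
        for (int c : toCodePoints(s)) {
            if (c >= Character.MIN_SUPPLEMENTARY_CODE_POINT || Character.isSurrogate((char) c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (a surrogate pair), 'b'
        System.out.println(s.length() + " chars, " + toCodePoints(s).length + " code points");
        System.out.println(hasSupplementary(s));
    }
}
```

The string has 4 chars but only 3 code points, which is why compile() reassigns patternLength after the loop.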

We look first at the branch where the regular expression is plain text, as in the example.

newSlice(int[] buf, int count, boolean hasSupplementary)

    private Node newSlice(int[] buf, int count, boolean hasSupplementary) {
        int[] tmp = new int[count];
        if (has(CASE_INSENSITIVE)) {
            if (has(UNICODE_CASE)) {
                for (int i = 0; i < count; i++) {
                    tmp[i] = Character.toLowerCase(Character.toUpperCase(buf[i]));
                }
                return hasSupplementary ? new SliceUS(tmp) : new SliceU(tmp);
            }
            for (int i = 0; i < count; i++) {
                tmp[i] = ASCII.toLower(buf[i]);
            }
            return hasSupplementary ? new SliceIS(tmp) : new SliceI(tmp);
        }
        for (int i = 0; i < count; i++) {
            tmp[i] = buf[i];
        }
        return hasSupplementary ? new SliceS(tmp) : new Slice(tmp);
    }

Most of this method handles special cases such as case-insensitive matching. The last statement is the one that matters here: depending on hasSupplementary it creates either a SliceS or a Slice. Only the Slice case is considered below.
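
The flag combinations that newSlice distinguishes can be observed through the public API. The following is a small sketch; LiteralFlagsDemo and found are illustrative names, and the comments describe which branch each call is expected to reach based on the code above.

```java
import java.util.regex.Pattern;

class LiteralFlagsDemo {
    // Compiles regex with the given flags and reports whether it occurs anywhere in text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // LITERAL: '.' is an ordinary character, so "a.c" does not match "abc".
        System.out.println(found("a.c", Pattern.LITERAL, "abc"));
        // Case-sensitive branch (Slice): "example" does not match "EXAMPLE".
        System.out.println(found("example", Pattern.LITERAL, "A SIMPLE EXAMPLE"));
        // ASCII case-insensitive branch (SliceI).
        System.out.println(found("example", Pattern.LITERAL | Pattern.CASE_INSENSITIVE, "A SIMPLE EXAMPLE"));
    }
}
```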

The Slice data structure

    static final class Slice extends SliceNode {
        Slice(int[] buf) {
            super(buf);
        }

        boolean match(Matcher matcher, int i, CharSequence seq) {
            int[] buf = buffer;
            int len = buf.length;
            // Compare from the first character. Return false if the input runs out
            // or a character differs; otherwise call match on the next node
            for (int j = 0; j < len; j++) {
                if ((i + j) >= matcher.to) {
                    matcher.hitEnd = true;
                    return false;
                }
                if (buf[j] != seq.charAt(i + j))
                    return false;
            }
            return next.match(matcher, i + len, seq);
        }
    }

This class extends SliceNode and mainly implements match, which checks whether the text at position i equals the pattern, comparing one character at a time from the start.
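
Stripped of the Matcher bookkeeping, the comparison in Slice.match reduces to a bounded prefix check. The sketch below is a simplified stand-in, not the JDK code: matchesAt is a hypothetical helper, `to` plays the role of matcher.to, and the hand-off to the next node is omitted.

```java
class SliceMatchSketch {
    // Does buf occur in seq starting at offset i, without reading at or beyond index `to`?
    static boolean matchesAt(int[] buf, CharSequence seq, int i, int to) {
        for (int j = 0; j < buf.length; j++) {
            if (i + j >= to) {
                return false; // ran out of input (the JDK also sets matcher.hitEnd here)
            }
            if (buf[j] != seq.charAt(i + j)) {
                return false; // first differing character
            }
        }
        return true; // the JDK continues with next.match(matcher, i + len, seq)
    }

    public static void main(String[] args) {
        int[] buf = "EXAMPLE".codePoints().toArray();
        String text = "HERE IS A SIMPLE EXAMPLE";
        System.out.println(matchesAt(buf, text, 17, text.length())); // true
        System.out.println(matchesAt(buf, text, 0, text.length()));  // false
    }
}
```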

SliceNode

    static class SliceNode extends Node {
        int[] buffer;
        SliceNode(int[] buf) {
            buffer = buf;
        }
        boolean study(TreeInfo info) {
            info.minLength += buffer.length;
            info.maxLength += buffer.length;
            return next.study(info);
        }
    }

The base class of all Slice nodes; it extends Node. Its main method is study, which adds the slice length to both the minimum and maximum length in TreeInfo.

Node

    static class Node extends Object {
        Node next;

        Node() {
            next = Pattern.accept;
        }

        boolean match(Matcher matcher, int i, CharSequence seq) {
            matcher.last = i;
            matcher.groups[0] = matcher.first; // one group by default (slots [0..1])
            matcher.groups[1] = matcher.last;
            return true;
        }

        boolean study(TreeInfo info) { // suitable for zero-length assertions
            if (next != null) {
                return next.study(info);
            } else {
                return info.deterministic;
            }
        }
    }

The top-level node. Its match method always returns true; subclasses are expected to override it.

For group(int group), the call chain is: getSubSequence(groups[group * 2], groups[group * 2 + 1]) ---> CharSequence#subSequence(int start, int end).

Each pair of adjacent elements holds the start and end index of one group.
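
The paired layout can be reproduced with the public Matcher API. The sketch below builds an array with the same shape as described for groups, using start(g) and end(g); GroupIndexDemo and its methods are illustrative, and the array is a reconstruction rather than the Matcher's internal field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Returns {start0, end0, start1, end1, ...} for the first match, or an empty array.
    static int[] groupSpans(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return new int[0];
        }
        int[] spans = new int[(m.groupCount() + 1) * 2];
        for (int g = 0; g <= m.groupCount(); g++) {
            spans[g * 2] = m.start(g);
            spans[g * 2 + 1] = m.end(g);
        }
        return spans;
    }

    // Same arithmetic as the call chain above: slots g*2 and g*2+1 delimit group g.
    static String group(String text, int[] spans, int g) {
        return text.subSequence(spans[g * 2], spans[g * 2 + 1]).toString();
    }

    public static void main(String[] args) {
        String text = "tel 123-4567";
        int[] spans = groupSpans("(\\d+)-(\\d+)", text);
        System.out.println(group(text, spans, 1) + " / " + group(text, spans, 2)); // 123 / 4567
    }
}
```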

Back in compile, the next step is the call to BnM.optimize(matchRoot).

BnM

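Extends Node.

    static class BnM extends Node {}

Fields

        int[] buffer; // the pattern as an array of code points
        // Bad-character table. Each pattern character, in order from index 0, is recorded
        // at index (code point mod 128), since the table has 128 entries. The stored value
        // is i + 1, the 1-based position of the last occurrence (0 if absent)
        int[] lastOcc;
        // Good-suffix table, same length as the pattern. optoSft[j] is the shift
        // to apply when the mismatch occurs at pattern position j
        int[] optoSft;

Constructor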

        BnM(int[] src, int[] lastOcc, int[] optoSft, Node next) {
            this.buffer = src;
            this.lastOcc = lastOcc;
            this.optoSft = optoSft;
            this.next = next;
        }

optimize(Node node)

        static Node optimize(Node node) {
            if (!(node instanceof Slice)) {
                return node;
            }

            int[] src = ((Slice) node).buffer;
            int patternLength = src.length;
            if (patternLength < 4) {
                return node;
            }
            int i, j, k; // k is unused
            int[] lastOcc = new int[128];
            int[] optoSft = new int[patternLength];
            for (i = 0; i < patternLength; i++) { // build the bad-character table
                // If two different characters land on the same index, the earlier one
                // inherits the later one's (larger) value, so the resulting shift is
                // smaller and no match can be skipped. Capping the table at 128 entries
                // trades a little time for space, and still covers all of ASCII
                lastOcc[src[i] & 0x7F] = i + 1;
            }
            NEXT: for (i = patternLength; i > 0; i--) { // build the good-suffix table
                // Scan from the end, trying every shift i; a suffix counts only
                // if it lines up with an equal substring i positions to the left
                for (j = patternLength - 1; j >= i; j--) {
                    if (src[j] == src[j - i]) {
                        optoSft[j - 1] = i;
                    } else {
                        continue NEXT;
                    }
                }
                while (j > 0) { // fill the remaining slots
                    optoSft[--j] = i;
                }
            }
            optoSft[patternLength - 1] = 1;
            if (node instanceof SliceS)
                return new BnMS(src, lastOcc, optoSft, node.next);
            return new BnM(src, lastOcc, optoSft, node.next);
        }

Preprocessing: this builds the bad-character table and the good-suffix table.
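
To see what the two tables contain for a concrete pattern, the table-building loops can be lifted out of optimize into a standalone class. This is a sketch: BnMTablesSketch is an illustrative name, and the loops are copied from the method above with only the surrounding Node handling removed.

```java
import java.util.Arrays;

class BnMTablesSketch {
    // Bad-character table: 1-based index of the last occurrence, keyed by code point mod 128.
    static int[] lastOcc(int[] src) {
        int[] lastOcc = new int[128];
        for (int i = 0; i < src.length; i++) {
            lastOcc[src[i] & 0x7F] = i + 1;
        }
        return lastOcc;
    }

    // Good-suffix table: optoSft[j] is the shift used when the mismatch is at position j.
    static int[] optoSft(int[] src) {
        int patternLength = src.length;
        int[] optoSft = new int[patternLength];
        int i, j;
        NEXT: for (i = patternLength; i > 0; i--) {
            for (j = patternLength - 1; j >= i; j--) {
                if (src[j] == src[j - i]) {
                    optoSft[j - 1] = i;
                } else {
                    continue NEXT;
                }
            }
            while (j > 0) {
                optoSft[--j] = i;
            }
        }
        optoSft[patternLength - 1] = 1;
        return optoSft;
    }

    public static void main(String[] args) {
        int[] src = "EXAMPLE".codePoints().toArray();
        System.out.println(Arrays.toString(optoSft(src))); // [6, 6, 6, 6, 6, 6, 1]
        System.out.println(lastOcc(src)['E'] + " " + lastOcc(src)['P']); // 7 5
    }
}
```

For EXAMPLE, every mismatch position except the last gets a shift of 6, because the only part of any matched suffix that reappears in the pattern is the final E, which also sits at index 0.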

        boolean match(Matcher matcher, int i, CharSequence seq) {
            int[] src = buffer;
            int patternLength = src.length;
            int last = matcher.to - patternLength;

            NEXT: while (i <= last) {
                for (int j = patternLength - 1; j >= 0; j--) { // compare from right to left
                    int ch = seq.charAt(i + j);
                    if (ch != src[j]) {
                        // Shift by the larger of the bad-character and good-suffix values
                        i += Math.max(j + 1 - lastOcc[ch & 0x7F], optoSft[j]);
                        continue NEXT;
                    }
                }
                matcher.first = i;
                boolean ret = next.match(matcher, i + patternLength, seq);
                if (ret) {
                    matcher.first = i;
                    // One group by default (two indices delimit a span, so 2 slots suffice)
                    matcher.groups[0] = matcher.first;
                    matcher.groups[1] = matcher.last;
                    return true;
                }
                i++;
            }
            matcher.hitEnd = true;
            return false;
        }

The match method searches for the substring using the Boyer-Moore algorithm.
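
The search loop can be traced on the data from Example 1. The sketch below is a simplified copy of the loop in match, with the Matcher state and the next-node call removed and each shift recorded in a list. The names BnMSearchSketch, search and lastOccOf are illustrative; the good-suffix table for EXAMPLE is hard-coded as {6, 6, 6, 6, 6, 6, 1}, which is what the loops in optimize produce for that pattern. It assumes the text contains no supplementary characters.

```java
import java.util.ArrayList;
import java.util.List;

class BnMSearchSketch {
    // Bad-character table as built in optimize().
    static int[] lastOccOf(int[] src) {
        int[] lastOcc = new int[128];
        for (int i = 0; i < src.length; i++) {
            lastOcc[src[i] & 0x7F] = i + 1;
        }
        return lastOcc;
    }

    // Returns the index of the first occurrence, or -1; every shift taken is appended to shifts.
    static int search(int[] src, int[] lastOcc, int[] optoSft, CharSequence seq, List<Integer> shifts) {
        int patternLength = src.length;
        int last = seq.length() - patternLength;
        int i = 0;
        NEXT: while (i <= last) {
            for (int j = patternLength - 1; j >= 0; j--) { // right to left
                int ch = seq.charAt(i + j);
                if (ch != src[j]) {
                    int shift = Math.max(j + 1 - lastOcc[ch & 0x7F], optoSft[j]);
                    shifts.add(shift);
                    i += shift;
                    continue NEXT;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] src = "EXAMPLE".codePoints().toArray();
        int[] optoSft = {6, 6, 6, 6, 6, 6, 1};
        List<Integer> shifts = new ArrayList<>();
        int index = search(src, lastOccOf(src), optoSft, "HERE IS A SIMPLE EXAMPLE", shifts);
        System.out.println(index + " " + shifts); // 17 [7, 2, 6, 2]
    }
}
```

The match at index 17 is reached after only four shifts (7, 2, 6, 2) instead of seventeen single-character steps.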

study

        boolean study(TreeInfo info) {
            info.minLength += buffer.length;
            info.maxValid = false;
            return next.study(info);
        }

The Boyer-Moore algorithm

For background, see a standard description of the algorithm.

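The algorithm's defining feature is that it compares from right to left, which lets it shift the pattern by more than one character at a time. Two rules determine the shift, bad character and good suffix, and the larger of the two values is used.

Bad character

Comparison starts at the rightmost character of the pattern, against the text character at the same position. If they are equal, the comparison moves left until either the whole pattern has been compared (a match) or an unequal character is found. The unequal character in the text is called the bad character. The shift depends on whether the pattern contains the bad character and where, according to this formula: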

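shift = position of the bad character - position of its last occurrence in the pattern

If the bad character does not occur in the pattern, the last occurrence is taken to be -1.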
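
The formula translates directly into code using 0-based positions. In the sketch below, BadCharShiftDemo and shift are illustrative names; String.lastIndexOf conveniently returns -1 when the character is absent, matching the convention above. It is equivalent to the JDK's j + 1 - lastOcc[ch], since lastOcc stores the 1-based position.

```java
class BadCharShiftDemo {
    // j is the pattern position where the mismatch occurred; badChar is the text character there.
    static int shift(String pattern, int j, char badChar) {
        return j - pattern.lastIndexOf(badChar);
    }

    public static void main(String[] args) {
        System.out.println(shift("EXAMPLE", 6, 'S')); // 7: 'S' is not in the pattern, 6 - (-1)
        System.out.println(shift("EXAMPLE", 6, 'P')); // 2: 'P' is at index 4, 6 - 4
        System.out.println(shift("EXAMPLE", 2, 'E')); // -4: last 'E' is right of the mismatch
    }
}
```

The last call shows that this rule alone can yield a non-positive shift, which is one reason the larger of the two rules is used.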

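Good suffix

During the right-to-left comparison, the part of the text that has already matched is called the good suffix. Every suffix of the longest good suffix is also a good suffix, but a shorter one only counts if it also occurs at the head of the pattern. The formula is:

shift = position of the good suffix - position of its previous occurrence in the pattern

The position of the good suffix is that of its last character.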

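Analysis

Both rules serve the same purpose: to shift as far as possible so that matching is fast, without affecting correctness.

The bad-character rule is easy to understand. If the pattern does not contain the bad character, the pattern can be moved entirely past it, which is a shift of the full pattern length when the mismatch is at the last position, and that is the largest possible shift. Any smaller shift would leave the pattern overlapping the bad character, which would mismatch again, so the pattern is moved directly to the position after the bad character.

If the pattern does contain the bad character, the pattern's own occurrence of that character has to be aligned with it; aligning any other character there would just mismatch again. If the pattern contains it more than once, the rightmost occurrence is the one to align, so that no possible match is skipped and the pattern never has to move back to the left. If that alignment also fails, the pattern simply keeps moving right, with no backtracking.

The good-suffix rule is similar. If the good suffix does not occur at the head of the pattern, the pattern can be shifted by its full length; if it does, it is enough to align that occurrence with the good suffix.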
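
The good-suffix reasoning can be checked with a brute-force restatement: the shift is the smallest amount by which the pattern can move so that every already-matched position either lines up with an equal pattern character or falls off the pattern's left edge. The sketch below is that restatement, not the JDK's table construction; GoodSuffixShiftDemo and shift are illustrative names.

```java
class GoodSuffixShiftDemo {
    // j is the mismatch position; positions j+1 .. m-1 form the matched (good) suffix.
    static int shift(String pattern, int j) {
        int m = pattern.length();
        for (int s = 1; s < m; s++) {
            boolean consistent = true;
            for (int p = m - 1; p > j && consistent; p--) {
                // After shifting by s, position p of the text window faces pattern[p - s].
                consistent = p - s < 0 || pattern.charAt(p - s) == pattern.charAt(p);
            }
            if (consistent) {
                return s;
            }
        }
        return m; // nothing reusable: move the whole pattern past the window
    }

    public static void main(String[] args) {
        // Good suffix "MPLE" (mismatch at index 2): only its suffix "E" recurs, at the head.
        System.out.println(shift("EXAMPLE", 2)); // 6, i.e. position 6 minus previous occurrence 0
        System.out.println(shift("EXAMPLE", 6)); // 1, nothing matched yet
    }
}
```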

Node chains

matches()

matchRoot -> Slice -> LastNode -> Node

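Slice and Node were both covered above. A Slice node compares from the first character; if the input runs out or a character differs it returns false, otherwise it calls match on its next node, which here is LastNode.

The match method of Node always returns true.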

LastNode

    static class LastNode extends Node {
        boolean match(Matcher matcher, int i, CharSequence seq) {
            // With ENDANCHOR the whole input must match, so i has to equal
            // matcher.to, the index just past the last character
            if (matcher.acceptMode == Matcher.ENDANCHOR && i != matcher.to)
                return false;
            matcher.last = i;
            matcher.groups[0] = matcher.first;
            matcher.groups[1] = matcher.last;
            return true;
        }
    }

This is a general-purpose node that performs the final check on the result. Note the acceptMode field, which distinguishes a full match from a partial match.

find()

root -> BnM -> LastNode -> Node

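As the BnM node shows, matching can start at any valid position, which makes this a substring search. Since acceptMode is not ENDANCHOR, LastNode does not need to check whether i points at the end of the input.

All of these nodes are shown above.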
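
The difference between the two chains is visible from the public API: matches() requires the whole input to match, while find() accepts a match anywhere. A minimal sketch with illustrative names (MatchesVsFindDemo, matchesWhole, findsAnywhere):

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // Whole-input match: the matches() chain, where LastNode checks the end position.
    static boolean matchesWhole(String literal, String text) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(text).matches();
    }

    // Substring search: the find() chain, which may start at any position.
    static boolean findsAnywhere(String literal, String text) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "HERE IS A SIMPLE EXAMPLE";
        System.out.println(matchesWhole("EXAMPLE", text));      // false
        System.out.println(findsAnywhere("EXAMPLE", text));     // true
        System.out.println(matchesWhole("EXAMPLE", "EXAMPLE")); // true
    }
}
```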

Example 2

    public static void example() {
        String regex = "\\d+";
        String text = "0123456789";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        matcher.find();
    }

This example matches digits.
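
As with Example 1, a runnable variant that returns the matched text makes the behaviour easier to check. The class and method names (DigitsFindDemo, firstDigits) are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitsFindDemo {
    // Returns the first run of one or more digits in text, or null if there is none.
    static String firstDigits(String text) {
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDigits("0123456789"));   // 0123456789
        System.out.println(firstDigits("abc 42 def 7")); // 42
    }
}
```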

Tracing the calls, the path is much the same as in Example 1 and ends up in compile, which this time calls expr(Node end).

expr(Node end)
