Modifying the Pattern source so that Java regex group names support the underscore '_'

Reposted article. 2017-09-07 23:03. Tags: java, regex

Why

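My job involves data ETL, so I often use regular expressions to extract data. However, group names in Java regular expressions do not support '_'. The official documentation says: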

Group name

A capturing group can also be assigned a "name", a named-capturing group, and then be back-referenced later by the "name". Group names are composed of the following characters. The first character must be a letter.

The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
The digits '0' through '9' ('\u0030' through '\u0039'),
A named-capturing group is still numbered as described in Group number.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

As you can see, only the uppercase letters A-Z, the lowercase letters a-z, and the digits 0-9 are supported.
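
A quick way to confirm the restriction is to compile two patterns with the stock java.util.regex.Pattern, one with a plain group name and one with an underscore in the name. The following is a minimal sketch; the class name StockPatternCheck and the helper compiles are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class StockPatternCheck {
    /** Returns true if the stock java.util.regex.Pattern accepts the regex. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // An invalid group name is reported as a syntax error.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(?<myname>\\d+)"));   // prints "true"
        System.out.println(compiles("(?<my_name>\\d+)"));  // prints "false"
    }
}
```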

Finding the source code

The source of the java.util.regex.Pattern class (line 2789 in JDK 1.8.141) contains the following method:

    /**
     * Parses and returns the name of a "named capturing group", the trailing
     * ">" is consumed after parsing.
     */
    private String groupname(int ch) {
        StringBuilder sb = new StringBuilder();
        sb.append(Character.toChars(ch));
        while (ASCII.isLower(ch=read()) || ASCII.isUpper(ch) ||
               ASCII.isDigit(ch)) {
            sb.append(Character.toChars(ch));
        }
        if (sb.length() == 0)
            throw error("named capturing group has 0 length name");
        if (ch != '>')
            throw error("named capturing group is missing trailing '>'");
        return sb.toString();
    }

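As you can see, the source extracts the group name in a while loop. As long as the character just read is a lowercase letter (ASCII.isLower), an uppercase letter (ASCII.isUpper), or a digit (ASCII.isDigit), it is appended to the StringBuilder and the next character is read. This continues until the condition no longer holds.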
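
To see why an underscore leads to an error, the loop can be imitated outside of Pattern. The sketch below is a hypothetical stand-in, not the JDK code: scanGroupName and isAsciiAlnum are invented names, and it works on a string index rather than on Pattern's internal cursor.

```java
class GroupNameScanSketch {
    /**
     * Hypothetical stand-in for Pattern's private groupname(): scans a group
     * name starting at index start and expects '>' immediately after it.
     */
    static String scanGroupName(String pattern, int start) {
        StringBuilder sb = new StringBuilder();
        int i = start;
        // Same acceptance test as the original loop: ASCII letters and digits only.
        while (i < pattern.length() && isAsciiAlnum(pattern.charAt(i))) {
            sb.append(pattern.charAt(i));
            i++;
        }
        if (sb.length() == 0)
            throw new IllegalArgumentException("named capturing group has 0 length name");
        if (i >= pattern.length() || pattern.charAt(i) != '>')
            throw new IllegalArgumentException("named capturing group is missing trailing '>'");
        return sb.toString();
    }

    static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        // "(?<" occupies indices 0..2, so the name starts at index 3.
        System.out.println(scanGroupName("(?<myname>\\d+)", 3)); // prints "myname"
        try {
            scanGroupName("(?<my_name>\\d+)", 3);
        } catch (IllegalArgumentException e) {
            // The scan stops at '_', which is not the expected '>'.
            System.out.println(e.getMessage());
        }
    }
}
```

The scan itself never rejects '_' explicitly. It simply stops there, and the check for the trailing '>' then fails.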

Modifying the source code

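Now that the cause is known, how do we change it?
Many people say you should not modify code written by the experts, but there was no other way.
The lack of '_' support causes quite a few problems at work. For example, database column names contain '_'. If group names cannot contain underscores, you either need a mapping between regex group names and column names, or you have to give up group names and map by group index (0, 1, 2, ...) instead. Both are tedious.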
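
For comparison, the mapping workaround described above might look like the following sketch with the stock Pattern. Everything in it is assumed for illustration: the class name, the group names logTime and workerName, and the column names log_time and worker_name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupToColumnMapping {
    // Hypothetical mapping from underscore-free group names to column names.
    static final Map<String, String> GROUP_TO_COLUMN = new LinkedHashMap<>();
    static {
        GROUP_TO_COLUMN.put("logTime", "log_time");
        GROUP_TO_COLUMN.put("workerName", "worker_name");
    }

    static final Pattern LINE = Pattern.compile(
        "(?<logTime>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s\\|\\s(?<workerName>worker_\\d+)\\s\\|");

    /** Returns one row keyed by column name, or an empty map if the line does not match. */
    static Map<String, String> extract(String line) {
        Map<String, String> row = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line);
        if (m.find()) {
            for (Map.Entry<String, String> e : GROUP_TO_COLUMN.entrySet()) {
                // Look up the value by group name, store it under the column name.
                row.put(e.getValue(), m.group(e.getKey()));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(extract("2017-02-14 23:58:04 | worker_10 | [ATMP05]"));
    }
}
```

Every new column needs an entry in both the pattern and the map, which is the extra bookkeeping the article wants to avoid.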
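The change itself is simple. Because the Pattern class is declared final in the source, you cannot subclass it and override this method (which is private in any case). The only option is to create a regex package in your own project and copy all the classes from the java.util.regex package into it, six in total.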

Then modify the method shown above in the copied Pattern. The character '_' is 95 in ASCII, so adding one more check is enough:

    private String groupname(int ch) {
        StringBuilder sb = new StringBuilder();
        sb.append(Character.toChars(ch));
        // Changed: added ch == '_' (ASCII 95) so that group names may contain
        // underscores; the original is line 2793 of java.util.regex.Pattern.
        while (ASCII.isLower(ch=read()) || ASCII.isUpper(ch) ||
               ASCII.isDigit(ch) || ch == '_') {
            sb.append(Character.toChars(ch));
        }
        if (sb.length() == 0)
            throw error("named capturing group has 0 length name");
        if (ch != '>')
            throw error("named capturing group is missing trailing '>'");
        return sb.toString();
    }
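
The extended condition can also be checked in isolation. The sketch below is only an illustration: isGroupNameChar is a made-up helper that mirrors the modified loop condition using plain character comparisons rather than the package-private ASCII class.

```java
class GroupNameCharCheck {
    /** Mirrors the modified loop condition: ASCII letter, digit, or '_' (95). */
    static boolean isGroupNameChar(int ch) {
        return (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z')
            || (ch >= '0' && ch <= '9')
            || ch == 95;
    }

    public static void main(String[] args) {
        System.out.println((int) '_');            // prints "95"
        System.out.println(isGroupNameChar('_')); // prints "true"
        System.out.println(isGroupNameChar('-')); // prints "false"
    }
}
```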

Now we can use our own Pattern class, and in the end it runs successfully:

    // Pattern and Matcher here are the copied classes from your own regex
    // package, not java.util.regex; import them from wherever you placed the copy.
    public class MyTest {
        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("\\s\\|\\s(?<my_name>worker_\\d+)\\s\\|");
            Matcher matcher = pattern.matcher("2017-02-14 23:58:04 | worker_10 | [ATMP05]");
            if (matcher.find()) {
                // prints "worker_10"
                System.out.println(matcher.group("my_name"));
            }
        }
    }

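In the end, only a small part of the source was changed, yet it made the work much easier.
Of course, whether this change affects anything else remains to be seen over time.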
