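While working through an exercise today, I felt that the regular expression given in the problem could not correctly detect some invalid email addresses, so I decided to write my own. While testing it I produced a brand-new BUG, and that got me started on learning about regular expressions and their matching engines.
1. What is a backtracking loop
I'll quote regular-expressions.info and paraphrase each passage to explain the phenomenon.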
REF: https://www.regular-expressions.info/catastrophic.html
Consider the regular expression (x+x+)+y. Before you scream in horror and say this contrived example should be written as xx+y to match exactly the same without those terribly nested quantifiers: just assume that each "x" represents something more complex, with certain strings being matched by both "x". See the section on HTML files below for a real example.
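Consider this example: (x+x+)+y. Before you scream that this contrived expression should be rewritten as xx+y, which does the same job without the scary nested quantifiers, assume that each "x" stands for something more complex, and that certain strings can be matched by both "x". Here is how it plays out: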
Let's see what happens when you apply this regex to xxxxxxxxxxy. The first x+ will match all 10 x characters. The second x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining x. The group has now matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches y and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer explodes. Almost.
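When you apply the expression above to the string "xxxxxxxxxxy":
First attempt: the first "x+" matches all 10 x's, so the second x+ has nothing left to match (+ requires at least one match) and fails. The engine backtracks, and the first x+ gives up one x;
Second attempt: the first "x+" matches 9 x's and the second x+ matches the remaining one, so the group has matched once. The engine tries a second iteration of the group, which fails at the first x+. Because the + on the group only requires one iteration, the group still counts as matched, the trailing y matches y, and the whole expression succeeds. The regex is declared working and shipped to your customer, and then his computer explodes.
Well, almost.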
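The successful match described above can be checked with a short sketch in Python, whose built-in re module is a backtracking engine. The extra capture groups around each x+ are my own addition so that the split between them becomes visible; they are not part of the quoted regex.

```python
import re

# (x+x+)+y with each x+ wrapped in its own capture group (added for inspection only).
pattern = re.compile(r"((x+)(x+))+y")

match = pattern.match("x" * 10 + "y")

# Per the walkthrough, the first x+ backs off so that the second x+ can take one x.
first_x, second_x = match.group(2), match.group(3)
print(len(first_x), len(second_x))
```

The printed lengths show how many x's each x+ ended up with once the overall match succeeded.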
The above regex turns ugly when the y is missing from the subject string. When y fails, the regex engine backtracks. The group has one iteration it can backtrack into. The second x+ matched only one x, so it can't backtrack. But the first x+ can give up one x. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the y fails. Backtracking again, the second x+ now has one backtracking position, reducing itself to match x. The group tries a second iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's first iteration reduces itself to 7 characters. The second x+ matches xxx. Failing y, the second x+ is reduced to xx and then x. Now, the group can match a second iteration, with one x for each x+. But this (7,1),(1,1) combination fails too. So it goes to (6,4) and then (6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.
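The expression above turns ugly when the trailing y is missing. When y fails to match, the engine backtracks:
Note: (a,b) means the first x+ matched a x's and the second x+ matched b x's. x[1] denotes the first x+ and x[2] the second.
Attempt 1: (9,1). y does not match and the engine backtracks. x[2] matched only one x and has nothing to give back, but x[1] can give up one x;
Attempt 2: (8,2). y does not match and the engine backtracks. x[2] matched two x's, so it gives one back;
Attempt 3: (8,1). The group tries a second iteration: x[1] matches the last x, but x[2] is stuck at the end of the string. The engine backtracks and x[1] of the first iteration drops to 7;
Attempt 4: (7,3), then (7,2), then (7,1). At (7,1) the group can match a second iteration (1,1), but y still does not match and the engine backtracks;
After that: (6,4); (6,2)(1,1); (6,1)(2,1); (6,1)(1,2)... by now I think you get the drift¹.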
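To make the order of these attempts easier to follow, here is a small sketch that enumerates them. It is a simplified model of a greedy backtracking engine that I wrote for this one pattern, not a real regex engine, and the function name y_attempts is hypothetical. Each yielded state lists one (a,b) pair per iteration of the group, in the notation used above.

```python
def y_attempts(n):
    """Yield, in order, every state at which the model tries (and fails) to match y.

    The subject is n x's with no trailing y. A state is a list of (a, b) pairs,
    one pair per iteration of the group (x+x+).
    """
    def iterate(pos, state):
        remaining = n - pos
        # Greedy: the first x+ starts as long as it can while leaving one x for the second.
        for a in range(remaining - 1, 0, -1):
            # Greedy: the second x+ takes everything left, then gives characters back.
            for b in range(remaining - a, 0, -1):
                new_state = state + [(a, b)]
                # The + on the group tries one more iteration before falling through to y.
                yield from iterate(pos + a + b, new_state)
                yield new_state  # y is tried here and fails, since the string has no y

    yield from iterate(0, [])


states = list(y_attempts(10))
for state in states[:6]:
    print(state)
print(len(states), "failed attempts at y in this model")
```

The model counts only the points where y is tried, so its total is not comparable to the step counts a debugger reports; what it shows is the order in which the combinations are visited.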
If you try this regex on a 10x string in RegexBuddy's debugger, it'll take 2558 steps to figure out the final y is missing. For an 11x string, it needs 5118 steps. For 12, it takes 10238 steps. Clearly we have an exponential complexity of O(2^n) here. At 21x the debugger bows out at 2.8 million steps, diagnosing a bad case of catastrophic backtracking.
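If you test a string of 10 x's (with no y) in RegexBuddy's debugger, it takes 2558 steps to find out that the y is missing. With 11 x's it needs 5118 steps, and with 12 it needs 10238. This is clearly a case of O(2^n) complexity. At 21 x's the debugger bows out² after 2.8 million steps, diagnosing catastrophic backtracking.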
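The same growth can be observed with a real engine. The sketch below times Python's re module on strings of x's with no y. The step counts quoted above come from RegexBuddy and cannot be reproduced this way; the timings depend on the machine, so I give no expected numbers. The helper name time_failed_match is hypothetical.

```python
import re
import time

pattern = re.compile(r"(x+x+)+y")

def time_failed_match(n):
    """Return (match, seconds) for matching n x's with no trailing y."""
    start = time.perf_counter()
    match = pattern.match("x" * n)
    return match, time.perf_counter() - start

# Kept small on purpose: each extra x is expected to roughly double the work.
results = {n: time_failed_match(n) for n in range(12, 19)}
for n, (match, seconds) in results.items():
    print(f"{n} x's: match={match}, {seconds:.4f} s")
```

Raising the upper bound of the range by a handful of characters should make the slowdown obvious, so increase it with care.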
RegexBuddy is forgiving in that it detects it's going in circles, and aborts the match attempt. Other regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version 5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.
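RegexBuddy notices that it is going in circles and gives up the match attempt, but some regex engines (such as .NET) will keep trying forever, and others crash outright with a stack overflow (such as Perl before version 5.10). Stack overflows are especially annoying on Windows, because your application tends to vanish without a trace or explanation. So be very careful when running a web service that accepts user-supplied regular expressions: beginners have a surprising talent for writing exponentially complex expressions (translator's note: that would be me).
¹ get the drift: understand the general meaning or purport
² bow out: to leave a job or stop doing an activity, usually after a long time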
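From the above we can draw a few conclusions:
1. These regex engines match by backtracking: when the expression contains a symbol such as "+", which matches "one to unlimited times", and the engine reaches the end of the string before the expression has been satisfied, it backtracks, reduces the number of repetitions it matched earlier, and tries again.
2. An engine with no limit on long-running loops keeps going until every combination has been tried (brute force!)
So, to keep your expressions from being wrecked by odd input, make sure to:
1. Eliminate nested quantifiers, and avoid putting + on a part whose matches overlap with the part that follows it: for example ([\w]+[\w.-]+) --> ([\w][\w.-]+). This is especially important!
A token that matches exactly once gives the engine nothing to backtrack into, which avoids catastrophic backtracking.
2. [I haven't yet figured out what the expert meant here]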
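A rewrite like the one in rule 1 is only safe if both patterns accept the same strings. The sketch below spot-checks that for the two pairs mentioned in this post. The sample strings and the helper name same_language are my own, and passing a handful of samples is evidence rather than proof.

```python
import re

# Each pair: (pattern with overlapping quantifiers, simplified rewrite).
pairs = [
    (r"(x+x+)+y", r"xx+y"),                 # from the quoted article
    (r"([\w]+[\w.-]+)", r"([\w][\w.-]+)"),  # from rule 1 above
]

# Hypothetical sample inputs chosen to cover both accepted and rejected strings.
samples = ["", "x", "xy", "xxy", "xxxxy", "xxxx",
           "a", "ab", "a.b", ".a", "a-", "a_b.c-d", "-"]

def same_language(a, b, texts):
    """True if both patterns accept or reject every text identically under fullmatch."""
    return all(bool(re.fullmatch(a, t)) == bool(re.fullmatch(b, t)) for t in texts)

for original, rewritten in pairs:
    print(original, "->", rewritten, same_language(original, rewritten, samples))
```

Using fullmatch keeps the comparison strict: a pattern that merely matches a prefix of the sample does not count as accepting it.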
A real-world case (that is, the bug I wrote myself): matching email addresses
Regex: ([A-Za-z0-9]+([\w.-]?[A-Za-z0-9]+)*)+@([\w-]+\.)+([A-Za-z0-9]+)
Testcase: ADU5dX-532.dx_aerx@gcore..com
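The catastrophic backtracking starts once the domain part fails at the second "." in "gcore..com": ([\w-]+\.)+ matches "gcore.", but nothing in the pattern can match the "." that follows. The engine then backtracks into the part before the "@", where the nested quantifiers in ([A-Za-z0-9]+([\w.-]?[A-Za-z0-9]+)*)+ let the same characters be divided between the groups in a huge number of ways, and it retries the domain after each one. After 977945 attempts the system aborted the match.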
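The behaviour can be reproduced, roughly, with Python's re module, using the pattern and test case exactly as given. The valid variant of the address is my own addition for contrast. Python will not report 977945 steps, and how long the failing case takes depends on the machine, but on my reading it should be noticeably slower than the succeeding one.

```python
import re
import time

# Pattern and failing test case copied from the text above.
PATTERN = re.compile(r"([A-Za-z0-9]+([\w.-]?[A-Za-z0-9]+)*)+@([\w-]+\.)+([A-Za-z0-9]+)")
BAD = "ADU5dX-532.dx_aerx@gcore..com"
GOOD = "ADU5dX-532.dx_aerx@gcore.com"  # hypothetical valid variant, for contrast

def timed_match(text):
    """Return (match, seconds) for matching PATTERN at the start of text."""
    start = time.perf_counter()
    match = PATTERN.match(text)
    return match, time.perf_counter() - start

good_match, good_seconds = timed_match(GOOD)
bad_match, bad_seconds = timed_match(BAD)
print(f"valid:   match={good_match is not None}, {good_seconds:.6f} s")
print(f"invalid: match={bad_match is not None}, {bad_seconds:.6f} s")
```

The valid address succeeds on the engine's first greedy pass, while the invalid one forces it to exhaust every split of the part before the "@" before it can report failure.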
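How to fix it: using fewer "*", "+" and "()" greatly reduces the number of match attempts. Parentheses in a regular expression carry a meaning of their own, so they cannot be added or removed at will; doing so changes the match result or the number of attempts!
Corrected result: ([A-Za-z0-9]([\w.-]?[A-Za-z0-9])+)*)@([\w-]+\.)+([A-Za-z0-9]+)
As written, this has one more ")" than "(", so the parentheses need to be balanced before an engine will accept it.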
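For comparison, here is one way to write the pattern so that no character can be claimed by two different parts. It is a hypothetical alternative of mine, not the corrected result above: the separators (".", "_", "-") and the alphanumeric runs use disjoint character sets, so each input can be split in only one way. It is meant to accept the same ASCII addresses as the original pattern, but I have only spot-checked that, and it makes no attempt to be a complete email validator.

```python
import re

# Hypothetical alternative: runs of alphanumerics joined by single separators,
# then "@", then one or more dot-terminated labels and a final alphanumeric run.
SAFER = re.compile(r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*@(?:[\w-]+\.)+[A-Za-z0-9]+")

def looks_like_email(text):
    """Return True if the whole string matches the sketch pattern above."""
    return SAFER.fullmatch(text) is not None

print(looks_like_email("ADU5dX-532.dx_aerx@gcore..com"))  # the failing test case
print(looks_like_email("ADU5dX-532.dx_aerx@gcore.com"))   # a valid variant
print(looks_like_email("a" * 5000 + "@gcore..com"))       # a long hostile input
```

The non-capturing groups (?:...) are a deliberate choice: the sketch only needs a yes or no answer, so there is no reason to record what each group matched.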