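Pattern Modifiers
The currently available PCRE modifiers are listed below. The names in parentheses are the internal PCRE names for these modifiers. Spaces and newlines in the modifier string are ignored; any other character causes an error.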
i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower case letters.
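A minimal sketch of the effect of the i modifier. The subject string 'PHP manual' and the variable names are made up for illustration.

```php
<?php
// Without the i modifier the match is case-sensitive and fails.
$caseSensitive = preg_match('/php/', 'PHP manual');

// With the i modifier the letters match regardless of case.
$caseInsensitive = preg_match('/php/i', 'PHP manual');

var_dump($caseSensitive, $caseInsensitive);
```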
m (PCRE_MULTILINE)
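By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as Perl. When this modifier is set, the "start of line" and "end of line" constructs match immediately after or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in the subject string, or no occurrences of ^ or $ in the pattern, setting this modifier has no effect.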
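As a rough illustration of the difference, the sketch below counts how many words are found at a "start of line" with and without the m modifier. The two-line subject string is an assumption chosen for the example.

```php
<?php
$subject = "first line\nsecond line";

// Without m, ^ matches only at the very start of the subject.
$countWithoutM = preg_match_all('/^\w+/', $subject, $withoutM);

// With m, ^ also matches immediately after each newline.
$countWithM = preg_match_all('/^\w+/m', $subject, $withM);

print_r($withoutM[0]);
print_r($withM[0]);
```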
s (PCRE_DOTALL)
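If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, the dot does not match newlines. This modifier is equivalent to Perl's /s modifier. A negated character class such as [^a] always matches a newline character, regardless of the setting of this modifier.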
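The following sketch shows the dot with and without the s modifier, and a negated class that matches the newline either way. The HTML-like subject string is invented for the example.

```php
<?php
$subject = "<p>one\ntwo</p>";

// Without s, the dot stops at the newline, so the pattern cannot match.
$withoutS = preg_match('/<p>(.*)<\/p>/', $subject);

// With s, the dot also matches the newline.
$withS = preg_match('/<p>(.*)<\/p>/s', $subject, $dotAll);

// A negated class matches the newline regardless of the s modifier.
$negated = preg_match('/<p>([^<]*)<\/p>/', $subject, $negatedMatch);

var_dump($withoutS, $withS, $negated);
```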
x (PCRE_EXTENDED)
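If this modifier is set, whitespace data characters in the pattern are ignored unless they are escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline are also ignored. This is equivalent to Perl's /x modifier and makes it possible to include comments inside a compiled pattern. Note, however, that this applies only to data characters. Whitespace may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern. (Translator's note: whitespace inside a special sequence defined by this syntax causes a compilation error, for example a space between the characters of (?( .)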
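A small sketch of a commented pattern using the x modifier. The date format and the sample strings are assumptions for illustration; note that the comments deliberately avoid the / delimiter.

```php
<?php
$pattern = '/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x';

$matched = preg_match($pattern, '2024-01-31', $parts);

// Unescaped whitespace outside a class is ignored, so "a b" matches "ab".
$ignored = preg_match('/a b/x', 'ab');

// Whitespace inside a character class is still significant.
$kept = preg_match('/a[ ]b/x', 'a b');

print_r($parts);
```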
e (PREG_REPLACE_EVAL)
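This feature has been deprecated as of PHP 5.5.0 and removed as of PHP 7.0.0. Relying on it is strongly discouraged.
If this deprecated modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code (in the manner of eval), and uses the value of that evaluation as the actual replacement text. Single quotes, double quotes, backslashes (\) and NULL characters are escaped with backslashes in the substituted backreferences.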
The addslashes() function is run on each matched backreference before the substitution takes place. As such, when the backreference is used as a quoted string, escaped characters will be converted to literals. However, characters which are escaped, which would normally not be converted, will retain their slashes. This makes use of this modifier very complicated.
Make sure that the replacement parameter is a valid PHP code string, otherwise PHP will report a parse error on the line containing the preg_replace() call.
Use of this modifier is discouraged, as it can easily introduce security vulnerabilities:
<?php
$html = $_POST['html'];
// uppercase headings
$html = preg_replace(
'(<h([1-6])>(.*?)</h\1>)e',
'"<h$1>" . strtoupper("$2") . "</h$1>"',
$html
);
The above example code can easily be exploited by passing in a string such as <h1>{${eval($_GET[php_code])}}</h1>. This lets the attacker execute arbitrary PHP code and as such gives them nearly complete access to your server.
To prevent this kind of remote code execution vulnerability the preg_replace_callback() function should be used instead:
<?php
$html = $_POST['html'];
// uppercase headings
$html = preg_replace_callback(
'(<h([1-6])>(.*?)</h\1>)',
function ($m) {
return "<h$m[1]>" . strtoupper($m[2]) . "</h$m[1]>";
},
$html
);
Note:
Only preg_replace() uses this modifier; it is ignored by other PCRE functions.
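A (PCRE_ANCHORED)
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string. This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if the m modifier is set. There is no equivalent to this modifier in Perl.
S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up matching. If this modifier is set, this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.
U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers, so that they are not greedy by default but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern, or a single quantifier can be made non-greedy with a trailing question mark (e.g. .*?).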
Note:
It is usually not possible to match more than pcre.backtrack_limit characters in ungreedy mode.
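The A, D and U modifiers described above can be compared side by side in a short sketch. All subject strings and variable names here are assumptions chosen to make the differences visible.

```php
<?php
// A: the match must begin at the start of the subject.
$unanchored  = preg_match('/bar/', 'foobar');    // finds "bar" at offset 3
$anchoredBar = preg_match('/bar/A', 'foobar');   // "bar" is not at offset 0
$anchoredFoo = preg_match('/foo/A', 'foobar');   // "foo" is at offset 0

// D: $ no longer matches before a trailing newline.
$dollarDefault = preg_match('/^abc$/', "abc\n");
$dollarEndOnly = preg_match('/^abc$/D', "abc\n");

// U: quantifiers become non-greedy, and a trailing ? makes them greedy again.
preg_match('/<.+>/', '<a><b>', $greedy);
preg_match('/<.+>/U', '<a><b>', $ungreedy);
preg_match('/<.+?>/U', '<a><b>', $greedyAgain);

var_dump($greedy[0], $ungreedy[0], $greedyAgain[0]);
```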
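X (PCRE_EXTRA)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.
J (PCRE_INFO_JCHANGED)
The (?J) internal option setting changes the local PCRE_DUPNAMES option, which allows duplicate names for subpatterns. (Translator's note: this can only be set through the internal option; an external /J modifier produces an error.)
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.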
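A minimal sketch of the (?J) internal option, assuming a reasonably recent PHP version. The group name num and the CSS-like subject string are hypothetical; the @ operator is used only to keep the expected compilation warning out of the output.

```php
<?php
// With (?J), two subpatterns may share the name "num".
$pattern = '/(?J)(?<num>\d+)px|(?<num>\d+)em/';
$matched = preg_match($pattern, '12em', $m);

// Without (?J), the duplicate name makes compilation fail:
// preg_match() emits a warning and returns false.
$withoutJ = @preg_match('/(?<num>\d+)px|(?<num>\d+)em/', '12em');

var_dump($matched, $m['num'], $withoutJ);
```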

User Contributed Notes 6 notes
Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of:
1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").
2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically results in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8
3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )
4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/
The following script should give you an idea of what works and what doesn't:
<?php
$examples = array(
'Valid ASCII' => "a",
'Valid 2 Octet Sequence' => "\xc3\xb1",
'Invalid 2 Octet Sequence' => "\xc3\x28",
'Invalid Sequence Identifier' => "\xa0\xa1",
'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);
echo "++Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
echo "$name\n";
preg_match("/".$str."/u",'Testing');
}
echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {
preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
echo "$name: ";
if ( count($ar) == 0 ) {
echo "Matched nothing!\n";
} else {
echo "Matched {$ar[0]}\n";
}
}
echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
preg_match_all('/./u', $str, $ar);
echo "$name: ";
$num_utf8_chars = count($ar[0]);
if ( $num_utf8_chars == 0 ) {
echo "Matched nothing!\n";
} else {
echo "Matched $num_utf8_chars character(s)\n";
}
}
?>
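The "quiet death" from point 2 can be told apart from an ordinary non-match by checking the return value and preg_last_error(). This is only a sketch, assuming a PHP version that provides preg_last_error() and the PREG_BAD_UTF8_ERROR constant; the byte string is the invalid 2 octet sequence from the script above.

```php
<?php
$invalid = "\xc3\x28";  // invalid 2 octet sequence

// With /u, an invalid UTF-8 subject makes preg_match() return false
// rather than 0, and the reason is recorded for preg_last_error().
$result = preg_match('/./u', $invalid);
$error  = preg_last_error();

if ($result === false && $error === PREG_BAD_UTF8_ERROR) {
    echo "Subject is not valid UTF-8\n";
}
```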
The description of the "u" flag is a bit misleading. It suggests that it is only required if the pattern contains UTF-8 characters, when in fact it is required if either the pattern or the subject contain UTF-8. Without it, I was having problems with preg_match_all returning invalid multibyte characters when given a UTF-8 subject string.
It's fairly clear if you read the documentation for libpcre:
In order process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters.
[from http://www.pcre.org/pcre.txt]
If the _subject_ contains utf-8 sequences the 'u' modifier should be set, otherwise a pattern such as /./ could match a utf-8 sequence as two to four individual bytes rather than as one character. It is not a requirement, however, as you may have a need to break apart utf-8 sequences into single bytes. Most of the time, though, if you're working with utf-8 strings you should use the 'u' modifier.
If the subject doesn't contain any utf-8 sequences (i.e. characters in the range 0x00-0x7F only) but the pattern does, as far as I can work out, setting the 'u' modifier would have no effect on the result.
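To make the byte versus character difference concrete, here is a small sketch that runs /./ over a single two-byte UTF-8 character with and without the 'u' modifier. The subject is written with byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
$subject = "\xc3\xb1";  // the UTF-8 encoding of U+00F1, two bytes

// Without u, the dot matches one byte at a time.
$byteCount = preg_match_all('/./', $subject, $bytes);

// With u, the dot matches one whole UTF-8 character.
$charCount = preg_match_all('/./u', $subject, $chars);

var_dump($byteCount, $charCount);
```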
In case you're wondering what the "S" modifier actually does, this might be useful:
When the "S" modifier is set, PHP calls the pcre_study() function from the PCRE API before executing the regexp. The result of that function is passed directly to pcre_exec().
For more information about pcre_study() and "Studying the pattern", check the PCRE manual at http://www.pcre.org/pcre.txt
PS: Note that the function names "pcre_study" and "pcre_exec" used here refer to PCRE library functions written in C, not to any PHP functions.
I spent a few days trying to work out how to write a pattern for Unicode chars using their hex codes. I finally made it after reading several manuals that gave no practical, PHP-valid examples, so here is one:
For example, say we would like to search for the circled numbers 1-9 (Unicode code points 0x2460-0x2468). To do that with hex codes, the following call should be used:
preg_match('/[\x{2460}-\x{2468}]/u', $str);
Here $str is the haystack string,
\x{hex} is a Unicode code point written in hexadecimal (not its UTF-8 byte sequence),
and /u makes PCRE treat the pattern and subject as UTF-8, so the class is read as a range of Unicode chars.
Hope it'll be useful.
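A runnable version of the call above, as a sketch. It assumes PHP 7.0 or later for the "\u{...}" string escape, and the sample sentence is made up. The last call shows what happens if /u is left out; the @ operator only hides the expected compilation warning.

```php
<?php
$str = "Step \u{2461}: mix";  // contains the circled digit two (U+2461)

// With /u, \x{2460}-\x{2468} is a range of Unicode code points.
$found = preg_match('/[\x{2460}-\x{2468}]/u', $str, $m);

// Without /u, code points above 0xFF are rejected when the pattern is
// compiled, so preg_match() returns false.
$withoutU = @preg_match('/[\x{2460}-\x{2468}]/', $str);

var_dump($found, $m[0] === "\u{2461}", $withoutU);
```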
When adding comments with the /x modifier, don't use the pattern delimiter in the comments. It may not be ignored in the comments area. Example:
<?php
$target = 'some text';
if(preg_match('/
e # Comments here
/x',$target)) {
print "Target 1 hit.\n";
}
if(preg_match('/
e # /Comments here with slash
/x',$target)) {
print "Target 2 hit.\n";
}
?>
prints "Target 1 hit." but then generates a PHP warning message for the second preg_match():
Warning: preg_match() [function.preg-match]: Unknown modifier 'C' in /ebarnard/x-modifier.php on line 11
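One way around this, sketched below, is to pick a delimiter that does not occur in the comments. The ~ delimiter is just one possible choice; the pattern and target are the ones from the example above.

```php
<?php
$target = 'some text';

// With ~ as the delimiter, a slash inside an x-mode comment is harmless.
$hit = preg_match('~
e # /Comments here with slash
~x', $target);

if ($hit) {
    print "Target 2 hit.\n";
}
```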