Original article: https://pythoncaff.com/docs/pymotw/shlex-parse-shell-style-syntaxes/171
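Purpose: Lexical analysis of shell-style syntaxes.
The shlex module implements a class for parsing simple shell-like syntaxes. It can be used for writing domain-specific languages, or for parsing quoted strings (a task that is more complex than it seems on the surface).
Parsing Quoted Strings##
A common problem when working with input text is identifying a sequence of quoted words and treating it as a single entity. Splitting the text on quotes does not always work as expected, especially when there are nested levels of quotes. Take the following text as an example: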
This string has embedded "double quotes" and 'single quotes' in it, and even "a 'nested example'".
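A naive approach is to construct a regular expression that finds the parts of the text outside the quotes and separates them from the text inside the quotes, or vice versa. That is tedious to implement and prone to errors, because apostrophes are easily confused with single quotes and the input may contain typos. A better solution is to use a true parser, such as the one provided by the shlex module. Here is a program that uses the shlex class to identify the tokens in an input file and print them: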
shlex_example.py
import shlex
import sys

if len(sys.argv) != 2:
    print('Please specify one filename on the command line.')
    sys.exit(1)

filename = sys.argv[1]
with open(filename, 'r') as f:
    body = f.read()
print('ORIGINAL: {!r}'.format(body))
print()

print('TOKENS:')
lexer = shlex.shlex(body)
for token in lexer:
    print('{!r}'.format(token))
When run on data with embedded quotes, the parser produces the expected list of tokens.
$ python3 shlex_example.py quotes.txt
ORIGINAL: 'This string has embedded "double quotes" and\n\'single quotes\' in it, and even "a \'nested example\'".\n'

TOKENS:
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
','
'and'
'even'
'"a \'nested example\'"'
'.'
Isolated quotes such as apostrophes are handled the same way. Consider this input text:
This string has an embedded apostrophe, doesn't it?
The token with the embedded apostrophe is identified correctly.
$ python3 shlex_example.py apostrophe.txt
ORIGINAL: "This string has an embedded apostrophe, doesn't it?"

TOKENS:
'This'
'string'
'has'
'an'
'embedded'
'apostrophe'
','
"doesn't"
'it'
'?'
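To make the contrast with naive splitting concrete, here is a minimal sketch that runs plain whitespace splitting and the shlex class over the same nested-quote fragment. The sample text and variable names are illustrative assumptions, not one of the article's data files.

```python
import shlex

# Illustrative input (an assumption, not one of the article's data files).
text = 'and even "a \'nested example\'" here'

# Naive approach: splitting on whitespace breaks the quoted phrase apart.
naive_tokens = text.split()

# shlex keeps the quoted phrase together. In the default non-POSIX mode
# the surrounding quotes are retained as part of the token.
shlex_tokens = list(shlex.shlex(text))

print(naive_tokens)
print(shlex_tokens)
```

The naive result scatters the quoted phrase across three tokens, while the parser returns it as one.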
Making Safe Strings for Shells##
The quote() function performs the inverse operation, escaping existing quotes and adding missing quotes for strings to make them safe to use in shell commands.
shlex_quote.py
import shlex

examples = [
    "Embedded'SingleQuote",
    'Embedded"DoubleQuote',
    'Embedded Space',
    '~SpecialCharacter',
    r'Back\slash',
]

for s in examples:
    print('ORIGINAL : {}'.format(s))
    print('QUOTED   : {}'.format(shlex.quote(s)))
    print()
It is still usually safer to use a list of arguments when using subprocess.Popen, but in situations where that is not possible quote() provides some protection by ensuring that special characters and white space are quoted properly.
$ python3 shlex_quote.py
ORIGINAL : Embedded'SingleQuote
QUOTED   : 'Embedded'"'"'SingleQuote'

ORIGINAL : Embedded"DoubleQuote
QUOTED   : 'Embedded"DoubleQuote'

ORIGINAL : Embedded Space
QUOTED   : 'Embedded Space'

ORIGINAL : ~SpecialCharacter
QUOTED   : '~SpecialCharacter'

ORIGINAL : Back\slash
QUOTED   : 'Back\slash'
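As a rough illustration of how quote() fits into building a command string, the sketch below quotes a hypothetical untrusted filename and then parses the resulting command with shlex.split() to confirm that the value survives as a single argument. The filename and the ls command are assumptions chosen only for demonstration. Nothing is executed.

```python
import shlex

# Hypothetical untrusted value, for example a filename supplied by a user.
filename = "my file; rm -rf ~.txt"

# quote() wraps the value in single quotes so a shell would treat it
# as one literal argument instead of interpreting ';' or '~'.
command = 'ls -l {}'.format(shlex.quote(filename))
print(command)

# Parsing the command with POSIX rules recovers the original value intact.
args = shlex.split(command)
print(args)
```

The round trip through split() is a convenient way to check quoting logic without running anything in a real shell.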
Embedded Comments##
Since the parser is intended to be used with command languages, it needs to handle comments. By default, any text following a # is considered part of a comment and ignored. Due to the nature of the parser, only single-character comment prefixes are supported. The set of comment characters used can be configured through the commenters property.
$ python3 shlex_example.py comments.txt
ORIGINAL: 'This line is recognized.\n# But this line is ignored.\nAnd this line is processed.'

TOKENS:
'This'
'line'
'is'
'recognized'
'.'
'And'
'this'
'line'
'is'
'processed'
'.'
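The commenters property can be changed on a lexer instance before reading tokens. The following is a small sketch, with an input string made up for illustration, that switches the comment prefix from # to a semicolon. Once # is no longer a comment character, the default non-POSIX lexer returns it as an ordinary one-character token.

```python
import shlex

# Made-up input: ';' starts a comment here, '#' is ordinary text.
text = 'keep this ; drop this\nand # keep hash'

lexer = shlex.shlex(text)
lexer.commenters = ';'  # replace the default '#' comment prefix

tokens = list(lexer)
print(tokens)
```

Everything from the semicolon to the end of that line is skipped, and parsing resumes on the next line.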
Splitting Strings into Tokens##
To split an existing string into component tokens, use the convenience function split(), which is a simple wrapper around the parser.
shlex_split.py
import shlex

text = """This text has "quoted parts" inside it."""
print('ORIGINAL: {!r}'.format(text))
print()

print('TOKENS:')
print(shlex.split(text))
The result is a list.
$ python3 shlex_split.py
ORIGINAL: 'This text has "quoted parts" inside it.'

TOKENS:
['This', 'text', 'has', 'quoted parts', 'inside', 'it.']
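A common use of split() is turning a command line typed as one string into an argument list. The sketch below compares it with str.split() on a made-up grep command. The command string is only an example and is not executed.

```python
import shlex

# Made-up command line; it is only parsed here, never run.
command_line = 'grep -r "hello world" ./src'

# str.split() cuts the quoted argument in two and keeps the quote marks.
plain = command_line.split()

# shlex.split() honours the quotes and strips them from the result.
parsed = shlex.split(command_line)

print(plain)
print(parsed)
```

The list produced by shlex.split() is in the form expected by APIs that take an argument list, such as subprocess.Popen.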
Including Other Sources of Tokens##
The shlex class includes several configuration properties that control its behavior. The source property enables a feature for code (or configuration) re-use by allowing one token stream to include another. This is similar to the Bourne shell source operator, hence the name.
shlex_source.py
import shlex

text = "This text says to source quotes.txt before continuing."
print('ORIGINAL: {!r}'.format(text))
print()

lexer = shlex.shlex(text)
lexer.wordchars += '.'
lexer.source = 'source'

print('TOKENS:')
for token in lexer:
    print('{!r}'.format(token))
The string "source quotes.txt" in the original text receives special handling. Since the source property of the lexer is set to "source", when the keyword is encountered, the filename given by the next token is automatically included. In order to cause the filename to appear as a single token, the . character needs to be added to the list of characters that are included in words (otherwise "quotes.txt" becomes three tokens, "quotes", ".", "txt"). This is what the output looks like.
$ python3 shlex_source.py
ORIGINAL: 'This text says to source quotes.txt before continuing.'

TOKENS:
'This'
'text'
'says'
'to'
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
','
'and'
'even'
'"a \'nested example\'"'
'.'
'before'
'continuing.'
The source feature uses a method called sourcehook() to load the additional input source, so a subclass of shlex can provide an alternate implementation that loads data from locations other than files.
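One possible shape for such a subclass is sketched below. It assumes the included text lives in an in-memory dictionary. The names SNIPPETS and DictSourceLexer are hypothetical and exist only for this illustration. The hook returns a (name, stream) pair, which is the form the standard implementation of sourcehook() returns.

```python
import io
import shlex

# Hypothetical in-memory "files", keyed by the name used after `source`.
SNIPPETS = {
    'greeting': 'hello "big world"',
}


class DictSourceLexer(shlex.shlex):
    """Lexer that resolves `source NAME` from SNIPPETS instead of disk."""

    def sourcehook(self, newfile):
        # The name token may arrive wrapped in double quotes; strip them.
        name = newfile.strip('"')
        # Return the (name, open stream) pair the lexer expects.
        return name, io.StringIO(SNIPPETS[name])


lexer = DictSourceLexer('say source greeting now')
lexer.source = 'source'

tokens = list(lexer)
print(tokens)
```

The tokens from the snippet are spliced in where the source keyword and its argument appeared, and parsing of the outer text resumes afterwards.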
Controlling the Parser##
An earlier example demonstrated changing the wordchars value to control which characters are included in words. It is also possible to set the quotes character to use additional or alternative quotes. Each quote must be a single character, so it is not possible to have different open and close quotes (no parsing on parentheses, for example).
shlex_table.py
import shlex

text = """|Col 1||Col 2||Col 3|"""
print('ORIGINAL: {!r}'.format(text))
print()

lexer = shlex.shlex(text)
lexer.quotes = '|'

print('TOKENS:')
for token in lexer:
    print('{!r}'.format(token))
In this example, each table cell is wrapped in vertical bars.
$ python3 shlex_table.py
ORIGINAL: '|Col 1||Col 2||Col 3|'

TOKENS:
'|Col 1|'
'|Col 2|'
'|Col 3|'
It is also possible to control the whitespace characters used to split words.
shlex_whitespace.py
import shlex
import sys

if len(sys.argv) != 2:
    print('Please specify one filename on the command line.')
    sys.exit(1)

filename = sys.argv[1]
with open(filename, 'r') as f:
    body = f.read()
print('ORIGINAL: {!r}'.format(body))
print()

print('TOKENS:')
lexer = shlex.shlex(body)
lexer.whitespace += '.,'
for token in lexer:
    print('{!r}'.format(token))
If the example in shlex_example.py is modified to treat the period and comma as whitespace, the results change: those characters no longer appear as tokens.
$ python3 shlex_whitespace.py quotes.txt
ORIGINAL: 'This string has embedded "double quotes" and\n\'single quotes\' in it, and even "a \'nested example\'".\n'

TOKENS:
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
'and'
'even'
'"a \'nested example\'"'
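The effect of wordchars is easy to see in isolation. The sketch below, using a made-up phrase, tokenizes the same text twice: once with the default settings and once after adding the hyphen to wordchars so that hyphenated words stay together.

```python
import shlex

# Made-up phrase containing hyphenated words.
text = 'well-known long-term plan'

# Default settings: '-' is not a word character, so it becomes its own token.
default_tokens = list(shlex.shlex(text))

# Adding '-' to wordchars keeps hyphenated words in one piece.
lexer = shlex.shlex(text)
lexer.wordchars += '-'
hyphen_tokens = list(lexer)

print(default_tokens)
print(hyphen_tokens)
```

The same pattern applies to whitespace and quotes: adjust the attribute on the lexer instance before reading tokens.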
Error Handling##
When the parser encounters the end of its input before all quoted strings are closed, it raises ValueError. When that happens, it is useful to examine some of the properties maintained by the parser as it processes the input. For example, infile refers to the name of the file being processed (which might be different from the original file, if one file sources another). The lineno attribute reports the line at which the error is discovered, which is typically the end of the file and may be far away from the first quote. The token attribute contains the buffer of text not already included in a valid token. The error_leader() method produces a message prefix in a style similar to Unix compilers, which enables editors such as emacs to parse the error and take the user directly to the invalid line.
shlex_errors.py
import shlex

text = """This line is ok.
This line has an "unfinished quote.
This line is ok, too.
"""

print('ORIGINAL: {!r}'.format(text))
print()

lexer = shlex.shlex(text)

print('TOKENS:')
try:
    for token in lexer:
        print('{!r}'.format(token))
except ValueError as err:
    first_line_of_error = lexer.token.splitlines()[0]
    print('ERROR: {} {}'.format(lexer.error_leader(), err))
    print('following {!r}'.format(first_line_of_error))
The example produces this output.
$ python3 shlex_errors.py
ORIGINAL: 'This line is ok.\nThis line has an "unfinished quote.\nThis line is ok, too.\n'

TOKENS:
'This'
'line'
'is'
'ok'
'.'
'This'
'line'
'has'
'an'
ERROR: "None", line 4: No closing quotation
following '"unfinished quote.'
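The same ValueError is raised by the split() convenience function, so code that parses user input usually wants to catch it. Below is a minimal sketch of one way to do that. The helper name try_split and its (tokens, error) return convention are assumptions made for this example, not part of the shlex API.

```python
import shlex


def try_split(text):
    """Return (tokens, None) on success or (None, message) on a parse error.

    Hypothetical helper for illustration; not part of shlex itself.
    """
    try:
        return shlex.split(text), None
    except ValueError as err:
        # Raised, for example, when a quoted string is never closed.
        return None, str(err)


good = try_split('echo "all fine"')
bad = try_split('echo "never closed')

print(good)
print(bad)
```

Returning the message instead of letting the exception propagate makes it simple to report the problem back to whoever supplied the input.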
POSIX vs. Non-POSIX Parsing##
The default behavior for the parser is to use a backwards-compatible style that is not POSIX-compliant. For POSIX behavior, set the posix argument when constructing the parser.
shlex_posix.py
import shlex

# Raw strings are used where a backslash precedes a character that is not
# a valid Python escape, so the literals do not trigger escape warnings.
examples = [
    'Do"Not"Separate',
    '"Do"Separate',
    r'Escaped \e Character not in quotes',
    r'Escaped "\e" Character in double quotes',
    r"Escaped '\e' Character in single quotes",
    r"Escaped '\'' \"\'\" single quote",
    r'Escaped "\"" \'\"\' double quote',
    "\"'Strip extra layer of quotes'\"",
]

for s in examples:
    print('ORIGINAL : {!r}'.format(s))
    print('non-POSIX: ', end='')

    non_posix_lexer = shlex.shlex(s, posix=False)
    try:
        print('{!r}'.format(list(non_posix_lexer)))
    except ValueError as err:
        print('error({})'.format(err))

    print('POSIX    : ', end='')
    posix_lexer = shlex.shlex(s, posix=True)
    try:
        print('{!r}'.format(list(posix_lexer)))
    except ValueError as err:
        print('error({})'.format(err))

    print()
Here are a few examples of the differences in parsing behavior.
$ python3 shlex_posix.py
ORIGINAL : 'Do"Not"Separate'
non-POSIX: ['Do"Not"Separate']
POSIX    : ['DoNotSeparate']

ORIGINAL : '"Do"Separate'
non-POSIX: ['"Do"', 'Separate']
POSIX    : ['DoSeparate']

ORIGINAL : 'Escaped \\e Character not in quotes'
non-POSIX: ['Escaped', '\\', 'e', 'Character', 'not', 'in', 'quotes']
POSIX    : ['Escaped', 'e', 'Character', 'not', 'in', 'quotes']

ORIGINAL : 'Escaped "\\e" Character in double quotes'
non-POSIX: ['Escaped', '"\\e"', 'Character', 'in', 'double', 'quotes']
POSIX    : ['Escaped', '\\e', 'Character', 'in', 'double', 'quotes']

ORIGINAL : "Escaped '\\e' Character in single quotes"
non-POSIX: ['Escaped', "'\\e'", 'Character', 'in', 'single', 'quotes']
POSIX    : ['Escaped', '\\e', 'Character', 'in', 'single', 'quotes']

ORIGINAL : 'Escaped \'\\\'\' \\"\\\'\\" single quote'
non-POSIX: error(No closing quotation)
POSIX    : ['Escaped', '\\ \\"\\"', 'single', 'quote']

ORIGINAL : 'Escaped "\\"" \\\'\\"\\\' double quote'
non-POSIX: error(No closing quotation)
POSIX    : ['Escaped', '"', '\'"\'', 'double', 'quote']

ORIGINAL : '"\'Strip extra layer of quotes\'"'
non-POSIX: ['"\'Strip extra layer of quotes\'"']
POSIX    : ["'Strip extra layer of quotes'"]
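The split() function accepts the same posix flag, although its default is the reverse of the class: it uses POSIX rules unless told otherwise. The short sketch below, with a made-up input string, shows the most visible difference, which is whether the surrounding quotes are kept in the token.

```python
import shlex

# Made-up input with one double-quoted argument.
text = 'copy "My Documents" backup'

# POSIX mode (the default for split()) strips the surrounding quotes.
posix_tokens = shlex.split(text, posix=True)

# Non-POSIX mode keeps the quotes as part of the token.
non_posix_tokens = shlex.split(text, posix=False)

print(posix_tokens)
print(non_posix_tokens)
```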
See also#
- Standard library documentation for shlex
- cmd -- Tools for building interactive command interpreters.
- argparse -- Command line option parsing.
- subprocess -- Run commands after parsing the command line.