Python String and Text Operations

Reposted article. 2016-01-28 15:20

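1. Splitting a string into fields when the delimiters are not fixed (for example, a varying number of spaces)

In this case the split() method of string objects is not enough; use the more flexible re.split() instead.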

>>> line = 'adaead jilil; sese, lsls,aea, foo'
>>> import re
>>> re.split(r'[;,\s]\s*',line)
['adaead', 'jilil', 'sese', 'lsls', 'aea', 'foo']
>>>

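Here \s matches any whitespace character (\S is its complement) and * means zero or more repetitions. The pattern therefore matches a comma, a semicolon, or a whitespace character, followed by any amount of additional whitespace. The result is a list, the same return type as str.split().

When you use re.split() and the regular expression contains a capture group in parentheses, the matched text (that is, the delimiters) also appears in the result list: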

>>> fields = re.split(r'(;|,|\s)\s*',line)
>>> fields
['adaead', ' ', 'jilil', ';', 'sese', ',', 'lsls', ',', 'aea', ',', 'foo']
>>>

Capturing the delimiters is useful in some cases, for example to rebuild an output string:

>>> values = fields[::2]
>>> delimiters = fields[1::2] + ['']
>>> values
['adaead', 'jilil', 'sese', 'lsls', 'aea', 'foo']
>>> delimiters
[' ', ';', ',', ',', ',', '']
>>> line
'adaead jilil; sese, lsls,aea, foo'
>>> ''.join(v+d for v,d in zip(values,delimiters))
'adaead jilil;sese,lsls,aea,foo'
>>>

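The slices above use a step of 2 to separate the values from the delimiters.

If you need parentheses for grouping but do not want the delimiters in the result, use a non-capturing group of the form (?:...):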

>>> line
'adaead jilil; sese, lsls,aea, foo'
>>> re.split(r'(?:,|;|\s)\s*',line)
['adaead', 'jilil', 'sese', 'lsls', 'aea', 'foo']
>>>
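
The two forms of re.split() shown above can be wrapped in small helpers. This is a minimal sketch; the names split_fields and split_with_delimiters are illustrative and do not come from the original article.

```python
import re

# A comma, semicolon, or whitespace character, then optional extra whitespace.
_DELIMS = r'[;,\s]\s*'

def split_fields(line):
    """Return only the fields (hypothetical helper)."""
    return re.split(_DELIMS, line)

def split_with_delimiters(line):
    """Return (fields, delimiters); the capture group keeps each delimiter."""
    parts = re.split(r'([;,\s])\s*', line)
    # Even positions hold fields, odd positions hold the captured delimiters.
    return parts[::2], parts[1::2]

line = 'adaead jilil; sese, lsls,aea, foo'
fields, delims = split_with_delimiters(line)
print(split_fields(line))
print(delims)
```

Note that a line starting or ending with a delimiter yields empty strings in the field list, so callers may want to filter those out.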

2. Matching text at the start or end of a string

The simplest approach is the str.startswith() or str.endswith() method:

>>> import os
>>> files = os.listdir('./')
>>> if any(filename.endswith('.py') for filename in files):
...     print('There is a Python file.')
... else:
...     print('There is no Python file.')
... 
There is a Python file.
>>> files
['tsTserv.py', 'locked_account.txt']
>>> 
>>> [filename for filename in files if filename.endswith(('.py','.txt'))]
['tsTserv.py', 'locked_account.txt']
>>>

The following example shows that several choices must be passed as a tuple; passing a list raises an error:

>>> from urllib.request import urlopen
>>> def read_data(name):
...     if name.startswith(('http:','https:','ftp:')):
...         return urlopen(name).read()
...     else:
...         with open(name) as f:
...             return f.read()
... 
>>> data = read_data('http://www.baidu.com')  # the page content, as bytes
>>> choices = ['http:','ftp:']
>>> url = 'http://www.python.org'
>>> url.startswith(choices)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not list
>>> 
>>> url.startswith(tuple(choices))
True
>>>

Other ways to match the start and end of a string:

>>> filename = 'helloworld.py'
>>> filename[-3:] == '.py'
True
>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True
>>> 
>>> import re
>>> url
'http://www.python.org'
>>> re.match('http:|https:|ftp:',url)
<_sre.SRE_Match object; span=(0, 5), match='http:'>
>>>
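
Because startswith() and endswith() reject lists, it can help to convert the choices to a tuple in one place. The sketch below assumes the choices arrive as a list or tuple of strings; has_scheme and filter_by_suffix are hypothetical helper names.

```python
def has_scheme(url, schemes=('http:', 'https:', 'ftp:')):
    """Return True if url starts with one of the given prefixes.

    tuple(schemes) lets callers pass a list; schemes must be a list or
    tuple of strings, not a single string.
    """
    return url.startswith(tuple(schemes))

def filter_by_suffix(names, suffixes):
    """Return the names that end with any of the given suffixes."""
    return [name for name in names if name.endswith(tuple(suffixes))]

print(has_scheme('http://www.python.org', ['http:', 'ftp:']))
print(filter_by_suffix(['tsTserv.py', 'notes.txt', 'a.out'], ['.py', '.txt']))
```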

3. Matching and searching for text

For simple literals, use str.find(), str.startswith(), or str.endswith(); for more complex patterns, use the re module.

>>> text = 'yes no aggree not aggree'
>>> text1 = '2016-01-31'
>>> text2 = 'jan 31,2016'
>>> text == 'yes'
False
>>> text.find('no')
4
>>> if re.match(r'\d+-\d+-\d+',text1):
...     print('yes')
... else:
...     print('no')
... 
yes
>>> if re.match(r'\d+-\d+-\d+',text2):
...     print('yes')
... else:
...     print('no')
... 
no
>>> # Compile the pattern first when it will be matched many times
>>> datepatt = re.compile(r'\d+-\d+-\d+')
>>> if datepatt.match(text1):
...     print('yes')
... else:
...     print('no')
... 
yes
>>> datepatt.match(text2)
>>> print(datepatt.match(text2))
None
>>> print(datepatt.match(text1))
<_sre.SRE_Match object; span=(0, 10), match='2016-01-31'>
>>>

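match() always tries to match at the start of the string and returns as soon as it matches; findall() returns every match found in the text.

When defining a regular expression, it is common to use capture groups, for example:

datepat = re.compile(r'(\d+)-(\d+)-(\d+)')

Capture groups simplify later processing, because the content of each group can be extracted separately.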

>>> datepat = re.compile(r'(\d+)-(\d+)-(\d+)')
>>> m = datepat.match('2016-01-28')
>>> m
<_sre.SRE_Match object; span=(0, 10), match='2016-01-28'>
>>> m.group(0)
'2016-01-28'
>>> m.group(1)
'2016'
>>> m.group(2)
'01'
>>> m.group(3)
'28'
>>> m.groups()
('2016', '01', '28')
>>> year,month,day = m.groups()
>>> print(year,month,day)
2016 01 28
>>> text = 'today is 2016-01-28. lesson start 2016-01-01'
>>> datepat.findall(text)
[('2016', '01', '28'), ('2016', '01', '01')]
>>> for year,month,day in datepat.findall(text):
...     print('{}-{}-{}'.format(year,month,day))
... 
2016-01-28
2016-01-01
>>>

findall() searches the text and returns all matches as a list. To receive the matches one at a time instead, use finditer():

>>> datepat.findall(text)
[('2016', '01', '28'), ('2016', '01', '01')]
>>> for m in datepat.finditer(text):
...     print(m.groups())
... 
('2016', '01', '28')
('2016', '01', '01')
>>>
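
One possible way to combine capture groups with finditer() is a generator that yields parsed dates. This is a sketch only: the named groups and the iter_dates function are my own additions, not part of the original article.

```python
import re

# Named groups make the extraction code self-describing.
DATE = re.compile(r'(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)')

def iter_dates(text):
    """Yield (year, month, day) integer tuples for each date found in text."""
    for m in DATE.finditer(text):
        yield int(m.group('year')), int(m.group('month')), int(m.group('day'))

text = 'today is 2016-01-28. lesson start 2016-01-01'
print(list(iter_dates(text)))
```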

4. Searching and replacing text

To find a pattern in a string and replace it, use str.replace() for simple cases and re.sub() for more complex ones.

>>> text
'today is 2016-01-28. lesson start 2016-01-01'
>>> re.sub(r'(\d+)-(\d+)-(\d+)',r'\2/\3/\1',text)
'today is 01/28/2016. lesson start 01/01/2016'
>>>

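The first argument to sub() is the pattern to match and the second is the replacement pattern. A backslash followed by a digit, such as \3, refers to the capture group with that number in the pattern. If you will perform the same substitution many times, compile the pattern first for better performance.

For more complex replacements, pass a callback function instead. Its argument is a match object, the same kind of object returned by match() or search(). To find out how many substitutions were made, use re.subn():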

>>> datepat = re.compile(r'(\d+)-(\d+)-(\d+)')
>>> m = datepat.match('2016-01-28')
>>> m
<_sre.SRE_Match object; span=(0, 10), match='2016-01-28'>
>>> m.group(0)
'2016-01-28'
>>> m.groups()
('2016', '01', '28') 
>>> 
>>> text = 'today is 2016-01-28. lesson start 2016-01-01' 
>>> from calendar import month_abbr
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(2))]
...     return '{} {} {}'.format(m.group(3),mon_name,m.group(1))
... 
>>> datepat.sub(change_date,text)
'today is 28 Jan 2016. lesson start 01 Jan 2016'
>>> # Get the number of substitutions made
>>> newtext, n = datepat.subn(r'\3/\2/\1',text)
>>> newtext
'today is 28/01/2016. lesson start 01/01/2016'
>>> n
2
>>>

Case-insensitive search and replace

>>> import re
>>> text4 = 'PYTHON, pYTHON,Python python'
>>> re.findall('python',text4,flags=re.IGNORECASE)
['PYTHON', 'pYTHON', 'Python', 'python']
>>> 
>>> re.sub('python','snake',text4,flags=re.IGNORECASE)
'snake, snake,snake snake'
>>>
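
In the example above the replacement text is always lowercase, whatever the case of the match. If the replacement should follow the case of the matched text, a callback can handle it. The matchcase function below is a sketch under that assumption and is not part of the original article.

```python
import re

def matchcase(word):
    """Return a callback that adapts `word` to the case of each match."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            # Mixed case with a lowercase first letter: leave word unchanged.
            return word
    return replace

text4 = 'PYTHON, pYTHON,Python python'
print(re.sub('python', matchcase('snake'), text4, flags=re.IGNORECASE))
```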

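Shortest-match (non-greedy) patterns

Suppose you want to match the text between double quotes. The result is sometimes not what you want, because * is greedy: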

>>> text1 = 'you says "no."'
>>> str_pat = re.compile(r'\"(.*)\"')
>>> str_pat.findall(text1)
['no.']
>>> text2 = 'you says "no.", I say "yes."'
>>> str_pat.findall(text2)
['no.", I say "yes.']
>>>

In that case, add the ? modifier so that the match is as short as possible:

>>> str_pat = re.compile(r'\"(.*?)\"')
>>> str_pat.findall(text2)
['no.', 'yes.']
>>>

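The dot (.) matches any single character except a newline. Adding a ? after an operator such as * or + forces the matching algorithm to look for the shortest possible match.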

Multiline matching

The dot (.) does not match a newline. One workaround is the following:

>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a 
... multiline comment */
... '''
>>> comment = re.compile(r'/\*(.*?)\*/')
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>> # Add support for newlines
... 
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> 
>>> comment.findall(text2)
[' this is a \nmultiline comment ']
>>>

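Here (.*?) is a non-greedy match of the text between the /* and */ markers. (?:.|\n)*? is also non-greedy, but it is built on a non-capturing group that accepts either a character matched by . or a newline, so the comment body can span several lines.

Alternatively, use re.DOTALL, which makes the dot (.) in a regular expression match any character, including a newline: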

>>> comment = re.compile(r'/\*(.*?)\*/',re.DOTALL)
>>> comment.findall(text2)
[' this is a \nmultiline comment ']
>>>

When you have the choice, however, it is usually better to write the pattern itself so that it works without extra flags.
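
A small check that the explicit pattern and the re.DOTALL version agree on text containing both a multiline and a single-line comment. The sample text is made up for illustration.

```python
import re

text = '/* first\ncomment */ x = 1; /* second */'

# Explicit pattern: the non-capturing group accepts a newline as well as '.'.
explicit = re.compile(r'/\*((?:.|\n)*?)\*/')
# Flag-based pattern: re.DOTALL changes the meaning of every '.' in the pattern.
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)

print(explicit.findall(text))
print(explicit.findall(text) == dotall.findall(text))
```

The flag affects every dot in the pattern, which is one reason the explicit form can be easier to reason about when several patterns are later combined into one.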

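Normalizing Unicode text

Use the unicodedata module to normalize text before comparing it. The first argument to normalize() specifies the normalization form. NFC means characters should be fully composed (a single code point where possible); NFD means characters should be fully decomposed into a base character plus combining characters.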

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15
>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC',s1)
>>> t2 = unicodedata.normalize('NFC',s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> print(ascii(t2))
'Spicy Jalape\xf1o'
>>> 
>>> t1
'Spicy Jalapeño'
>>> 
>>> t1 = unicodedata.normalize('NFD',s1)
>>> t1
'Spicy Jalapeño'
>>> ''.join(c for c in t1 if not unicodedata.combining(c))
'Spicy Jalapeno'
>>>
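
The comparison and the mark-stripping steps above can be packaged as two small functions. This is a sketch; same_text and strip_marks are hypothetical names chosen for illustration.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def strip_marks(s):
    """Decompose s, then drop every combining character."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

s1 = 'Spicy Jalape\u00f1o'    # precomposed n-with-tilde
s2 = 'Spicy Jalapen\u0303o'   # 'n' followed by a combining tilde
print(s1 == s2, same_text(s1, s2))
print(strip_marks(s1))
```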

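Stripping unwanted characters from strings

You may want to remove characters, such as whitespace, from the start, end, or middle of a string. strip() removes characters from the start and end and never touches the middle; lstrip() and rstrip() strip from the left and from the right respectively. By default these methods remove whitespace, but you can specify other characters.

To remove characters in the middle, use replace(), re.sub(), and so on: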

>>> t = '---------hello========'
>>> t.lstrip('-')
'hello========'
>>> t.rstrip('=')
'---------hello'
>>> t.strip('-=')
'hello'
>>> 
>>> s = '  Hello    World \n'
>>> s.strip()
'Hello    World'
>>> s.lstrip()
'Hello    World \n'
>>> s.rstrip()
'  Hello    World'
>>> # Replace characters in the middle
... 
>>> s.replace('   ','')
'  Hello World \n'
>>> 
>>> import re
>>> re.sub(r'\s+',' ',s)
' Hello World '
>>>

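Sanitizing and cleaning up text

Users sometimes enter text containing diacritics and stray whitespace, for example 'pýtĥöñ\fis\tawesome\r\n' in a registration form. The str.translate() method can clean this up.

In the session that follows, a small remapping table first cleans up the whitespace. Then dict.fromkeys() builds a dictionary whose keys are all the Unicode combining characters and whose values are all None, unicodedata.normalize() converts the input to its decomposed form, and translate() deletes all the diacritical marks.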

>>> remap = {
...     ord('\t') : ' ',
...     ord('\f') : ' ',
...     ord('\r') : None
... }
>>> s = 'pýtĥöñ\fis\tawesome\r\n'
>>> a = s.translate(remap)
>>> a
'pýtĥöñ is awesome\n'
>>> # The whitespace characters \t and \f have been remapped to spaces
... 
>>> import unicodedata
>>> import sys
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
>>> b = unicodedata.normalize('NFD',a)
>>> b
'pýtĥöñ is awesome\n'
>>> b.translate(cmb_chrs)
'python is awesome\n'
>>>

Here is a translation table that maps all Unicode decimal digit characters to their ASCII equivalents:

>>> import sys
>>> import unicodedata
>>> digitmap = {c:ord('0') + unicodedata.digit(chr(c)) for c in range(sys.maxunicode) if unicodedata.category(chr(c)) == 'Nd'}
>>> len(digitmap)
460
>>> x = '\u0661\u0662\u0663'
>>> x.translate(digitmap)
'123'

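Another cleaning technique involves the I/O encoding and decoding functions. The idea is to do some preliminary cleanup of the text and then run it through encode() and decode() to strip or alter it.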

>>> a = 'pýtĥöñ is awesome\n'
>>> b = unicodedata.normalize('NFD',a)
>>> b
'pýtĥöñ is awesome\n'
>>> b.encode('ascii','ignore').decode('ascii')
'python is awesome\n'
>>>

# Plain str.replace() calls are another simple way to clean up whitespace
def clean_spaces(s):
    s = s.replace('\r', '')
    s = s.replace('\t', ' ')
    s = s.replace('\f', ' ')
    return s

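In the encode()/decode() example, the normalization step decomposes the original text into base characters plus separate combining characters; the ASCII encode/decode step then simply discards all of those characters at once. This approach only works when the final goal is an ASCII representation of the text.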
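
The whitespace remapping, the normalization, and the ASCII round trip can be chained into one function. This is a minimal sketch, assuming an ASCII-only result is acceptable; to_ascii is a hypothetical name, and any character without an ASCII base form is silently dropped.

```python
import unicodedata

# Map tab and form feed to a space and delete carriage returns.
_WHITESPACE_MAP = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}

def to_ascii(text):
    """Return an ASCII-only approximation of text."""
    text = text.translate(_WHITESPACE_MAP)
    # NFD splits accented letters into a base letter plus combining marks.
    text = unicodedata.normalize('NFD', text)
    # The combining marks are not ASCII, so 'ignore' discards them.
    return text.encode('ascii', 'ignore').decode('ascii')

# Same sample string as in the article, written with escapes.
sample = 'p\u00fdt\u0125\u00f6\u00f1\fis\tawesome\r\n'
print(repr(to_ascii(sample)))
```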

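Aligning text strings

For simple alignment, use the string methods ljust(), rjust(), and center(). Alternatively, use format() with the >, <, or ^ character followed by the desired width.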

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '
>>> text.rjust(20,'=')
'=========Hello World'
>>> text.center(20,'-')
'----Hello World-----'
>>> 
>>> format(text,'>20')
'         Hello World'
>>> format(text,'<20')
'Hello World         '
>>> format(text,'^20')
'    Hello World     '
>>> 
>>> # Put the fill character before the alignment character
... 
>>> format(text,'-^20s')
'----Hello World-----'
>>> format(text,'-^20')
'----Hello World-----'
>>> # These format codes also work in the format() method when formatting several values:
... 
>>> '{:>10s} {:>10s}'.format('Hello','World')
'     Hello      World'
>>> # format() works with any value
... 
>>> x = 1.2534
>>> format(x,'>10')
'    1.2534'
>>> format(x,'^10.2f')
'   1.25   '
>>>
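
The alignment codes are handy for printing simple tables. The sketch below uses made-up sample rows and column widths purely for illustration.

```python
# Hypothetical sample data: (name, quantity, price).
rows = [('apple', 3, 0.5), ('watermelon', 12, 2.25)]

# Left-align the name, right-align the numeric columns.
lines = ['{:<12}{:>5}{:>8}'.format('name', 'qty', 'price')]
for name, qty, price in rows:
    lines.append('{:<12}{:>5d}{:>8.2f}'.format(name, qty, price))

print('\n'.join(lines))
```

Every line comes out 25 characters wide, so the columns line up whatever the length of each value, provided no value exceeds its column width.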

Concatenating strings

>>> a = 'beijing'
>>> b = 'is'
>>> c = 'center'
>>> print(a + ':' + b + ':' + c)
beijing:is:center
>>> print(':'.join([a,b,c]))
beijing:is:center 
>>> print(a,b,c,sep=':') #best
beijing:is:center
>>>

Interpolating variables in strings

>>> s = '{name} has {n} messages.'
>>> s.format(name='QHS',n= 200)
'QHS has 200 messages.'
>>> # If the variables are in scope, combine format_map() with vars()
... 
>>> name = 'QHS'
>>> n = 200
>>> s.format_map(vars())
'QHS has 200 messages.'
>>> # vars() also works with instances
... 
>>> class Info:
...     def __init__(self,name,n):
...         self.name = name
...         self.n = n
... 
>>> 
>>> a = Info('QHS',200)
>>> s.format_map(vars(a))
'QHS has 200 messages.'
>>> # format() and format_map() raise an error when a variable is missing
... 
>>> s.format(name='QHS')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'n'
>>>

One way to avoid this error is to define a dictionary subclass with a __missing__() method:

>>> class safesub(dict):
...     """Return the placeholder unchanged when a key is missing"""
...     def __missing__(self,key):
...         return '{' + key + '}'
... 
>>> del n
>>> s
'{name} has {n} messages.'
>>> s.format_map(safesub(vars()))
'QHS has {n} messages.'
>>> 
>>> # The substitution step can be wrapped in a utility function:
... 
>>> import sys
>>> def sub(text):
...     return text.format_map(safesub(sys._getframe(1).f_locals))
... 
>>> 
>>> # Now you can write:
... 
>>> name = 'qhs'
>>> n = 200
>>> print(sub('Hello {name}'))
Hello qhs
>>> print(sub('You have {n} messages.'))
You have 200 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}
>>>

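sub() uses sys._getframe(1) to return the caller's stack frame, whose f_locals attribute gives access to the caller's local variables. Manipulating stack frames directly is certainly discouraged in most code, but it can be useful for a utility such as this string substitution function. Note also that, on the Python 3.4 interpreter used in this article, f_locals is a dictionary holding a copy of the calling function's local variables. You can modify its contents, but the changes have no effect on later variable access. So although reaching into another stack frame looks evil, nothing done with it can overwrite or change the values of the caller's local variables.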
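
When frame inspection feels too implicit, the values can be passed explicitly and the same __missing__ idea still applies. This is a sketch of that alternative, not the article's approach; SafeSub and render are hypothetical names.

```python
class SafeSub(dict):
    """Dictionary that returns the placeholder itself for missing keys."""
    def __missing__(self, key):
        return '{' + key + '}'

def render(template, **values):
    """Fill in the placeholders for which values were given; keep the rest."""
    return template.format_map(SafeSub(values))

print(render('{name} has {n} messages.', name='QHS'))
print(render('{name} has {n} messages.', name='QHS', n=200))
```

The caller decides exactly which names are visible to the template, at the cost of a little more typing.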

Reformatting text to a fixed number of columns

To reformat long strings to a given column width, use the textwrap module.

>>> s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
... the eyes, not around the eyes, don't look around the eyes, \
... look into my eyes, you're under."
>>> import textwrap
>>> print(textwrap.fill(s,50))
Look into my eyes, look into my eyes, the eyes,
the eyes, the eyes, not around the eyes, don't
look around the eyes, look into my eyes, you're
under.
>>> 
>>> print(textwrap.fill(s,70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
>>> 
>>> print(textwrap.fill(s,40,initial_indent='    '))
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
>>> 
>>> print(textwrap.fill(s,40,subsequent_indent='    '))
Look into my eyes, look into my eyes,
    the eyes, the eyes, the eyes, not
    around the eyes, don't look around
    the eyes, look into my eyes, you're
    under.
>>> 
>>> # To fit the terminal automatically, get its size with os.get_terminal_size()
... 
>>> import os
>>> os.get_terminal_size().columns
196
>>>

fill() accepts other optional arguments that control how tabs, sentence endings, and so on are handled.
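
A sketch of wrapping text to the current terminal width. It uses shutil.get_terminal_size() rather than os.get_terminal_size() because the shutil version accepts a fallback size for when no terminal is attached. The function name, the margin parameter, and the minimum width of 20 are illustrative choices.

```python
import shutil
import textwrap

def wrap_to_terminal(text, margin=2, columns=None):
    """Wrap text to the terminal width minus a margin.

    columns can be given explicitly; otherwise the terminal size is
    queried, falling back to 80 columns when it cannot be determined.
    """
    if columns is None:
        columns = shutil.get_terminal_size(fallback=(80, 24)).columns
    width = max(20, columns - margin)
    return textwrap.fill(text, width)

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")
print(wrap_to_terminal(s, columns=40))
```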

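Handling HTML and XML entities in text

For example, you may want to replace entities such as &entity; or &#code; with their corresponding text, or escape certain characters in text (such as <, >, or &).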

>>> # Use html.escape() to replace '<' and '>' in a text string
... 
>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> #disable escaping of quotes
... 
>>> print(html.escape(s,quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>>

If you are producing ASCII text and want to embed character code entities for the non-ASCII characters, pass errors='xmlcharrefreplace' to the relevant I/O functions:

>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii',errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'
>>>
>>> # To convert entities back to text, use html.unescape()
>>> # (HTMLParser.unescape() was removed in Python 3.9)
>>> import html
>>> s = 'Spicy Jalape&#241;o'
>>> html.unescape(s)
'Spicy Jalapeño'
>>> 
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
>>>
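
A quick round-trip check for the escaping functions: text escaped with html.escape() or encoded with xmlcharrefreplace should come back unchanged through html.unescape(). This is a sketch using the article's own sample strings.

```python
import html

s = 'Elements are written as "<tag>text</tag>".'
escaped = html.escape(s)
print(escaped)
print(html.unescape(escaped) == s)

# Non-ASCII characters become numeric character references ...
ascii_bytes = 'Spicy Jalape\u00f1o'.encode('ascii', errors='xmlcharrefreplace')
# ... which html.unescape() turns back into the original characters.
restored = html.unescape(ascii_bytes.decode('ascii'))
print(restored)
```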

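Tokenizing text

Suppose you want to parse a string from left to right into a stream of tokens.

Given the following text string:

text = 'foo = 23 + 42 * 10'

To tokenize the string you need to do more than match patterns; you also need to identify the kind of pattern that matched. For example, you might want to turn the string into a sequence of pairs like this:

tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]

To do this kind of splitting, the first step is to define all the possible tokens, including whitespace, using regular expressions with named capture groups: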

>>> import re
>>> NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
>>> NUM = r'(?P<NUM>\d+)'
>>> PLUS = r'(?P<PLUS>\+)'
>>> TIMES = r'(?P<TIMES>\*)'
>>> EQ = r'(?P<EQ>=)'
>>> WS = r'(?P<WS>\s+)'
>>> master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
>>> 
>>> master_pat
re.compile('(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)|(?P<NUM>\\d+)|(?P<PLUS>\\+)|(?P<TIMES>\\*)|(?P<EQ>=)|(?P<WS>\\s+)')
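>>> # The ?P<TOKENNAME> syntax assigns a name to a pattern for later use
... 
>>>

Next, to tokenize, use the pattern object's scanner() method. It creates a scanner object on which repeated calls to match() step through the supplied text one match at a time. The following interactive example shows how a scanner object works:
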
>>> scanner = master_pat.scanner('foo = 42')
>>> scanner.match()
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> _.lastgroup, _.group()
('NAME', 'foo')
>>> scanner.match()
<_sre.SRE_Match object; span=(3, 4), match=' '>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object; span=(4, 5), match='='>
>>> _.lastgroup, _.group()
('EQ', '=')
>>> scanner.match()
<_sre.SRE_Match object; span=(5, 6), match=' '>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object; span=(6, 8), match='42'>
>>> _.lastgroup, _.group()
('NUM', '42')
>>>

To use this technique in real code, it is easy to package the code above into a generator:

# First, a quick look at named tuples:
>>> import collections
>>> Person = collections.namedtuple('Person','name age gender')
>>> print('Type of Person:',type(Person))
Type of Person: <class 'type'>
>>> 
>>> Bob = Person(name='Bob', age=30,gender='male')
>>> print('Representation:',Bob)
Representation: Person(name='Bob', age=30, gender='male')
>>> print(Bob.name)
Bob
>>> print(Bob.name,Bob.age,Bob.gender)
Bob 30 male
>>> # format() needs the fields unpacked with *Bob; passing Bob alone fails
>>> print("{} is {} years old {}".format(Bob))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: tuple index out of range
>>> 
>>> print("%s is %d years old %s." % Bob)
Bob is 30 years old male.
>>>
# The tokenizing generator:
>>> from collections import namedtuple
>>> def generate_tokens(pat,text):
...     Token = namedtuple('Token',['type','value'])
...     scanner = pat.scanner(text)
...     for m in iter(scanner.match, None):
...         yield Token(m.lastgroup, m.group())
... 
>>> import re
>>> NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
>>> NUM = r'(?P<NUM>\d+)'
>>> PLUS = r'(?P<PLUS>\+)'
>>> TIMES = r'(?P<TIMES>\*)'
>>> EQ = r'(?P<EQ>=)'
>>> WS = r'(?P<WS>\s+)'
>>> 
>>> master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
>>> 
>>> for tok in generate_tokens(master_pat,'foo = 42'):
...     print(tok)
... 
Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='42')
>>>
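
In practice the whitespace tokens are usually not wanted. One way to drop them is to filter the token stream, as in this sketch. It repeats the article's definitions so that it runs on its own; significant_tokens is a hypothetical helper name.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

MASTER_PAT = re.compile('|'.join([
    r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)',
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<TIMES>\*)',
    r'(?P<EQ>=)',
    r'(?P<WS>\s+)',
]))

def generate_tokens(pat, text):
    """Yield a Token for each successive match, as in the article."""
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

def significant_tokens(text):
    """Return all tokens except whitespace."""
    return [tok for tok in generate_tokens(MASTER_PAT, text) if tok.type != 'WS']

for tok in significant_tokens('foo = 23 + 42 * 10'):
    print(tok)
```

Scanning stops silently at the first character that no pattern matches, so real code may want to check that the whole input was consumed.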

If one pattern happens to be a substring of a longer pattern, make sure the longer pattern comes first. For example:

>>> LT = r'(?P<LT><)'
>>> LE = r'(?P<LE><=)'
>>> EQ = r'(?P<EQ>=)'
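>>> master_pat = re.compile('|'.join([LE, LT, EQ])) # Correct
>>> # master_pat = re.compile('|'.join([LT, LE, EQ])) # Incorrect

Performing text operations on byte strings

Byte strings support most of the same built-in operations as text strings, such as stripping, searching, and replacing: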

>>> data = b'Hello World'
>>> data[0:5]
b'Hello'
>>> data.split()
[b'Hello', b'World']
>>> 
>>> data.replace(b'Hello',b'Hello bad')
b'Hello bad World'
>>> # The same operations work on byte arrays
... 
>>> data = bytearray(b'Hello World')
>>> data[:5]
bytearray(b'Hello')
>>> data.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> 
>>> import re
>>> data = b'qi:heng:shan'
>>> re.split('[:]',data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python3.4/lib/python3.4/re.py", line 200, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: can't use a string pattern on a bytes-like object
>>> 
>>> re.split(b'[:]',data)
[b'qi', b'heng', b'shan']
>>>

One difference: indexing a text string returns the corresponding character, whereas indexing a byte string returns an integer:

>>> a = 'Hello World'
>>> b = b'Hello World'
>>> a[0]
'H'
>>> b[0]
72
>>> print(b)
b'Hello World'
>>> print(b.decode('ascii'))
Hello World
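>>> # Decode to a text string first to print it cleanly
>>> # Byte strings have no format() method
>>> # To format a byte string, format a normal text string first, then encode it as bytes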
>>> '{:10s} {:10s} {:>10s}'.format('python','is','good').encode('ascii')
b'python     is               good'
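
The format-then-encode step can be wrapped in a small helper. This is a sketch; format_bytes is a hypothetical name, and the default encoding of ASCII is an assumption that suits the article's example.

```python
def format_bytes(template, *args, encoding='ascii'):
    """Format a text template, then encode the result as bytes."""
    return template.format(*args).encode(encoding)

record = format_bytes('{:10s} {:10s} {:>10s}', 'python', 'is', 'good')
print(record)
# Indexing bytes gives an integer; indexing the decoded text gives a character.
print(record[0], record.decode('ascii')[0])
```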
