Python 3 study notes: the str type and the string module

Reposted article, originally published 2018-09-03 17:28. Category: Python

Environment: Windows 10 Home (Chinese edition), Python 3.6.4.

Official Python 3.7 documentation:

Text Sequence Type — str

string — Common string operations

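The str type

Python (here meaning Python 3) has strings, and their type is str. A string is an immutable sequence of Unicode code points.

String literals can be written in three ways:

Single quotes, double quotes, and triple quotes (triple quoted).

Single-quoted strings may contain double quotes and vice versa. Triple-quoted strings may contain both kinds of quote and may span multiple lines. To embed the enclosing quote character itself, escape it with a backslash.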
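A minimal sketch of the three quoting styles described above. The variable names are only illustrative.

```python
# Each literal style produces an ordinary str; only the source spelling differs.
single = 'He said "hi"'            # double quotes inside single quotes
double = "It's fine"               # single quote inside double quotes
escaped = 'It\'s fine'             # the enclosing quote escaped with a backslash
triple = """First line with 'single' quotes
second line with "double" quotes"""

print(double == escaped)           # the two spellings give the same string
print(triple.splitlines())         # the triple-quoted literal holds two lines
```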

A string can also be created with the str constructor:

class str(object='')
class str(object=b'', encoding='utf-8', errors='strict')

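Note that the second form builds a string from bytes (more precisely, from a bytes-like object such as bytes or bytearray). In other words it decodes bytes into a string, so the encoding argument must match the data.

Note also that str(bytes, encoding, errors) is equivalent to bytes.decode(encoding, errors).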
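The equivalence above can be checked with a short sketch. The sample text "café" is an arbitrary choice for illustration.

```python
# Decode UTF-8 bytes into a str in two equivalent ways.
data = "café".encode("utf-8")          # bytes object; 'é' takes two bytes

via_constructor = str(data, encoding="utf-8", errors="strict")
via_decode = data.decode("utf-8", "strict")

# Without an encoding argument, str() does not decode at all:
# it returns the printable representation of the bytes object.
not_decoded = str(b"abc")

# With the wrong encoding, errors= decides what happens; "replace" substitutes
# U+FFFD for undecodable bytes instead of raising UnicodeDecodeError.
lossy = str(data, encoding="ascii", errors="replace")

print(via_constructor == via_decode)
print(not_decoded)
print(lossy)
```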

New:

When two string literals are separated only by whitespace, they are automatically joined into a single string literal.

>>> "sdfs" "www"
'sdfswww'
>>> ("sdfs" "www")
'sdfswww'
>>> "sdfs"         "www" # several spaces
'sdfswww'

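Reference: the syntax of string literals (somewhat complicated and not obvious at first glance; worth digging into later, skipped here)

Strings are immutable, but new strings can be assembled with the str.join() method or with the io.StringIO class from the io module. Their signatures are: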

str.join(iterable)

class io.StringIO(initial_value='', newline='\n')

I still need to dig into the latter; the former I understand reasonably well.
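A small sketch of both approaches, assuming a list of words that should end up in one space-separated string. The names words, joined and built are illustrative only.

```python
import io

words = ["str", "is", "immutable"]

# str.join: the separator string is placed between the items of the iterable.
joined = " ".join(words)

# io.StringIO: an in-memory text buffer that is written to piece by piece;
# getvalue() returns everything written so far as a single str.
buffer = io.StringIO()
for word in words:
    buffer.write(word)
    buffer.write(" ")
built = buffer.getvalue().rstrip()     # drop the trailing space

print(joined)
print(built)
```

Both produce a new string object each time; the original strings are never modified.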

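For a long time I did not understand what the r and u prefixes in front of a string do. Now I do:

-r means every character in the string stands for itself: a backslash is just a backslash and does not start an escape sequence. '\n' is a newline, a single character, whereas r'\n' is two characters: a backslash and a lowercase n.

-u marks the string as a Unicode string. Python 3 keeps it only for compatibility with Python 2. Since all Python 3 strings are Unicode by default, the prefix is unnecessary, and it cannot be combined with r.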
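The two prefixes can be compared in a few lines. This is only a sketch; compile() is used here so that the invalid ur prefix can be shown without breaking the script itself.

```python
newline = "\n"         # one character: a line feed
raw = r"\n"            # two characters: a backslash and the letter n

print(len(newline), len(raw))
print(raw == "\\n")                # a raw string equals the escaped spelling
print(u"text" == "text")           # the u prefix changes nothing in Python 3

# Combining u and r is a syntax error in Python 3.
try:
    compile('ur"x"', "<example>", "eval")
    combined_ok = True
except SyntaxError:
    combined_ok = False
print(combined_ok)
```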

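When should repr() be used? I am not entirely sure. The RUNOOB tutorial explains it this way: repr() converts an object into a form the interpreter can read back, that is, one that can be passed to eval().

Some tests: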

>>> x = 'n'
>>> y = '\n'
>>> d1 = 123
>>> f1 = 999.87
>>> repr(x), repr(y), repr(d1), repr(f1)
("'n'", "'\\n'", '123', '999.87')
>>> len(repr(x)), len(repr(y)), len(repr(d1)), len(repr(f1))
(3, 4, 3, 6)
>>> eval(repr(x)), eval(repr(y)), eval(repr(d1)), eval(repr(f1))
('n', '\n', 123, 999.87)
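One way to see when repr() is useful is to compare it with str() on the same object. The sketch below uses datetime.date as an arbitrary example type: str() aims at a readable form, while repr() aims at an unambiguous one that can often be evaluated back into an equal object.

```python
import datetime

day = datetime.date(2018, 9, 3)

readable = str(day)                # '2018-09-03'
exact = repr(day)                  # 'datetime.date(2018, 9, 3)'

# eval() can rebuild the object from its repr, provided the names used in
# the repr (here: datetime) are available in the current namespace.
round_trip = eval(exact) == day

# For strings, repr() shows quotes and escape sequences, which helps when
# debugging invisible characters such as tabs.
tabbed = "a\tb"
visible = repr(tabbed)             # "'a\\tb'"

print(readable, exact, round_trip, visible)
```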

Attributes and methods of string objects: dir(str) lists all of them (in Python 3.6, 44 of them are public methods that can be called directly):

>>> for attr in dir(str):
...     print(attr, type(getattr(str, attr)))  # getattr avoids building code for eval

__add__ <class 'wrapper_descriptor'>
__class__ <class 'type'>
__contains__ <class 'wrapper_descriptor'>
__delattr__ <class 'wrapper_descriptor'>
__dir__ <class 'method_descriptor'>
__doc__ <class 'str'>
__eq__ <class 'wrapper_descriptor'>
__format__ <class 'method_descriptor'>
__ge__ <class 'wrapper_descriptor'>
__getattribute__ <class 'wrapper_descriptor'>
__getitem__ <class 'wrapper_descriptor'>
__getnewargs__ <class 'method_descriptor'>
__gt__ <class 'wrapper_descriptor'>
__hash__ <class 'wrapper_descriptor'>
__init__ <class 'wrapper_descriptor'>
__init_subclass__ <class 'builtin_function_or_method'>
__iter__ <class 'wrapper_descriptor'>
__le__ <class 'wrapper_descriptor'>
__len__ <class 'wrapper_descriptor'>
__lt__ <class 'wrapper_descriptor'>
__mod__ <class 'wrapper_descriptor'>
__mul__ <class 'wrapper_descriptor'>
__ne__ <class 'wrapper_descriptor'>
__new__ <class 'builtin_function_or_method'>
__reduce__ <class 'method_descriptor'>
__reduce_ex__ <class 'method_descriptor'>
__repr__ <class 'wrapper_descriptor'>
__rmod__ <class 'wrapper_descriptor'>
__rmul__ <class 'wrapper_descriptor'>
__setattr__ <class 'wrapper_descriptor'>
__sizeof__ <class 'method_descriptor'>
__str__ <class 'wrapper_descriptor'>
__subclasshook__ <class 'builtin_function_or_method'>
capitalize <class 'method_descriptor'>
casefold <class 'method_descriptor'>
center <class 'method_descriptor'>
count <class 'method_descriptor'>
encode <class 'method_descriptor'>
endswith <class 'method_descriptor'>
expandtabs <class 'method_descriptor'>
find <class 'method_descriptor'>
format <class 'method_descriptor'>
format_map <class 'method_descriptor'>
index <class 'method_descriptor'>
isalnum <class 'method_descriptor'>
isalpha <class 'method_descriptor'>
isdecimal <class 'method_descriptor'>
isdigit <class 'method_descriptor'>
isidentifier <class 'method_descriptor'>
islower <class 'method_descriptor'>
isnumeric <class 'method_descriptor'>
isprintable <class 'method_descriptor'>
isspace <class 'method_descriptor'>
istitle <class 'method_descriptor'>
isupper <class 'method_descriptor'>
join <class 'method_descriptor'>
ljust <class 'method_descriptor'>
lower <class 'method_descriptor'>
lstrip <class 'method_descriptor'>
maketrans <class 'builtin_function_or_method'>
partition <class 'method_descriptor'>
replace <class 'method_descriptor'>
rfind <class 'method_descriptor'>
rindex <class 'method_descriptor'>
rjust <class 'method_descriptor'>
rpartition <class 'method_descriptor'>
rsplit <class 'method_descriptor'>
rstrip <class 'method_descriptor'>
split <class 'method_descriptor'>
splitlines <class 'method_descriptor'>
startswith <class 'method_descriptor'>
strip <class 'method_descriptor'>
swapcase <class 'method_descriptor'>
title <class 'method_descriptor'>
translate <class 'method_descriptor'>
upper <class 'method_descriptor'>
zfill <class 'method_descriptor'>


'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 
'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 
'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 
'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 
'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 
'swapcase', 'title', 'translate', 'upper', 'zfill'

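These cover searching, stripping leading and trailing whitespace, testing what kind of characters a string contains, splitting (splitting on any of several delimiter characters requires the re module), case conversion, conversion to bytes (encode), string formatting (briefly introduced later in this article), centering, left and right justification, replacement (replace), and so on.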
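A quick tour of some of these methods, as a sketch on an arbitrary sample string:

```python
text = "  Hello, Python 3  "

stripped = text.strip()                        # remove surrounding whitespace
position = stripped.find("Python")             # index of the first match, -1 if absent
upper = stripped.upper()                       # case conversion
replaced = stripped.replace("Python", "CPython")
encoded = stripped.encode("utf-8")             # str -> bytes
parts = "a,b,c".split(",")                     # split on a separator
centered = "str".center(9, "*")                # '***str***'
padded = "str".ljust(6, ".")                   # 'str...'
zero_filled = "7".zfill(3)                     # '007'
only_digits = "2018".isdigit()                 # character-class test

print(stripped, position, upper, replaced, encoded)
print(parts, centered, padded, zero_filled, only_digits)
```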

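The string module

The string module contains some string constants, plus the Formatter class, the Template class, and one helper function, capwords (string.capwords(s, sep=None)).

The Formatter class is used for string formatting, and it can be subclassed to build custom formatters. The Template class provides simple string substitution; its main use case is internationalization (i18n).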
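The capwords helper mentioned above can be sketched as follows, next to str.title() for comparison. The sample sentence is arbitrary.

```python
import string

sentence = "they're   learning python"

# capwords splits on whitespace, capitalizes each word, and joins the words
# with a single space, so runs of spaces collapse.
cap = string.capwords(sentence)

# str.title() treats every non-letter as a word boundary, including the
# apostrophe, and keeps the original spacing.
titled = sentence.title()

print(cap)
print(titled)
```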

The string constants are listed below. They did not seem very useful to me:

string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase
string.digits
string.hexdigits
string.octdigits
string.punctuation
string.printable
string.whitespace

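Below is a test. Writing two names next to each other raises a SyntaxError. This does not contradict the rule above: implicit concatenation applies only to adjacent string literals, not to variables or other expressions, which must be joined with + instead:

>>> import string
>>> string.ascii_letters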
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.ascii_letters string.digits
SyntaxError: invalid syntax
>>> '2323' 'sdsds'
'2323sdsds'
>>> 
>>> type(string.digits)
<class 'str'>
>>> type(string.ascii_letters)
<class 'str'>
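Since the constants are ordinary str objects, they can be combined with + or str.join(). The sketch below also shows one possible use for them; the helper is_ascii_alphanumeric is hypothetical and not part of the string module.

```python
import string

# Explicit concatenation works for any str expressions.
alphabet = string.ascii_letters + string.digits
same = "".join([string.ascii_letters, string.digits])

def is_ascii_alphanumeric(token):
    """Hypothetical helper: True if token is non-empty and uses only alphabet."""
    return bool(token) and all(ch in alphabet for ch in token)

print(len(alphabet), alphabet == same)
print(is_ascii_alphanumeric("abc123"), is_ascii_alphanumeric("abc 123"))
```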

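Notes:

After going through str and string, I found that the string module is rarely needed: most string functionality lives in the str type itself. The exception is the Template class. The same result can be achieved with str's own formatting, but Template is more convenient because its syntax is simpler.

As for the Formatter class, the string module documentation says it uses the same syntax as str.format() when formatting, but developers can subclass Formatter to implement their own formatting behavior, which most developers will probably never need. Both syntaxes are related to formatted string literals but differ in some details; see the official documentation.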
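As a rough illustration of subclassing Formatter, the sketch below overrides get_value so that a missing keyword field renders as a placeholder instead of raising KeyError. The class name PlaceholderFormatter and the "<missing>" text are made up for this example.

```python
import string

class PlaceholderFormatter(string.Formatter):
    """Hypothetical subclass: unknown keyword fields become '<missing>'."""

    def get_value(self, key, args, kwargs):
        # key is an int for positional fields and a str for keyword fields.
        if isinstance(key, str) and key not in kwargs:
            return "<missing>"
        return super().get_value(key, args, kwargs)

formatter = PlaceholderFormatter()
result = formatter.format("{who} likes {what}", who="tim")
print(result)
```

The format string syntax stays the same as for str.format(); only the lookup of field values changes.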

A brief introduction to string formatting

From the material above, Python has 4 string formatting syntaxes:

1.printf-style String Formatting

format % values
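A short sketch of the printf-style operator with arbitrary sample values: a tuple on the right fills positional conversions, and a mapping fills named ones.

```python
# Flags, width and precision work much like C's printf.
row = "%-8s|%5d|%08.3f" % ("ab", 42, 3.14159)

# Named conversions take their values from a mapping.
named = "%(lang)s %(ver).1f" % {"lang": "Python", "ver": 3.6}

# A tuple that should be formatted as ONE value must itself be wrapped
# in a one-element tuple.
pair = "%s" % ((1, 2),)

print(row)
print(named)
print(pair)
```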

2.the newer formatted string literals

A formatted string literal or f-string is a string literal that is prefixed with 'f' or 'F'.

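Explanation: prefixing a string literal with f or F lets it be formatted with names from the current namespace, so the format string and the variables do not have to be passed together as in the other syntaxes. Quite advanced! (f-strings require Python 3.6 or later.)

Usage examples:

>>> import decimal
>>> from datetime import datetime
>>> name = "Fred"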
>>> f"He said his name is {name!r}."
"He said his name is 'Fred'."
>>> f"He said his name is {repr(name)}."  # repr() is equivalent to !r
"He said his name is 'Fred'."
>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal("12.34567")
>>> f"result: {value:{width}.{precision}}"  # nested fields
'result:      12.35'
>>> today = datetime(year=2017, month=1, day=27)
>>> f"{today:%B %d, %Y}"  # using date format specifier
'January 27, 2017'
>>> number = 1024
>>> f"{number:#0x}"  # using integer format specifier
'0x400'

3.the str.format() interface / string.Formatter

str.format(*args, **kwargs)

str.format_map(mapping) is similar to str.format(**mapping). The documentation says: Similar to str.format(**mapping), except that mapping is used directly and not copied to a dict.

Examples:

>>> "The sum of 1 + 2 is {0}".format(1+2)
'The sum of 1 + 2 is 3'

>>> '{0}, {1}, {2}'.format('a', 'b', 'c')
'a, b, c'
>>> '{}, {}, {}'.format('a', 'b', 'c')  # 3.1+ only
'a, b, c'
>>> '{2}, {1}, {0}'.format('a', 'b', 'c')
'c, b, a'
>>> '{2}, {1}, {0}'.format(*'abc')      # unpacking argument sequence
'c, b, a'
>>> '{0}{1}{0}'.format('abra', 'cad')   # arguments' indices can be repeated
'abracadabra'

>>> 'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'
>>> coord = {'latitude': '37.24N', 'longitude': '-115.81W'}
>>> 'Coordinates: {latitude}, {longitude}'.format(**coord)
'Coordinates: 37.24N, -115.81W'

str.format examples
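The difference between str.format_map(mapping) and str.format(**mapping) noted earlier becomes visible with a dict subclass. This is a sketch; the class name Default is illustrative.

```python
class Default(dict):
    """dict subclass that renders unknown keys back as a placeholder."""

    def __missing__(self, key):
        return "{" + key + "}"

template = "{who} likes {what}"

# format_map uses the mapping object directly, so __missing__ is honored.
result = template.format_map(Default(who="tim"))

# format(**mapping) copies the items into a plain dict first, so the
# subclass behavior is lost and the missing key raises KeyError.
try:
    template.format(**Default(who="tim"))
    copied_ok = True
except KeyError:
    copied_ok = False

print(result, copied_ok)
```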

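For more examples, see Format Examples in the string module documentation; it contains many advanced and more complex usages suited to further study.

4.template strings (PEP 292)

Substitution uses the dollar sign $, in two forms: $identifier and ${identifier}. Two dollar signs ($$) are an escape that stands for a single dollar sign $. It really is quite simple.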

class string.Template(template)

-substitute(mapping, **kwds)

-safe_substitute(mapping, **kwds)

Developers can subclass Template to implement custom template classes.

Examples from the official documentation:

>>> from string import Template
>>> s = Template('$who likes $what')
>>> s.substitute(who='tim', what='kung pao')
'tim likes kung pao'
>>> d = dict(who='tim')
>>> Template('Give $who $100').substitute(d)
Traceback (most recent call last):
...
ValueError: Invalid placeholder in string: line 1, col 11
>>> Template('$who likes $what').substitute(d)
Traceback (most recent call last):
...
KeyError: 'what'
>>> Template('$who likes $what').safe_substitute(d)
'tim likes $what'

string.Template examples
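Two further points from the description above can be sketched briefly: the $$ escape, and a custom template class obtained by subclassing Template. The subclass name PercentTemplate is invented for this example; it only changes the documented delimiter class attribute.

```python
from string import Template

# $$ is the escape for a literal dollar sign, which avoids the
# "Invalid placeholder" error seen with 'Give $who $100'.
escaped = Template("Give $who $$100").substitute(who="tim")

class PercentTemplate(Template):
    """Hypothetical subclass that uses % instead of $ as the delimiter."""
    delimiter = "%"

# With % as the delimiter, a plain $ needs no escaping.
custom = PercentTemplate("%who owes $100 to %{whom}").substitute(
    who="tim", whom="bob"
)

print(escaped)
print(custom)
```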

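Notes:

Syntax 1 resembles the format strings of the printf function in C;

For syntax 2, see reference 2; I have not read it closely yet;

Syntax 3 is used by str.format() and by the Formatter class in the string module. It is related to syntax 2: f-strings reuse the same format specification mini-language;

Syntax 4 is the simple string substitution facility provided by the string module.

I now know the basic usage of all four. The harder part is understanding their grammars, so the grammar descriptions for syntax 2 and syntax 3 are reproduced below from the official documentation (see it for the full explanation). These two are the hardest:

Syntax 2: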

f_string          ::=  (literal_char | "{{" | "}}" | replacement_field)*
replacement_field ::=  "{" f_expression ["!" conversion] [":" format_spec] "}"
f_expression      ::=  (conditional_expression | "*" or_expr)
                         ("," conditional_expression | "," "*" or_expr)* [","]
                       | yield_expression
conversion        ::=  "s" | "r" | "a"
format_spec       ::=  (literal_char | NULL | replacement_field)*
literal_char      ::=  <any code point except "{", "}" or NULL>

Syntax 3:

format_spec     ::=  [[fill]align][sign][#][0][width][grouping_option][.precision][type]
fill            ::=  <any character>
align           ::=  "<" | ">" | "=" | "^"
sign            ::=  "+" | "-" | " "
width           ::=  digit+
grouping_option ::=  "_" | ","
precision       ::=  digit+
type            ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"
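To make the format_spec grammar less abstract, here is a sketch that exercises several of its parts through the built-in format() function and str.format(). The values are arbitrary.

```python
zero_padded = format(42, "08.2f")      # [0][width][.precision][type]
grouped = format(1234567, ",")         # grouping_option ","
underscored = format(1000000, "_d")    # grouping_option "_" (Python 3.6+)
centered = format("str", "*^9")        # fill "*", align "^", width 9
hexed = format(255, "#x")              # "#" adds the 0x prefix
percent = format(0.25, ".1%")          # type "%" multiplies by 100
signed = format(5, "+d")               # sign "+"

# Width and precision can themselves come from nested replacement fields.
nested = "{:{width}.{precision}f}".format(3.14159, width=10, precision=3)

print(zero_padded, grouped, underscored, centered)
print(hexed, percent, signed, repr(nested))
```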

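Questions:

How did people come up with this way of describing syntax, and what is this notation called in computer science? I have seen it in other documents without understanding it. It is a variant of Backus-Naur Form (BNF), a notation for describing grammars that is covered in compiler theory; the Python language reference describes its syntax in a modified BNF. Whoever designed such notations clearly stood on the shoulders of earlier computer science pioneers.

That is all for this article. It covers most of the key points of Python's str type and string module. The string methods of str still need focused practice, and string formatting needs practice in more scenarios (the official examples deserve careful study).

Going further

While studying, I found that str.split appeared to fail on a Chinese sentence. The real cause is that str.split treats its argument as one whole separator, so split('了的') looks for the two-character substring '了的', which never occurs in the text. Splitting on either character requires the split function of the re module. Other Chinese-text topics (word segmentation, word clouds, natural language processing) still need digging into: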

>>> cnstr = '姑娘還曬出了自己的辭職信，引發眾多網友關注。姑娘說，她在這家公司上班6年，一個月3.5k左右。在辭職信中，她列出了7條離職原因。沒想好做什么，但是不能繼續這樣下去了’'
>>> cnstr.split('了的')
['姑娘還曬出了自己的辭職信，引發眾多網友關注。姑娘說，她在這家公司上班6年，一個月3.5k左右。在辭職信中，她列出了7條離職原因。沒想好做什么，但是不能繼續這樣下去了’']
>>> len(cnstr.split('了的')) # the substring '了的' never occurs, so the list has length 1
1

>>> import re
>>> re.split('[了的]', cnstr) # '[了的]' is a regular expression matching either character
['姑娘還曬出', '自己', '辭職信，引發眾多網友關注。姑娘說，她在這家公司上班6年，一個月3.5k左右。在辭職信中，她列出', '7條離職原因。沒想好做什么，但是不能繼續這樣下去', '’']
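The behavior is not specific to Chinese text. A compact sketch with a made-up string shows the difference between one multi-character separator and a set of single-character separators:

```python
import re

sample = "甲了乙的丙了的丁"

# str.split looks for the whole separator string.
whole = sample.split("了的")            # only the adjacent pair matches
single = sample.split("了")             # a one-character separator works fine

# A character class splits on either character; adjacent separators
# leave an empty string between them.
either = re.split("[了的]", sample)

# Adding + treats a run of separators as one.
runs = re.split("[了的]+", sample)

print(whole)
print(single)
print(either)
print(runs)
```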

References

1.Python3 strings, from RUNOOB.COM

2.Formatted string literals

3.Specific ways of splitting text in Python
