import numpy as np
import pandas as pd
Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with a string object's built-in methods. For more complex pattern matching and text manipulation, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expression methods concisely on whole arrays of data, additionally handling the annoyance of missing data.
String Object Methods
In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
These substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces  # sequence unpacking
first + "::" + second + "::" + third
'a::b::guido'
But this isn't a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
'::'.join(pieces)
'a::b::guido'
Other methods are concerned with locating substrings. Using Python's in keyword is the best way to detect a substring, though index and find can also be used:
"guido" in val
True
val.index(',')  # index of the first occurrence
1
val.find(":")  # first index of the substring, or -1 if absent
-1
Note the difference between find and index: index raises an exception if the string isn't found, whereas find returns -1:
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-2c016e7367ac> in <module>
----> 1 val.index(':')
ValueError: substring not found
val.find(':')
-1
Relatedly, count returns the number of occurrences of a particular substring:
val.count(',')
2
replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:
val.replace(',', ':')  # returns a new string
'a:b: guido'
val  # the original string is unchanged
'a,b, guido'
val.replace(',', '')  # pass an empty string to delete the pattern
'ab guido'
See Table 7-3 for a listing of some of Python's string methods.
Regular expressions can also be used with many of these operations, as you'll see.
Method | Description |
---|---|
count | Return the number of non-overlapping occurrences of a substring |
endswith | Return True if the string ends with the given suffix |
startswith | Return True if the string starts with the given prefix |
join | Use the string as a delimiter for concatenating a sequence of other strings |
index | Return the index of the first occurrence of a substring; raise ValueError if not found |
find | Return the index of the first occurrence of a substring; return -1 if not found |
rfind | Return the index of the last occurrence of a substring (searching from the right); return -1 if not found |
replace | Replace occurrences of one string with another |
strip | Trim whitespace from both sides, including line breaks |
rstrip | Trim whitespace on the right side |
lstrip | Trim whitespace on the left side |
split | Break the string into a list of substrings using a delimiter |
lower | Convert alphabetic characters to lowercase |
upper | Convert alphabetic characters to uppercase |
casefold | Convert characters to lowercase, converting any region-specific variable character combinations to a common comparable form |
ljust | Left-justify the string, padding the right side to a minimum width |
rjust | Right-justify the string, padding the left side to a minimum width |
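A few of the methods in Table 7-3 were not demonstrated above. Here is a minimal sketch of rfind, casefold, and ljust/rjust, using only built-in string behavior:

```python
val = 'a,b, guido'

# rfind searches from the right, returning the highest matching index
print(val.rfind(','))        # 3

# casefold is a more aggressive lower(); e.g. German 'ß' becomes 'ss'
print('Straße'.casefold())   # 'strasse'

# ljust/rjust pad a string to a given width with a fill character
print('ab'.ljust(5, '*'))    # 'ab***'
print('ab'.rjust(5, '*'))    # '***ab'
```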
Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings; I'll give a number of examples of its use here.
The art of writing regular expressions could be a chapter of its own and thus is outside the book's scope. There are many excellent tutorials and references available on the internet and in other books.
The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. Let's look at a simple example:
Suppose we want to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is r'\s+':
import re
text = "foo bar\t baz \tqux"
re.split(r"\s+", text)  # split on runs of whitespace
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:
regex = re.compile(r'\s+')  # compiling once is useful when reusing the pattern
regex.split(text)
['foo', 'bar', 'baz', 'qux']
If, instead, you want to get a list of all patterns matching the regex, you can use the findall method:
regex.findall(text)  # return all non-overlapping matches as a list
[' ', '\t ', ' \t']
To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first. More rigidly, match only matches at the beginning of the string. As a less trivial example, let's consider a block of text and a regular expression capable of identifying most email addresses:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
# match all email addresses
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
Using findall on the text produces a list of the email addresses:
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:
m = regex.search(text)  # returns only the first match
m  # a Match object
<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
regex.match returns None, as it only will match if the pattern occurs at the start of the string:
# returns None when the pattern does not match at the start of the string
print(regex.match(text))
None
Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:
# arguments: replacement string, text, and an optional count
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
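The sub call above replaces every occurrence. As a small sketch reusing the email regex from above, the optional count argument caps the number of replacements made:

```python
import re

# the email regex used earlier in this section
regex = re.compile(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}',
                   flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# with count=1, only the first match is replaced
print(regex.sub('REDACTED', text, count=1))
```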
Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:
pattern = r'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'  # parentheses create capture groups
regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
m = regex.match("wesm@bring.net")
m.groups()
('wesm', 'bring', 'net')
findall returns a list of tuples when the pattern has groups:
regex.findall(text)  # with groups, findall is very handy for data cleaning
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
sub also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
There is much more to regular expressions in Python, most of which is outside the book's scope. Table 7-4 provides a brief summary.
Method | Description |
---|---|
findall | Return all non-overlapping matching patterns in a string as a list |
finditer | Like findall, but returns an iterator of match objects |
match | Match the pattern only at the start of the string; return a match object, or None otherwise |
search | Scan the string for the pattern at any position; return a match object for the first occurrence, or None otherwise |
split | Break the string into pieces at each occurrence of the pattern |
sub, subn | Replace all (sub) or the first n (subn) occurrences of the pattern with a replacement string; use \1, \2, ... to refer to matched groups |
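Two entries in Table 7-4, finditer and subn, were not demonstrated above. A minimal sketch, reusing the email pattern from this section:

```python
import re

# the email regex used earlier in this section
regex = re.compile(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}',
                   flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# finditer yields match objects lazily instead of building a list
for m in regex.finditer(text):
    print(m.group(), m.span())

# subn works like sub but also returns the number of replacements made
new_text, n = regex.subn('REDACTED', text)
print(n)  # 2
```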
Vectorized String Processing in pandas
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:
data.str.contains("gmail")  # substring test, like Python's 'in'
Dave False
Steve True
Rob True
Wes NaN
dtype: object
Regular expressions can be used, too, along with any re option like IGNORECASE:
pattern
'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\\.([a-z0-9]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)  # applied to each non-NA element
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object
To access elements in the embedded lists, we can pass an index to either of these functions (note that in modern pandas versions str.match returns booleans, so both calls below yield NaN):
matches.str.get(1)
Dave NaN
Steve NaN
Rob NaN
Wes NaN
dtype: float64
matches.str[0]
Dave NaN
Steve NaN
Rob NaN
Wes NaN
dtype: float64
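The NaN results above arise because str.match now returns booleans rather than match groups, leaving nothing for str.get or str[] to index into. To retrieve the capture groups themselves, str.extract returns a DataFrame with one column per group:

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'

# str.extract returns one column per capture group; NA values stay NA
print(data.str.extract(pattern, flags=re.IGNORECASE))
```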
You can similarly slice strings using this syntax:
data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object
See Table 7-5 for more pandas string methods.
- cat Concatenate strings element-wise with optional delimiter
- contains Return boolean array indicating whether each string contains a pattern/regex
- count Count occurrences of a pattern
- extract Use a regex with groups to extract one or more strings from each element; the result is a DataFrame with one column per group
- endswith Equivalent to x.endswith(pattern) for each element
- startswith Equivalent to x.startswith(pattern) for each element
- findall Compute list of all occurrences of a pattern/regex for each string
- get Index into each element (retrieve the i-th element)
- isalnum Equivalent to built-in str.isalnum
- isalpha Equivalent to built-in str.isalpha
- isdecimal Equivalent to built-in str.isdecimal
- isdigit Equivalent to built-in str.isdigit
- islower Equivalent to built-in str.islower
- isupper Equivalent to built-in str.isupper
- isnumeric Equivalent to built-in str.isnumeric
- join Join strings in each element with the passed separator
- len Compute the length of each string
- lower, upper Convert cases
- match Determine whether each string matches a regular expression at its start
- pad Add whitespace to left, right, or both sides of strings
- repeat Duplicate each string a given number of times
- replace Replace occurrences of a pattern/regex with another string
- slice Slice each string in the Series
- split Split strings on a delimiter or regular expression
- strip Trim whitespace from both sides, including line breaks
- rstrip Trim whitespace on the right side
- lstrip Trim whitespace on the left side
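As a short sketch, a few of the Table 7-5 methods not shown above (len, pad, cat) applied to a similar email Series:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Rob': 'rob@gmail.com',
                  'Wes': np.nan})

print(data.str.len())                  # length of each string; NA stays NA
print(data.str.pad(20, side='left'))   # pad each string on the left to width 20
print(data.str.cat(sep=', '))          # join the non-NA values into one string
```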
Summary
Effective data preparation can significantly improve productivity by enabling you to spend more time analyzing data and less time getting it ready for analysis.
We have explored a number of tools in this chapter, but the coverage here is by no means comprehensive. In the next chapter, we will explore pandas's joining and grouping functionality.