Extracting Chinese and English Text from a String in Python

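Preface

Extracting Chinese and English text is a common task in data processing, and regular expressions are usually the most concise and efficient way to do it. Below are my notes; I hope they are useful to you.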

1. The sub function in re

Python's re module provides re.sub for replacing the parts of a string that match a pattern.

re.sub(pattern, repl, string, count=0)

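Parameters:

pattern: the regular expression pattern string
repl: the replacement string
string: the original string in which matches are replaced
count: the maximum number of replacements to make; defaults to 0, which replaces every match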
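
To make the effect of count concrete, here is a minimal sketch. The sample string and the variable names all_removed and first_removed are illustrative and not part of the original notes.

```python
import re

text = "a1b2c3"

# count=0 (the default) replaces every match.
all_removed = re.sub(r"[0-9]", "", text)

# count=1 stops after the first replacement.
first_removed = re.sub(r"[0-9]", "", text, count=1)

print(all_removed)    # abc
print(first_removed)  # ab2c3
```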

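1.1 Extracting Chinese

The idea: replace every character that is not Chinese with an empty string, so that only the Chinese remains.

For example: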

 
    import re

    text = "重出江湖hello的地H方。。的,world"
    # Remove ASCII letters, digits, the comma and the Chinese full stop
    text = re.sub(r"[A-Za-z0-9,。]", "", text)
    print(text)

Output: 重出江湖的地方的
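
The example above lists the characters to remove one by one, so any character left out of the list (a space or an exclamation mark, for instance) would survive. A negated character class expresses the stated idea more directly. This is only a sketch: it assumes that "Chinese" means the range \u4e00-\u9fa5 used elsewhere in these notes, and the name chinese_only is illustrative.

```python
import re

text = "重出江湖hello的地H方。。的,world"

# [^...] matches any character NOT in the class, so everything
# outside the range \u4e00-\u9fa5 is replaced with the empty string.
chinese_only = re.sub(r"[^\u4e00-\u9fa5]", "", text)

print(chinese_only)  # 重出江湖的地方的
```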

1.2 Extracting English

 
    import re

    text = "重123出江湖hello的地H方。。的,world"
    # Remove digits, Chinese characters, the comma and the Chinese full stop
    text = re.sub(r"[0-9\u4e00-\u9fa5,。]", "", text)
    print(text)

Output: helloHworld

1.3 Extracting digits

 
    import re

    text = "重123出江湖hello的地H方。。的,world"
    # Remove ASCII letters, Chinese characters, the comma and the Chinese full stop
    text = re.sub(r"[A-Za-z\u4e00-\u9fa5,。]", "", text)
    print(text)

Output: 123

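2. The findall function in re

findall finds every substring that matches the regular expression and returns them as a list; if nothing matches, it returns an empty list.

Syntax, as a method of a compiled pattern object (the module-level form is re.findall(pattern, string, flags=0)):

pattern.findall(string[, pos[, endpos]])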

Parameters:

string: the string to search.
pos: optional; the index at which the search starts, default 0.
endpos: optional; the index at which the search ends, default is the length of the string.
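
As an illustration of pos and endpos, the following sketch finds all runs of digits in a string, first over the whole string and then only within the first ten characters. The sample string and the name digits are illustrative.

```python
import re

digits = re.compile(r"\d+")
text = "run88oob123google456"

whole = digits.findall(text)            # search the entire string
first_ten = digits.findall(text, 0, 10) # search only text[0:10] == "run88oob12"

print(whole)      # ['88', '123', '456']
print(first_ten)  # ['88', '12']
```

Note that pos and endpos are only available on the compiled pattern's method; the module-level re.findall takes flags as its third argument instead.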

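Aside: re also provides match and search, which each return at most one match, whereas findall returns every match. See the Runoob tutorial (菜鳥教程) for details.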
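
The difference between the three functions can be seen in a short sketch (the sample string and variable names are illustrative): match only tries at the start of the string, search returns the first match anywhere, and findall collects all of them.

```python
import re

text = "abc123def456"

at_start = re.match(r"\d+", text)       # the string does not start with a digit
first_hit = re.search(r"\d+", text)     # first run of digits anywhere
all_hits = re.findall(r"\d+", text)     # every run of digits

print(at_start)           # None
print(first_hit.group())  # 123
print(all_hits)           # ['123', '456']
```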


2.1 Extracting Chinese
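
A minimal sketch, written by analogy with the regex example in 2.2 and assuming the same \u4e00-\u9fa5 range that section 1 uses for Chinese characters; the sample string is the one from 2.2.

```python
import re

dd = "重出123江湖hello的地方的,world"

# [\u4e00-\u9fa5] matches one Chinese character at a time;
# joining the matches rebuilds the Chinese-only string.
result = ''.join(re.findall(r'[\u4e00-\u9fa5]', dd))

print(result)  # 重出江湖的地方的
```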

2.2 Extracting English

Plain approach (without regex)

 
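    import string  # provides the lowercase letters a-z

    dd = "神的孩子hello在H唱歌,world"
    temp = ""  # collects the English letters
    letters = string.ascii_lowercase  # the lowercase letters a-z
    for ch in dd:  # walk through the string one character at a time
        if ch.lower() in letters:  # is it an English letter?
            temp += ch  # append it to the result
    print(temp)

Output: helloHworld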

Regex approach

 
    import re

    dd = "重出123江湖hello的地方的,world"
    # [A-Za-z] matches a single ASCII letter
    result = ''.join(re.findall(r'[A-Za-z]', dd))
    print(result)

Output: helloworld

2.3 Extracting digits

 
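    import re

    dd = "神123的孩子hello在唱H歌。。,world"
    # Note: do not put a backslash before 0-9. [\0-9] is read as the range
    # from the NUL character (\0) to 9, which also contains the comma.
    result = ''.join(re.findall(r'[0-9]', dd))
    print(result)

Output: 123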

3. The compile function in re

compile compiles a regular expression into a pattern object (Pattern), whose methods such as findall and sub can then be used.

Syntax:

re.compile(pattern[, flags])

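Parameters:

pattern: a regular expression in string form

flags: optional; the matching mode, such as ignoring case or multi-line mode. The available flags are:

re.I: ignore case
re.L: make \w, \W, \b, \B and case-insensitive matching depend on the current locale (in Python 3 this is allowed only with bytes patterns)
re.M: multi-line mode
re.S: make . match any character including a newline (by default . does not match a newline)
re.U: make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character properties database (already the default for str patterns in Python 3)
re.X: ignore whitespace and comments after # in the pattern, to improve readability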
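
A short sketch of three of these flags in action. The sample strings and variable names are illustrative and not taken from the original notes.

```python
import re

# re.I: "hello" matches regardless of case.
ignore_case = re.findall(r"hello", "Hello HELLO hello", re.I)

# Without re.S the dot does not match "\n"; with re.S it does.
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"a.b", "a\nb", re.S)

# re.M: ^ matches at the start of every line, not only at the start of the string.
line_starts = re.findall(r"^\w+", "one\ntwo", re.M)

print(ignore_case)  # ['Hello', 'HELLO', 'hello']
print(dot_default)  # []
print(dot_all)      # ['a\nb']
print(line_starts)  # ['one', 'two']
```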

3.1 Keeping Chinese, English and digits together while removing all other characters
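
One possible way to do this with a compiled pattern, sketched under the assumption that "Chinese" means the range \u4e00-\u9fa5 used in section 1. The names other_chars and result are illustrative, and the sample string is the one from 2.3.

```python
import re

# Matches any character that is NOT a Chinese character,
# an ASCII letter or an ASCII digit.
other_chars = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9]")

dd = "神123的孩子hello在唱H歌。。,world"

# Replace everything the pattern matches with the empty string.
result = other_chars.sub("", dd)

print(result)  # 神123的孩子hello在唱H歌world
```

Compiling once is mainly useful when the same pattern is applied to many strings, for example when cleaning every row of a dataset.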

Thanks: https://www.jb51.net/article/212177.htm
