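Python Regular Expressions and JSON: Matching Digits, Non-Digits, Word Characters, Non-Word Characters, Greedy Mode, Non-Greedy Mode, Specifying Match Counts, and More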


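1. Regular expressions: here they are studied mainly for web crawling, where they are an essential tool.

Regular expressions are used for string matching, for example checking whether a string is a phone number, an email address, an IP address, and so on.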
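As a rough illustration of that kind of check, here is a minimal sketch. The two patterns and the helper names (looks_like_phone, looks_like_ipv4) are assumptions made up for this example; they are deliberately simplified and are not production-grade validators.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Assumed phone format: 3 digits, 4 digits, 4 digits, separated by hyphens.
PHONE_PATTERN = re.compile(r"\d{3}-\d{4}-\d{4}")
# Four dot-separated groups of 1-3 digits; the 0-255 range is NOT checked.
IPV4_PATTERN = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def looks_like_phone(text):
    """Return True if the whole string has the assumed phone format."""
    return PHONE_PATTERN.fullmatch(text) is not None


def looks_like_ipv4(text):
    """Return True if the whole string has the shape of a dotted quad."""
    return IPV4_PATTERN.fullmatch(text) is not None


print(looks_like_phone("138-0000-0000"))  # True
print(looks_like_phone("13800000000"))    # False (no hyphens)
print(looks_like_ipv4("192.168.0.1"))     # True
print(looks_like_ipv4("192.168.0"))       # False (only three groups)
```

re.fullmatch is used instead of re.findall because a validity check should cover the entire string, not just find a matching fragment somewhere inside it.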

2. JSON: the mainstream format for exchanging data with external systems.

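3. Using regular expressions

re is Python's built-in module for regular-expression matching.

re.findall(pattern,source)
pattern: the matching rule, also called the regular expression
source: the target string to search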

import re
a = "C0C++7Java8C#Python6JavaScript"
# find every occurrence of the literal text "Java"
res = re.findall("Java", a)
print(res)
# [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
# ['Java', 'Java']

4. Applying regular expressions

  • Finding digits
  • Using the shorthand character class \d:
    import re
    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall(r"\d", a)
    print(res)
    # Project/python_ToolCodes/test10.py"
    # ['0', '7', '8', '6']


    Using the other matching style, a character set: [0-9]
    import re
    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall("[0-9]", a)
    print(res)
    # Project/python_ToolCodes/test10.py"
    # ['0', '7', '8', '6']

    Here "Java" consists of literal (ordinary) characters, while "\d" is a metacharacter.
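Literal characters and metacharacters can appear in the same pattern. The following small sketch reuses the sample string from above; the variable names are chosen only for this example.

```python
import re

a = "C0C++7Java8C#Python6JavaScript"

# Literal "Java" followed by one digit (the metacharacter \d).
java_with_digit = re.findall(r"Java\d", a)
print(java_with_digit)  # ['Java8']

# Any single ASCII letter followed by one digit.
letter_then_digit = re.findall(r"[A-Za-z]\d", a)
print(letter_then_digit)  # ['C0', 'a8', 'n6']
```

"JavaScript" is not matched by the first pattern because the character after "Java" is "S", not a digit.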

  • Finding non-digits
  • Using the shorthand character class \D:
    import re
    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall(r"\D", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['C', 'C', '+', '+', 'J', 'a', 'v', 'a', 'C', '#', 'P', 'y', 't', 'h', 'o', 'n', 'J', 'a', 'v', 'a', 'S', 'c', 'r', 'i', 'p', 't']

    Using the other matching style, a negated character set: [^0-9]
    import re
    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall("[^0-9]", a)
    print(res)
    # Project/python_ToolCodes/test10.py"
    # ['C', 'C', '+', '+', 'J', 'a', 'v', 'a', 'C', '#', 'P', 'y', 't', 'h', 'o', 'n', 'J', 'a', 'v', 'a', 'S', 'c', 'r', 'i', 'p', 't']
     
              

     

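  • A full list of regular-expression syntax: https://baike.baidu.com/item/正則表達式/1700215?fr=aladdin. There is no need to practice every item one by one; look them up when you need them.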

5. Matching modes

  • Mixing metacharacters and literal characters
OR inside []:
    # coding=utf-8
    import re
    a = "abc,acc,adc,aec,afc,ahc"
    # match acc and afc
    res = re.findall("a[cf]c", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['acc', 'afc']
Negation: ^
    # coding=utf-8
    import re
    a = "abc,acc,adc,aec,afc,ahc"
    # match everything except acc and afc
    res = re.findall("a[^cf]c", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['abc', 'adc', 'aec', 'ahc']
Range: -
    # coding=utf-8
    import re
    a = "abc,acc,adc,aec,afc,ahc"
    # match acc, adc, aec, afc (middle character in the range c to f)
    res = re.findall("a[c-f]c", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['acc', 'adc', 'aec', 'afc']
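The three set operations (OR, negation, range) can also be combined inside a single pair of brackets. A short sketch on the same sample string, with variable names chosen only for this example:

```python
import re

a = "abc,acc,adc,aec,afc,ahc"

# Negated range: the middle character must NOT be in c..f.
outside_range = re.findall(r"a[^c-f]c", a)
print(outside_range)  # ['abc', 'ahc']

# Range plus a single extra character: middle character is b..d or h.
range_or_h = re.findall(r"a[b-dh]c", a)
print(range_or_h)  # ['abc', 'acc', 'adc', 'ahc']
```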
 
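  • Matching digits and letters:
  • Using the shorthand character class \w:
    import re
    a = "abc&cba"
    res = re.findall(r"\w", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['a', 'b', 'c', 'c', 'b', 'a']

    Using a character set: [A-Za-z0-9]
    import re
    a = "abc123&cba321"
    res = re.findall("[A-Za-z0-9]", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['a', 'b', 'c', '1', '2', '3', 'c', 'b', 'a', '3', '2', '1']

    Clearly, \w does not match characters that are neither letters nor digits, such as the "&" symbol. One caveat: \w also matches the underscore, so for ASCII text it is closer to [A-Za-z0-9_] than to [A-Za-z0-9].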
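The difference between \w and [A-Za-z0-9] only shows up when the text contains an underscore. A small check, using a sample string invented for this example:

```python
import re

a = "abc_123&cba"  # sample string made up for this comparison

word_chars = re.findall(r"\w", a)
alnum_only = re.findall(r"[A-Za-z0-9]", a)
with_underscore = re.findall(r"[A-Za-z0-9_]", a)

print(word_chars)       # ['a', 'b', 'c', '_', '1', '2', '3', 'c', 'b', 'a']
print(alnum_only)       # ['a', 'b', 'c', '1', '2', '3', 'c', 'b', 'a']
print(word_chars == with_underscore)  # True
```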

  • Matching non-word characters (neither letters nor digits)
    Using the shorthand character class \W:
    import re
    a = "abc123&cba321"
    res = re.findall(r"\W", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['&']

    Using a negated character set: [^A-Za-z0-9]
    import re
    a = "abc123&cba321"
    res = re.findall("[^A-Za-z0-9]", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['&']
     
              

     

  • Matching whitespace such as spaces, tabs, and newlines: \s
    import re
    a = "python 111\tjava&67p\nh\rp"
    res = re.findall(r"\s", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # [' ', '\t', '\n', '\r']
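Just as \D and \W are the complements of \d and \w, \S is the complement of \s and matches any non-whitespace character. The sketch below uses it, together with the + quantifier (one or more), to pull out the whitespace-separated pieces of the same string; re.split on \s+ gives the same result here.

```python
import re

a = "python 111\tjava&67p\nh\rp"

# One or more consecutive non-whitespace characters.
tokens = re.findall(r"\S+", a)
print(tokens)  # ['python', '111', 'java&67p', 'h', 'p']

# Splitting on runs of whitespace yields the same pieces for this string.
pieces = re.split(r"\s+", a)
print(pieces == tokens)  # True
```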

     

  • Quantifiers: matching python, java, and php

  • Exactly three characters per group:
    [a-z]{3}
    import re
    a = "python 1111java678php"
    res = re.findall("[a-z]{3}", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['pyt', 'hon', 'jav', 'php']


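    Three to six characters per group, since the longest word (python) has 6 letters and the shortest (php) has 3:
    [a-z]{3,6}

    import re
    a = "python 1111java678php"
    res = re.findall("[a-z]{3,6}", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['python', 'java', 'php']
    Question: three characters are already enough for a match, so why does matching not stop at "pyt"?
    Because quantifiers come in greedy and non-greedy forms, and Python's quantifiers are greedy by default: they take as many characters as they can, up to the upper bound.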

    To use non-greedy mode, add a question mark after the quantifier:
    [a-z]{3,6}?
    import re
    a = "python 1111java678php"
    res = re.findall("[a-z]{3,6}?", a)
    print(res)
    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['pyt', 'hon', 'jav', 'php']
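The greedy and non-greedy behavior is easiest to see on a single match. This sketch uses re.match, which returns only the first match at the start of the string; the variable names are chosen for this example.

```python
import re

word = "python"

# Greedy: take as many characters as allowed (up to 6).
greedy = re.match(r"[a-z]{3,6}", word).group()
print(greedy)  # 'python'

# Non-greedy: stop as soon as the minimum (3) is satisfied.
lazy = re.match(r"[a-z]{3,6}?", word).group()
print(lazy)  # 'pyt'
```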

     

     

  • * matches the character before it (here 'n') zero or more times
    import re
    a = "pytho0python1pythonn2"
    res = re.findall("python*", a)
    print(res)

    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['pytho', 'python', 'pythonn']

    "pytho" has no n, so n is matched 0 times and "pytho" is returned; "python" has one n, so n is matched once and "python" is returned; "pythonn" has two n's, so n is matched twice and "pythonn" is returned.

  • + matches the character before it (here 'n') one or more times
    import re
    a = "pytho0python1pythonn2"
    res = re.findall("python+", a)
    print(res)

    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['python', 'pythonn']

     

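  • ? matches the character before it (here 'n') zero times or once
    import re
    a = "pytho0python1pythonn2"
    res = re.findall("python?", a)
    print(res)

    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['pytho', 'python', 'python']

    "pytho" has no n, so n is matched 0 times and "pytho" is returned; "python" has one n, so n is matched once and "python" is returned; "pythonn" has two n's, but n can be matched at most once, so the extra n is cut off and the result is "python" rather than "pythonn". Note that this ? is a quantifier in its own right (zero or one) and is itself greedy; ? only switches on non-greedy mode when it is placed after another quantifier, as in [a-z]{3,6}?.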
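Because ? plays two different roles depending on where it sits, a side-by-side comparison may help. This is a small sketch on a sample string chosen for this example:

```python
import re

a = "pythonn"

# ? directly after a character: a quantifier meaning "zero or one n".
zero_or_one = re.findall(r"python?", a)
print(zero_or_one)  # ['python']

# + alone is greedy: it takes every n it can.
greedy_plus = re.findall(r"python+", a)
print(greedy_plus)  # ['pythonn']

# ? after another quantifier makes that quantifier non-greedy.
lazy_plus = re.findall(r"python+?", a)
print(lazy_plus)  # ['python']  (one n is the minimum for +)

lazy_star = re.findall(r"python*?", a)
print(lazy_star)  # ['pytho']   (zero n is the minimum for *)
```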

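  • If you do not want the unlimited repetition of * and + and would rather give an explicit range for the number of repetitions, write it like this:
    python{1,2}
    This means n is matched at least once and at most twice. The quantifier is still greedy, so it prefers two n's when they are available.
    import re
    a = "pytho0python1pythonn2"
    res = re.findall("python{1,2}", a)
    print(res)

    # [Running] python -u "/Users/anson/Documents/Project/python_ToolCodes/test10.py"
    # ['python', 'pythonn']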
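The braces notation also covers the other quantifiers as special cases: {0,1} behaves like ?, {1,} like +, and {m} demands an exact count. A brief sketch on the same sample string, with variable names chosen for this example:

```python
import re

a = "pytho0python1pythonn2"

# Exactly two n's.
exactly_two = re.findall(r"python{2}", a)
print(exactly_two)  # ['pythonn']

# One or more n's: same result as python+
one_or_more = re.findall(r"python{1,}", a)
print(one_or_more)  # ['python', 'pythonn']

# Zero or one n: same result as python?
zero_or_one = re.findall(r"python{0,1}", a)
print(zero_or_one)  # ['pytho', 'python', 'python']
print(zero_or_one == re.findall(r"python?", a))  # True
```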

 

