Regular Expression Practice

Reposted article. 2012-11-22 22:42. Tags: regular expressions / Python

Extracting text from a web page

This experiment uses content from www.17k.com and draws on the blog post at http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html.

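from urllib.request import urlopen
import re

# Compile the regular expression into a Pattern object.
# re.S (DOTALL) makes '.' match any character, including newlines. Without it
# the pattern finds nothing, presumably because the chapter markup spans
# several lines.
p = re.compile(r'<div class="p" id="chapterContent">(.*?)<p class="recent_read"', re.S)

# Read the content from the given URL.
# Assumption: the page is UTF-8 encoded; change the codec if the site uses another one.
text = urlopen('http://www.17k.com/chapter/317131/7299531.html').read().decode('utf-8', errors='replace')

# findall returns every matching substring as a list; join them into one string.
content = ''.join(p.findall(text))

# sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
# Replaces each match in string with repl and returns the resulting string.
# When repl is a string, it may reference groups with \id, \g<id>, or \g<name>.
# The whole match is written \g<0>, because \0 is not a group reference.
# When repl is a function, it takes a single argument (the Match object) and
# returns the replacement string. That string is used literally, so group
# references inside it are not expanded.
# count sets the maximum number of replacements; by default all matches are replaced.
# The pattern below matches the start of the text and every <br>, so the output
# begins with a newline. The group name 'pre' is never referenced.
p1 = re.compile('(?P<pre>^|<br>)')
print(p1.sub(r'\n', content))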
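
The notes on sub in the listing above can be checked offline with a small sketch. It uses Python 3 syntax; the sample strings and the helper name `shout` are made up for illustration and do not come from the original post.

```python
import re

sample = 'line one<br>line two<br>line three'  # made-up sample text

# A string repl may reference groups by name (\g<name>) or by number (\1).
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)', r'\g<second> \1', 'hello world')
# swapped == 'world hello'

# \g<0> stands for the whole match; count=1 stops after the first replacement.
wrapped = re.sub(r'<br>', r'[\g<0>]', sample, count=1)
# wrapped == 'line one[<br>]line two<br>line three'

# A function repl receives the Match object and returns the replacement text.
def shout(match):
    return match.group(0).upper()

upper = re.sub(r'line', shout, sample)
# upper == 'LINE one<br>LINE two<br>LINE three'

# The returned string is used literally: '\1' is not expanded here.
literal = re.sub(r'(a)', lambda m: r'\1', 'abc')

# Without count, every match is replaced; with count=1, only the first.
limited = re.sub(r'<br>', '\n', sample, count=1)
```

Passing a function is mainly useful when the replacement depends on what was matched; for a fixed replacement a plain string is simpler.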

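from urllib.request import urlopen
import re

p = re.compile(r'<div class="p" id="chapterContent">(.*?)<p class="recent_read"', re.S)

# Assumption: the page is UTF-8 encoded; change the codec if the site uses another one.
text = urlopen('http://www.17k.com/chapter/317131/7299531.html').read().decode('utf-8', errors='replace')

content = ''.join(p.findall(text))

# '<br>' is a fixed string, so plain str.replace is enough; no regex is needed.
content = content.replace('<br>', '\n')

print(content)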
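
To try the extraction without touching the network, here is a minimal sketch in Python 3 syntax. It runs on a made-up HTML fragment that only imitates the markup the pattern expects; the fragment and all variable names are assumptions, not the actual page. It shows why re.S matters and how the two replacement approaches differ.

```python
import re

# Hypothetical fragment imitating the structure the pattern targets.
html = (
    '<div class="p" id="chapterContent">First paragraph.<br>\n'
    'Second paragraph.<br>Third paragraph.'
    '<p class="recent_read">ignored</p></div>'
)

pattern_src = r'<div class="p" id="chapterContent">(.*?)<p class="recent_read"'

# Without re.S, '.' cannot cross the newline, so nothing matches.
without_dotall = re.findall(pattern_src, html)

# With re.S, '.' also matches newlines and the chapter body is captured.
with_dotall = re.findall(pattern_src, html, re.S)

content = ''.join(with_dotall)

# Plain string replacement and a simple regex give the same result.
via_replace = content.replace('<br>', '\n')
via_sub = re.sub(r'<br>', '\n', content)

# The pattern with '^' also matches at the start of the text,
# so it additionally prepends one newline.
via_caret = re.sub(r'(?P<pre>^|<br>)', r'\n', content)

print(via_replace)
```

For a fixed token such as `<br>`, str.replace is the simpler choice; a regex becomes worthwhile only if variants of the tag need to be covered.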
