Python Web Crawler Notes (3): Downloading cnblogs Posts to Word Documents

Reposted from the original post, 2018-04-10 18:36. Category: Python web crawlers

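(1) Overview

This builds on the code from the previous post: it now uses lxml to extract the body of each cnblogs post and saves it to a Word document.

Working with Word documents requires the following module: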

pip install python-docx

Modified code (the main change is the following block, added inside the while loop of link_crawler()):

        tree = lxml.html.fromstring(html)  # parse the HTML into a uniform tree
        title = tree.xpath('//a[@id="cb_post_title_url"]')  # post title
        the_file = tree.xpath('//div[@id="cnblogs_post_body"]/p')  # body paragraphs
        pre = tree.xpath('//pre')  # code blocks (inserted with the cnblogs "insert code" feature)
        img = tree.xpath('//div[@id="cnblogs_post_body"]/p/img/@src')  # image URLs
        # Change the working directory (raw string, so the backslashes are not read as escapes)
        os.chdir(r'F:\Python\worm\博客園文件')
        # Create a new, empty Word document
        doc = docx.Document()
        # Add the title
        doc.add_heading(title[0].text_content(), 0)
        for i in the_file:
            # Add each paragraph (the content of each p tag) to the Word document
            doc.add_paragraph(i.text_content())
        # Add the code blocks to the document
        for p in pre:
            doc.add_paragraph(p.text_content())
        # Add the images to the Word document
        for i in img:
            ure.urlretrieve(i, '0.jpg')
            doc.add_picture('0.jpg')
        # Use the first 8 characters of the title as the Word file name
        filename = title[0].text_content()[:8] + '.docx'
        # Save the Word document.
        # If the file name already exists, save as title[0].text_content()[:8] + str(x) + '.docx';
        # otherwise save as filename
        if filename in os.listdir(r'F:\Python\worm\博客園文件'):
            doc.save(title[0].text_content()[:8] + str(x) + '.docx')
            x += 1
        else:
            doc.save(filename)
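The duplicate check above tries only one numbered name, so the numbered file could itself already exist, and a title may contain characters that Windows does not allow in file names. The following is a minimal sketch of a helper that keeps counting until it finds a free name. The function `unique_filename` and its parameters are hypothetical names used for illustration; they are not part of the original code.

```python
import os
import re
import tempfile


def unique_filename(title, directory, max_len=8, ext='.docx'):
    """Return a file name based on title that does not yet exist in directory."""
    # Replace whitespace and characters that Windows forbids in file names.
    stem = re.sub(r'[\\/:*?"<>|\s]+', '_', title.strip())[:max_len] or 'untitled'
    existing = set(os.listdir(directory))
    candidate = stem + ext
    counter = 0
    # Keep appending a counter until the name is free.
    while candidate in existing:
        candidate = '{}{}{}'.format(stem, counter, ext)
        counter += 1
    return candidate


if __name__ == '__main__':
    # Demonstrate with a temporary directory instead of the real output folder.
    with tempfile.TemporaryDirectory() as folder:
        for _ in range(3):
            name = unique_filename('Python: crawler notes', folder)
            open(os.path.join(folder, name), 'w').close()
            print(name)
```

In the crawler, the result could be passed straight to doc.save(), which would also make the separate counter x unnecessary.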

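(2) Full code (delayed.py is omitted here; it is the same as in the previous post)

It is best to set a fairly generous rate limit. The value in the statement below is in seconds.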

waitFor = WaitFor(2)
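Since delayed.py is not reproduced in this post, here is a rough idea of what a per-domain throttle with the same wait(url) interface could look like. This is only a sketch: the class name `SimpleWaitFor` and its internals are assumptions, and the real WaitFor from the previous post may work differently.

```python
import time
from urllib.parse import urlparse


class SimpleWaitFor:
    """Enforce a minimum delay between two requests to the same domain."""

    def __init__(self, delay):
        self.delay = delay      # minimum interval in seconds
        self.last_access = {}   # domain -> time of the previous request

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_access.get(domain)
        if self.delay > 0 and last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                # The previous request was too recent, so sleep off the rest.
                time.sleep(remaining)
        self.last_access[domain] = time.monotonic()


if __name__ == '__main__':
    throttle = SimpleWaitFor(0.5)
    start = time.monotonic()
    # Hypothetical page URLs on the same domain; nothing is downloaded here.
    for page in ('https://www.cnblogs.com/a.html', 'https://www.cnblogs.com/b.html'):
        throttle.wait(page)
    print('elapsed: {:.2f}s'.format(time.monotonic() - start))
```

The first call for a domain returns immediately; only later calls to the same domain are delayed, so a crawl that spans several sites is not slowed down more than necessary.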

import urllib.request as ure
import re
import urllib.parse
from delayed import WaitFor
import lxml.html
import os
import docx

# Download a web page and return its HTML (dynamically loaded content cannot be fetched)
def download(url, user_agent='FireDrich', num=2):
    print('Downloading: ' + url)
    # Set the user agent
    headers = {'User-Agent': user_agent}
    request = ure.Request(url, headers=headers)
    try:
        # Download the page
        html = ure.urlopen(request).read()
    except ure.URLError as e:
        # e.reason is not always a string, so convert it before concatenating
        print('Download failed: ' + str(e.reason))
        html = None
        if num > 0:
            # On a 5XX error, call download() recursively to retry, at most 2 times
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num - 1)
    return html

# seed_url: the starting URL, e.g. https://www.cnblogs.com/
# link_regex: a regular expression
# Purpose: extract every link that matches link_regex and download it
def link_crawler(seed_url, link_regex):
    html = download(seed_url)
    crawl_queue = []
    # Iterate over the list returned by get_links() and queue the links that match link_regex
    for link in get_links(html):
        if re.match(link_regex, link):
            # Join https://www.cnblogs.com/ and /cate/...
            link = urllib.parse.urljoin(seed_url, link)
            # Only add links that are not already queued
            if link not in crawl_queue:
                crawl_queue.append(link)
    x = 0
    # Use WaitFor.wait() to throttle downloads: if less than 2 seconds have passed since the
    # previous download, wait until 2 seconds have elapsed; otherwise continue immediately
    waitFor = WaitFor(2)
    # Download every page in crawl_queue
    while crawl_queue:
        # Remove and return the last item in the list
        url = crawl_queue.pop()
        waitFor.wait(url)
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        tree = lxml.html.fromstring(html)  # parse the HTML into a uniform tree
        title = tree.xpath('//a[@id="cb_post_title_url"]')  # post title
        if not title:
            continue  # no title element, so this is not a post page
        the_file = tree.xpath('//div[@id="cnblogs_post_body"]/p')  # body paragraphs
        pre = tree.xpath('//pre')  # code blocks (inserted with the cnblogs "insert code" feature)
        img = tree.xpath('//div[@id="cnblogs_post_body"]/p/img/@src')  # image URLs
        # Change the working directory (raw string, so the backslashes are not read as escapes)
        os.chdir(r'F:\Python\worm\博客園文件')
        # Create a new, empty Word document
        doc = docx.Document()
        # Add the title
        doc.add_heading(title[0].text_content(), 0)
        for i in the_file:
            # Add each paragraph (the content of each p tag) to the Word document
            doc.add_paragraph(i.text_content())
        # Add the code blocks to the document
        for p in pre:
            doc.add_paragraph(p.text_content())
        # Add the images to the Word document
        for i in img:
            ure.urlretrieve(i, '0.jpg')
            doc.add_picture('0.jpg')
        # Use the first 8 characters of the title as the Word file name
        filename = title[0].text_content()[:8] + '.docx'
        # Save the Word document.
        # If the file name already exists, save as title[0].text_content()[:8] + str(x) + '.docx';
        # otherwise save as filename
        if filename in os.listdir(r'F:\Python\worm\博客園文件'):
            doc.save(title[0].text_content()[:8] + str(x) + '.docx')
            x += 1
        else:
            doc.save(filename)

# Take the downloaded HTML (bytes) and return all links as a list
def get_links(html):
    if html is None:
        return []  # the download failed, so there is nothing to parse
    # Use a regular expression to extract every link in the HTML
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    html = html.decode('utf-8')
    # Return all links as a list
    return webpage_regex.findall(html)

link_crawler('https://www.cnblogs.com/cate/python/', r'.*/www.cnblogs.com/.*?\.html$')
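To see the link-collection step in isolation, the sketch below applies the same href pattern to a small HTML string, with no network access. The function name `collect_links` and the sample URLs are made up for illustration. One deliberate difference from the code above: the link is joined with the seed URL before it is matched, so relative links can also pass the filter.

```python
import re
from urllib.parse import urljoin

# Same idea as get_links(): capture the href value of every <a> tag.
LINK_PATTERN = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)


def collect_links(html_text, seed_url, link_regex):
    """Return absolute, de-duplicated links from html_text that match link_regex."""
    queue = []
    for link in LINK_PATTERN.findall(html_text):
        # Turn relative links into absolute ones before filtering.
        absolute = urljoin(seed_url, link)
        if re.match(link_regex, absolute) and absolute not in queue:
            queue.append(absolute)
    return queue


if __name__ == '__main__':
    # Hypothetical page content; the post URLs are invented.
    sample = (
        '<a href="https://www.cnblogs.com/demo/p/1.html">post</a>'
        '<a href="/demo/p/2.html">relative post</a>'
        '<a href="https://www.cnblogs.com/demo/p/1.html">duplicate</a>'
        '<a href="https://www.cnblogs.com/cate/python/">category</a>'
    )
    pattern = r'.*/www\.cnblogs\.com/.*?\.html$'
    for url in collect_links(sample, 'https://www.cnblogs.com/cate/python/', pattern):
        print(url)
```

The duplicate link and the category page are dropped, which mirrors what the crawl_queue checks in link_crawler() are meant to achieve.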

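(3) Results

(4) Known issues

(1) Code blocks are appended after all of the body text, so posts that use the cnblogs insert-code feature do not keep their original layout.

(2) Images are inserted after the code blocks, so posts that contain images do not keep their original layout either.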
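Both issues come from collecting paragraphs, code blocks and images in three separate passes. One possible way around this, sketched here under assumptions, is to walk the direct children of the post body once and emit each piece in document order. The function `ordered_blocks` is a hypothetical helper, not part of the original code. It relies only on iteration, .tag, .get(), .iter() and .itertext(), which both lxml elements and the standard library's ElementTree elements provide, so the demo uses ElementTree on a well-formed stand-in for the post body.

```python
import xml.etree.ElementTree as ET


def ordered_blocks(body):
    """Yield ('text' | 'code' | 'image', value) pairs in document order."""
    for child in body:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        # A code block is either a <pre> itself or a wrapper that contains one.
        pres = list(child.iter('pre'))
        if pres:
            for pre in pres:
                yield 'code', ''.join(pre.itertext())
            continue
        # Images inside this block come before its text (an approximation).
        for img in child.iter('img'):
            src = img.get('src')
            if src:
                yield 'image', src
        text = ''.join(child.itertext()).strip()
        if text:
            yield 'text', text


if __name__ == '__main__':
    # In the crawler this element would come from
    # tree.xpath('//div[@id="cnblogs_post_body"]')[0]; the markup here is invented.
    body = ET.fromstring(
        '<div id="cnblogs_post_body">'
        '<p>First paragraph.</p>'
        '<div class="cnblogs_code"><pre>print("hello")</pre></div>'
        '<p><img src="https://example.com/a.png"/></p>'
        '<p>Last paragraph.</p>'
        '</div>'
    )
    for kind, value in ordered_blocks(body):
        print(kind, value)
```

Each 'text' or 'code' pair could then go to doc.add_paragraph(), and each 'image' pair could be downloaded and passed to doc.add_picture(), in the order in which they are yielded.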
