[python] Writing a crawler: a simple text crawler

Reposted article. 2019-12-09 15:29. Tag: python

This is the first Python crawler I wrote myself. The script is as follows:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import re

import requests

# Table-of-contents page of the target novel
url = 'http://www.jingcaiyuedu8.com/novel/BaJoa2/list.html'
# Send an HTTP GET request, as a browser would
response = requests.get(url)
# Decode the response body as UTF-8
response.encoding = 'utf-8'
# HTML source of the novel's table-of-contents page
html = response.text
# Title of the novel
title = re.findall(r'<h1>(.*?)</h1>', html)[0]
# Extract each chapter's info as (file name, chapter title)
dl = re.findall(r'<dl class="panel-body panel-chapterlist">.*?</dl>', html, re.S)[0]
chapter_info_list = re.findall(r'href="/novel/BaJoa2/(.*?)">(.*?)<', dl)
# Create a file to hold the novel; the with-block closes it automatically
with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
    # Loop over the chapters and download each one
    for chapter_url, chapter_title in chapter_info_list:
        chapter_url = 'http://www.jingcaiyuedu8.com/novel/BaJoa2/%s' % chapter_url
        # Download the chapter page
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'utf-8'
        chapter_html = chapter_response.text
        # Extract the chapter body; findall returns a list, so join it into one string
        chapter_content = ''.join(re.findall(r'<br />&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<p>', chapter_html, re.S))
        # Clean up: turn the paragraph separators into newlines
        chapter_content = chapter_content.replace('\r<br />\r<br />&nbsp;&nbsp;&nbsp;&nbsp;', '\n')
        # Save the chapter to the file
        fb.write(chapter_title)
        fb.write('\n')
        fb.write(chapter_content)
        fb.write('\n' * 2)

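1. Approach to writing a crawler:

  Decide what to download, find the web page, and locate the content you need in that page. Then process the data and save it.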
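The steps above can be sketched as a few small functions. This is only a minimal sketch, not the structure of the script itself: the function names (`fetch`, `extract`, `clean`, `save`, `crawl`) are hypothetical, and `fetch` uses the standard library's `urllib.request` in place of `requests` so that the sketch has no third-party dependency.

```python
import re
import urllib.request


def fetch(url, encoding="utf-8"):
    """Download the page and decode it (the encoding is an assumption)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode(encoding, errors="replace")


def extract(html):
    """Pull the wanted content out of the HTML (here, the <h1> text)."""
    matches = re.findall(r"<h1>(.*?)</h1>", html, re.S)
    return matches[0] if matches else ""


def clean(text):
    """Process the data (here, collapse runs of whitespace)."""
    return re.sub(r"\s+", " ", text).strip()


def save(path, text):
    """Save the data as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text + "\n")


def crawl(url, path):
    """Run the whole pipeline for one page."""
    save(path, clean(extract(fetch(url))))
```

Keeping the download step separate from the parsing steps means `extract` and `clean` can be tried on a saved copy of the page without touching the network.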

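2. Key points:

  1) Identify the information you need on the page. After opening the page, press F12 to open the browser's developer tools.

In the Network tab you can see many requests. The text shown on the page is stored in an HTML document; click that document and open its Response view, which contains all of the text.

To locate the information you want to extract, press Ctrl+F and search for it, then note which distinctive strings appear immediately before and after it.

To extract hyperlinks, click the element-picker arrow at the far left of the toolbar and then click the link on the page. The Elements panel jumps to the markup for that link, from which you can decide what to extract. For the novel download, the chapter links and chapter titles are extracted from the table-of-contents page.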
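As an illustration of the link extraction, the snippet below runs patterns like the script's on a small HTML fragment. The fragment is invented for this example and only imitates the structure the patterns expect; the real page's markup may differ.

```python
import re

# Hypothetical fragment imitating the table-of-contents markup
sample_html = """
<h1>Sample Novel</h1>
<dl class="panel-body panel-chapterlist">
<dd><a href="/novel/BaJoa2/1.html">Chapter 1</a></dd>
<dd><a href="/novel/BaJoa2/2.html">Chapter 2</a></dd>
</dl>
<a href="/novel/BaJoa2/other.html">Unrelated link</a>
"""

# First narrow the search to the chapter list, so links elsewhere are ignored
dl = re.findall(r'<dl class="panel-body panel-chapterlist">.*?</dl>', sample_html, re.S)[0]

# Then pull (file name, chapter title) pairs out of the list
chapters = re.findall(r'href="/novel/BaJoa2/(.*?)">(.*?)<', dl)

for page, title in chapters:
    print(page, title)
```

Restricting the second search to the `<dl>` block is what keeps the unrelated link out of the result.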

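  2) Mind the encoding

Set response.encoding to the character set the page actually uses. The script sets it to utf-8 for the target page, but many Chinese pages are encoded in GBK. If the encoding is not set, or is set to the wrong one, the text comes out garbled.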
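The effect of a mismatched encoding can be reproduced without any network access. The sketch below uses an arbitrary sample string and plain `bytes.decode`; it does not involve `requests` at all.

```python
text = "第一章"

# The bytes a GBK-encoded page would send for this text
gbk_bytes = text.encode("gbk")

# Decoding with the matching codec restores the text
right = gbk_bytes.decode("gbk")

# Decoding with the wrong codec yields garbled characters;
# errors="replace" substitutes U+FFFD instead of raising UnicodeDecodeError
wrong = gbk_bytes.decode("utf-8", errors="replace")

print(right == text)
print(wrong == text)
```

With `requests`, the same choice is made by assigning `response.encoding` before reading `response.text`. When the page's encoding is not known, `response.apparent_encoding` offers a guess based on the response body.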

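  3) Regular-expression matching

In a raw string r'...', backslashes are not interpreted by Python, so the pattern reaches the regex engine as written. Characters that have a special meaning in a regex, such as ( and ), still need a preceding "\" to be matched literally.

.*? matches any run of characters non-greedily, that is, as few as possible before the rest of the pattern matches. Without (), findall returns the whole match, including the delimiters before and after it. With (), it returns only what the group matched.

The trailing re.S flag makes . match every character, including newlines.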
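These rules can be checked on a short string. The sample text below is made up for the purpose; only standard `re` calls are used.

```python
import re

html = "<p>one</p>\n<p>two\nlines</p> (note)"

# Greedy .* runs to the last </p>; non-greedy .*? stops at the first one
greedy = re.findall(r"<p>.*</p>", "<p>a</p><p>b</p>")
lazy = re.findall(r"<p>.*?</p>", "<p>a</p><p>b</p>")

# Without a group findall returns whole matches; with a group, only the group
whole = re.findall(r"<p>.*?</p>", html)
inner = re.findall(r"<p>(.*?)</p>", html)

# re.S lets . cross newlines, so the two-line paragraph is matched as well
inner_s = re.findall(r"<p>(.*?)</p>", html, re.S)

# Escaped parentheses match literal ( and ) instead of opening a group
literal = re.findall(r"\(.*?\)", html)

print(greedy, lazy, whole, inner, inner_s, literal, sep="\n")
```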

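  4) Using replace

replace is a string method, but re.findall returns a list, so replace cannot be called on the result directly. The list has to be turned into a string first. Note that str() on a list produces the list's printed form, with brackets, quotes, and escaped characters such as \r, so joining the items with ''.join() gives cleaner text.

The regex function re.sub also performs string replacement and can be used instead of replace.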
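The difference between the conversions can be seen on a fragment shaped like a chapter page. The fragment is invented for this example; the separator string is the one the script replaces.

```python
import re

chapter_html = "<br />&nbsp;&nbsp;&nbsp;&nbsp;First.\r<br />\r<br />&nbsp;&nbsp;&nbsp;&nbsp;Second.<p>"
sep = "\r<br />\r<br />&nbsp;&nbsp;&nbsp;&nbsp;"

parts = re.findall(r"<br />&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<p>", chapter_html, re.S)

# str() on a list gives its printed form: brackets, quotes, and an escaped \r
as_repr = str(parts)

# Joining the items gives the plain text, ready for replace
as_text = "".join(parts)
with_replace = as_text.replace(sep, "\n")

# re.sub does the same replacement; re.escape guards the regex metacharacters in sep
with_sub = re.sub(re.escape(sep), "\n", as_text)

print(as_repr)
print(with_replace)
```

For a fixed separator like this one, `replace` is the simpler tool. `re.sub` becomes useful when the separator varies, for example a differing number of `&nbsp;` entities.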

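3. To be continued

  The site used in the script already has anti-crawling measures: when too many connections come from the same IP address, it blocks that IP. This post covers only the most basic crawler; further updates will follow.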
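One common way to lower the connection rate is to pause between requests. The sketch below is an assumption about what might help with a per-IP limit, not something the script does, and it is no guarantee against being blocked. The names `download_all`, `fetch`, and `delay_seconds` are hypothetical.

```python
import time


def download_all(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests to limit the request rate."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

Here `fetch` is any callable that takes a URL and returns the page text, for example a small wrapper around `requests.get`.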
