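1. Objectives

Understand the basic principles of how a web crawler works and use one to complete a set of tasks.

1.1 Write a crawler script that runs correctly
1.2 Batch-crawl text articles from a single website
1.3 Save the articles to a MySQL database and to the project folder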

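2. Background knowledge

2.1 Python basics

Python 3 basic string operations |
Python 3 file operations |
Python 3 os operations

2.2 Python crawler basics

BeautifulSoup |
Introduction to Python crawlers

2.3 Using pymysql

Python mysql-connector driver |
pymysql operations

2.4 Other topics

Problems encountered |
bs4.select()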

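3. Crawler implementation

3.1 Initial crawler implementation

(1) We crawl the 中國化工市場機械網 website (roughly, China Chemical Market Machinery Network). The relevant code is shown below.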

    import requests
    from bs4 import BeautifulSoup

    # addresses[i] is the URL of the article being fetched
    res = requests.get(addresses[i])
    # Decode as GB18030, which is compatible with the site's encoding (gb2312)
    res.encoding = 'GB18030'
    # 'html.parser' tells BeautifulSoup to parse the text as HTML
    soup = BeautifulSoup(res.text, 'html.parser')
    article_content = soup.select(
        '#NewsMainLeft > div.mainBox.MarginTop10.articleBox > div.article > div.ArticleMatter')
    article_title = soup.select(
        '#NewsMainLeft > div.mainBox.MarginTop10.articleBox > div.article > div.articleTitle > h1')

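The selector passed to select() can be copied from Chrome's developer tools: select the element, right-click, choose Copy -> Copy selector, and paste the result into select(). This is more precise than writing the selector by hand.

The scraped fields can be printed with print().

However, this only crawls the content at a single URL. Crawling the site's articles in batches this way would mean changing the URL by hand each time, which is impractical.

(2) I found two ways to crawl in batches

The first is to find the pattern in the URLs and change the URL on each loop iteration. In practice the URLs sometimes break the pattern, which causes errors at run time, so I do not recommend this method.
The second is to also scrape the href attribute of the "next article" <a> tag on each page and use it as the address for the next loop iteration, as in the following code: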

    next_address = soup.select(
        '#NewsMainLeft > div.mainBox.MarginTop10.articleBox > div.article > div.arNext > a[href]')
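
The second approach can be sketched as a loop that follows each page's "next article" link until none is left. This is a minimal sketch rather than the author's full script: `crawl`, `fetch_page`, and `max_pages` are hypothetical names, and `fetch_page` stands in for the requests and BeautifulSoup code above, returning the article text plus the raw href of the next link (or None).

```python
from urllib.parse import urljoin


def crawl(start_url, fetch_page, max_pages=100):
    """Follow 'next article' links starting from start_url.

    fetch_page(url) is a caller-supplied function (hypothetical here) that
    returns (article_text, next_href); next_href may be relative or None.
    """
    results = []
    visited = set()
    url = start_url
    while url and url not in visited and len(results) < max_pages:
        visited.add(url)
        text, next_href = fetch_page(url)
        results.append((url, text))
        # Resolve relative links against the current page; stop if there is none.
        url = urljoin(url, next_href) if next_href else None
    return results
```

The visited set and the max_pages cap are safeguards added for this sketch, so that a link cycle or an unexpectedly long chain cannot keep the loop running indefinitely.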

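3.2 Preliminary cleanup of the scraped text

The scraped text contains unneeded characters, line breaks, and the like in some places. Left in, they would cause problems when the data is stored later, so they need to be removed. The code is as follows: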

    # Clean the title: strip the list brackets, escaped whitespace, and the
    # characters \ / : * ? " < > | (which Windows forbids in file names)
    for s in article_title:
        delete = str(s.contents)
        title = (delete.replace('[\'', '').replace('\']', '').replace('\\r', '')
                 .replace('\\n', '').replace('\\t', '').replace('\\', '')
                 .replace('/', '').replace(':', '').replace('*', '')
                 .replace('?', '').replace('\"', '').replace('<', '')
                 .replace('>', '').replace('|', ''))
    # Clean the body: strip the list brackets, escaped whitespace, and leftover tags
    for t in article_content:
        delete = str(t.contents)
        context = (delete.replace('[\'', '').replace('\']', '').replace('\\r', '')
                   .replace('\\n', '').replace('\\t', '').replace('\\u3000', '')
                   .replace('\', <br/>,', '').replace('<br/>, \'', '')
                   .replace('<br/>,', '').replace('<br/>', '').replace('</p>', '')
                   .replace('<p>', '').replace(' ', '').replace('\'', '')
                   .lstrip('\''))
    title_and_context = title + '。' + context
    # Drop a trailing single quote if one is left over
    if title_and_context.endswith("'"):
        title_and_context = title_and_context[:-1]

After this processing, the preliminary cleanup of the text is complete.
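
The chained replace() calls above can also be expressed with two regular expressions. The following is a sketch under a stated assumption: the input is already plain text (for example, obtained with BeautifulSoup's get_text() rather than str(tag.contents)), so no list brackets or tags remain. `clean_title` and `clean_body` are hypothetical helper names.

```python
import re

# Characters Windows forbids in file names, plus CR, LF, and tab.
_TITLE_JUNK = re.compile(r'[\\/:*?"<>|\r\n\t]')
# Any run of whitespace, including the full-width space U+3000.
_WHITESPACE = re.compile(r'\s+')


def clean_title(raw):
    """Remove characters that would break a file name, then trim the ends."""
    return _TITLE_JUNK.sub('', raw).strip()


def clean_body(raw):
    """Remove all whitespace from the article text."""
    return _WHITESPACE.sub('', raw)
```

Keeping the patterns in one place makes it easier to add a character later than extending a long replace() chain.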

(Figure: the stored article text, as it appears inside the SQL statement.)

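3.3 Storing the articles

(1) Encoding. The crawled pages are encoded as gb2312, but during crawling some characters, such as "槃", still could not be recognized and raised errors. Setting the crawler's encoding to gb18030 fixed the problem. gb18030 extends gb2312 and gbk and supports more Chinese characters.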
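
One way to see why switching codecs helps is to test which codec can represent a given string. The following is a minimal sketch using only Python's built-in codecs; `can_encode` and `first_usable_encoding` are hypothetical helper names, not part of the crawler.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def first_usable_encoding(text, candidates=("gb2312", "gbk", "gb18030")):
    """Return the first codec in candidates that can represent text, or None."""
    for encoding in candidates:
        if can_encode(text, encoding):
            return encoding
    return None
```

Because gb18030 covers every Unicode character, it is a safe last candidate in the list.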

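(2) The database needs to be configured as well. MySQL commonly creates databases and tables with a UTF-8 character set by default (the actual default depends on the server version and configuration). Here I changed it to gb18030 to avoid errors when storing the data.

(Figure: the database character-set settings.)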
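
The same setting can be applied with SQL statements instead of a GUI. The following sketch only builds the statements as strings; it assumes a MySQL server recent enough to offer the gb18030 character set (MySQL 5.7 or later), and `charset_ddl` is a hypothetical helper name. The statements would be passed to a cursor's execute() method.

```python
def charset_ddl(database, table, charset="gb18030"):
    """Return MySQL statements that set the character set explicitly."""
    for name in (database, table, charset):
        # Identifiers cannot be bound as query parameters, so validate them.
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        f"CREATE DATABASE IF NOT EXISTS {database} CHARACTER SET {charset}",
        f"ALTER TABLE {database}.{table} CONVERT TO CHARACTER SET {charset}",
    ]
```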

(3) Saving as .txt

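    import os

    # Save to a text file
    def save_files(path, curr_file_name, curr_content):
        if os.path.exists(path):  # the folder exists
            os.chdir(path)  # enter the folder
        elif os.getcwd()[-len(path):] == path:  # the current directory already ends with path
            print("This article has been saved")
        else:
            os.mkdir(path)  # create the folder
            os.chdir(path)  # enter the folder
        # Write the file as GB18030; "with" closes it even if write() fails
        with open(curr_file_name, 'w', encoding='GB18030') as f:
            f.write(curr_content)
        print(os.getcwd())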
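
An alternative that avoids changing the working directory is to build the full file path and write to it directly. This is a sketch of that variant, not the author's code; `save_article` is a hypothetical name.

```python
from pathlib import Path


def save_article(folder, file_name, content, encoding="gb18030"):
    """Write content to folder/file_name and return the resulting path."""
    target_dir = Path(folder)
    # Create the folder (and any parents) if missing; do nothing if it exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    target.write_text(content, encoding=encoding)
    return target
```

Since the process never calls os.chdir(), repeated calls with the same relative folder keep resolving to the same place.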

(4) Saving to the database

Establishing the database connection

util.py
    import mysql.connector

    def get_connect(curr_host, curr_user, curr_passwd, curr_database):
        my_db = mysql.connector.connect(
            host=curr_host,  # database host
            user=curr_user,  # database user name
            passwd=curr_passwd,  # database password
            database=curr_database  # database to use
        )
        my_cursor = my_db.cursor()
        return my_cursor, my_db

Creating the database

    import mysql.connector

    # my_cursor.execute("CREATE DATABASE articles_db")
    # my_cursor.execute("USE articles_db")
    my_db = mysql.connector.connect(
        host="localhost",  # database host
        user="root",  # database user name
        passwd="123",  # database password
        database="articles_db"  # database to use
    )
    my_cursor = my_db.cursor()
    my_cursor.execute(
        "CREATE TABLE articles_tb (id INT AUTO_INCREMENT PRIMARY KEY, htmlId varchar(255), context MEDIUMTEXT)")

Saving to the database

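    import util

    # Save to MySQL
    def save_files_to_mysql(curr_file_name, curr_content):
        my_cursor, my_db = util.get_connect("localhost", "root", "123", "articles_db")
        # Build the statement by concatenation; this assumes the values contain no single quotes
        sql_1 = "INSERT INTO articles_tb (htmlId,context)VALUES(\'"
        sql_2 = "\',\'"
        sql_3 = "\')"
        sql = sql_1 + curr_file_name + sql_2 + curr_content + sql_3
        print("sql:" + sql)
        my_cursor.execute(sql)
        my_db.commit()  # commit the insert; this step must not be forgotten
        my_cursor.close()
        my_db.close()  # the parentheses are required, otherwise the connection is never closed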
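
A statement built by string concatenation breaks as soon as the text contains a quote character. Bound parameters avoid this by letting the driver handle quoting. The sketch below shows the idea with Python's built-in sqlite3 module as a stand-in for MySQL, so it runs without a server; `save_article_row` is a hypothetical name, and with mysql.connector the placeholders would be %s instead of ?.

```python
import sqlite3


def save_article_row(conn, html_id, context):
    """Insert one article using bound parameters instead of string building."""
    # sqlite3 uses "?" placeholders; mysql.connector uses "%s" in the same role.
    conn.execute(
        "INSERT INTO articles_tb (htmlId, context) VALUES (?, ?)",
        (html_id, context),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles_tb ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, htmlId TEXT, context TEXT)"
)
save_article_row(conn, "demo-title", "text with 'single' and \"double\" quotes")
stored = conn.execute("SELECT htmlId, context FROM articles_tb").fetchone()
```

With this approach the text no longer needs its quotes stripped before it is stored.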

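4. Summary

4.1

Using the crawler, I collected a certain amount of data. The plan is to process this text through a series of steps, such as data cleaning, triple extraction, and knowledge graph construction, to build a search feature for a specific domain.

4.2

There are many crawler frameworks worth using, such as pyspider and Scrapy. I will make further improvements after studying them.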
