A Simple Python Crawler (Part 2)

Reposted article, 2018-04-18 21:28. Tag: python

The previous part fetched the content returned by a URL. This part extracts data from that content and saves the result to an HTML file.

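I. Requirements:

Start page: the Baidu Baike entry for Python, https://baike.baidu.com/item/Python/407313

Analyze the structure of the page source to decide what to extract:

Title: inside the first h1 tag of the dd element whose class is lemmaWgt-lemmaTitle-title

Summary: the text content of the div whose class is lemma-summary

Related entries: a tags whose href starts with /item/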

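II. Flowchart of the crawl process:

III. Analysis:

1. Web page downloader:

1. Purpose:

Downloads the page behind a URL from the internet to the local machine as HTML.

Common downloaders
  1. urllib2, a module in the Python 2 standard library (in Python 3 its functionality lives in urllib.request, which the implementation code later in this article uses)
  2. requests, a third-party package with more features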

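2. Three ways to download a page with urllib2

(1) Pass the URL to urllib2.urlopen(url)

import urllib2  # Python 2 only; in Python 3 use urllib.request instead

# Send the request directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means success
code = response.getcode()

# Read the content
cont = response.read()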

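(2) Add data and an HTTP header

Pass the url, data, and header to urllib2.Request,
then call urllib2.urlopen(request)

import urllib2

url = 'http://www.baidu.com'

# Create the Request object
request = urllib2.Request(url)

# Add data (a form-encoded string; this turns the request into a POST)
request.add_data('a=1')

# Add an HTTP header so the crawler presents itself as a Mozilla browser
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the result
response = urllib2.urlopen(request)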
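
The snippet above is Python 2. As a rough sketch of the same idea in Python 3, assuming urllib.request is used as in the implementation later in this article, the request could be built like this. The form field 'a' is just the placeholder from the snippet above, and the network call is left commented out.

```python
import urllib.parse
import urllib.request

# Placeholder URL, used here only to build the request object.
url = 'http://www.baidu.com'

# In Python 3, form data must be URL-encoded and converted to bytes.
data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')

# Passing data makes this a POST request.
request = urllib.request.Request(url, data=data)
request.add_header('User-Agent', 'Mozilla/5.0')

print(request.get_method())  # POST
print(request.data)          # b'a=1'

# Sending the request needs network access, so it is left commented out:
# with urllib.request.urlopen(request, timeout=10) as response:
#     code = response.getcode()
#     cont = response.read()
```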

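(3) Add handlers for special scenarios

For pages that require login, add cookie handling (HTTPCookieProcessor);
for pages that are only reachable through a proxy, use ProxyHandler;
for HTTPS requests, use HTTPSHandler.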
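
The original gives no code for this third method. One possible sketch in Python 3 is shown below; the proxy address is a made-up placeholder, and nothing is sent over the network.

```python
import http.cookiejar
import urllib.request

# Cookie jar that keeps cookies set by the server across requests.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# Hypothetical local proxy address; replace it with a real one.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})

# build_opener chains the given handlers with the default ones.
opener = urllib.request.build_opener(cookie_handler, proxy_handler)

# To use the opener (needs network access, so left commented out):
# response = opener.open('http://www.baidu.com', timeout=10)
# Or make urllib.request.urlopen use it everywhere:
# urllib.request.install_opener(opener)
```

Cookies returned by the server end up in cookie_jar and are sent back automatically on later requests made through the same opener.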

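2. Web page parser

1. Purpose:

A tool for extracting valuable data from a web page.

It takes the HTML page as a string and outputs the valuable data plus a list of new URLs to crawl.

Types of parsers
  1. Regular expressions: match against the downloaded HTML string. This is fuzzy, string-level matching and suits only simple pages
  2. html.parser, a module that ships with Python
  3. BeautifulSoup, a third-party package
  4. lxml, a third-party package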
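
To give a feel for the second option, here is a small sketch that uses only the standard-library html.parser to collect links starting with /item/. The class name ItemLinkParser and the sample markup are made up for illustration; the article itself uses BeautifulSoup for this job.

```python
from html.parser import HTMLParser


class ItemLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with /item/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs
        href = dict(attrs).get('href')
        if href and href.startswith('/item/'):
            self.links.append(href)


# Made-up sample markup
sample = ('<p><a href="/item/Guido">Guido</a> '
          '<a href="/view/1.html">old</a> '
          '<a href="/item/CPython">CPython</a></p>')

parser = ItemLinkParser()
parser.feed(sample)
print(parser.links)  # ['/item/Guido', '/item/CPython']
```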

Structured parsing works by turning the page into a DOM tree:

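2. BeautifulSoup: introduction and usage:

1. Introduction:

BeautifulSoup: a third-party Python library for extracting data from HTML or XML

Install and test beautifulsoup

Method 1: - install: pip install beautifulsoup4
          - test: import bs4

Method 2: PyCharm -- File -- Settings -- Project Interpreter -- add beautifulsoup4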

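2. Syntax:

Create a BeautifulSoup object from the HTML page string; once it is created, the DOM tree is already loaded.
You can then search for nodes with find_all and find, which return all matching nodes or only the first one (searching by node name, attributes, or text).
Once you have a node, you can read its name, attributes, and text.

For example:
<a href="123.html" class="aaa">Python</a>
can be searched by:
node name: a
node attributes: href="123.html" class="aaa"
node content: Python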

Creating a BeautifulSoup object:

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from the downloaded HTML page string
soup = BeautifulSoup(
    html_doc,               # HTML document string
    'html.parser',          # HTML parser
    from_encoding='utf-8'   # encoding of the HTML document
)

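Searching for nodes:
Method: find_all(name, attrs, string)

import re

# Find all nodes with tag a
soup.find_all('a')

# Find all a nodes whose link is exactly /view/123.html
soup.find_all('a', href='/view/123.html')
soup.find('a', href=re.compile('aaa'))  # match the href with a regular expression

# Find all div nodes with class abc and text Python
soup.find_all('div', class_='abc', string='Python')

Because class is a Python keyword, the class attribute is passed as class_ (with a trailing underscore) to avoid a conflict.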

Accessing node information:
  Given the node: <a href="123.html" class="aaa">Python</a>

# Get the tag name of the node
node.name
# Get the href attribute of the node
node['href']
# Get the link text of the node
node.get_text()
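
The calls above can be tried end to end on a small document. The following is a minimal sketch, assuming beautifulsoup4 is installed; the markup is made up, and it is passed as a str, so from_encoding is not needed.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample markup
html_doc = (
    '<html><body>'
    '<a href="/item/Python" class="aaa">Python</a>'
    '<a href="/view/123.html">Other</a>'
    '<div class="abc">Python</div>'
    '</body></html>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

all_links = soup.find_all('a')
node = soup.find('a', href=re.compile(r'^/item/'))
divs = soup.find_all('div', class_='abc', string='Python')

print(len(all_links))   # 2
print(node.name)        # a
print(node['href'])     # /item/Python
print(node['class'])    # ['aaa']
print(node.get_text())  # Python
print(len(divs))        # 1
```

Note that node['class'] is a list, because an HTML class attribute may hold several values.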

IV. Implementation:

spider.py

# Entry point and scheduler of the crawler
from baike import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urlManager = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.parser = html_parser.HtmpParser()
        self.outputer = html_outputer.HtmlOutpter()


    def craw(self,url):
        count = 1  # counts the pages crawled; the loop stops after 5
        self.urlManager.add_new_url(url)
        while self.urlManager.has_new_url():
            try:
                # Take one url
                new_url = self.urlManager.get_new_url()
                # Request the url and get the data returned by the site
                html_content = self.downloader.download(new_url)
                new_urls, new_datas = self.parser.parse(new_url, html_content)
                self.urlManager.add_new_urls(new_urls)
                self.outputer.collect_data(new_datas)
                print(count)
                if count == 5:
                    break
                count = count+1
            except Exception as e:
                print("An error occurred", e)
        # Write the crawl results to an html file
        self.outputer.out_html()

if __name__=="__main__":
    url = 'https://baike.baidu.com/item/Python/407313'
    sm = SpiderMain()
    sm.craw(url)

url_manager.py

# URL manager
class UrlManager(object):
    def __init__(self):
        # Two sets: one holds urls not yet crawled, the other holds urls already visited
        self.new_urls = set()
        self.old_urls = set()

    # Add a single url
    def add_new_url(self,url):
        if url is None:
            return None
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Check whether any url is still waiting to be crawled (based on the size of new_urls)
    def has_new_url(self):
        return len(self.new_urls) != 0

    # Get one new url
    def get_new_url(self):
        if len(self.new_urls)>0:
            # Pop one url from new_urls and add it to old_urls
            new_url = self.new_urls.pop()
            self.old_urls.add(new_url)
            return new_url

    # Add urls in bulk
    def add_new_urls(self, new_urls):
        if new_urls is None:
            return
        for url in new_urls:
            self.add_new_url(url)
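
One thing worth knowing about the class above: set.pop() removes an arbitrary element, so pages are not necessarily crawled in the order they were found. If discovery order matters, a queue plus a "seen" set is one option. The sketch below is a hypothetical variant (OrderedUrlManager is not part of the original code), not a drop-in requirement.

```python
from collections import deque


class OrderedUrlManager:
    """Hypothetical variant of UrlManager that hands out URLs in FIFO order."""

    def __init__(self):
        self.queue = deque()  # URLs waiting to be crawled, in discovery order
        self.seen = set()     # every URL ever added, crawled or not

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.queue)

    def get_new_url(self):
        return self.queue.popleft() if self.queue else None


manager = OrderedUrlManager()
manager.add_new_urls(['/item/a', '/item/b', '/item/a'])  # duplicate is ignored
first = manager.get_new_url()
manager.add_new_url('/item/a')  # already seen, ignored again
rest = [manager.get_new_url(), manager.get_new_url()]
print(first, rest)  # /item/a ['/item/b', None]
```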

html_downloader.py

# Class that downloads a web page
import urllib.request


class HtmlDownLoader(object):
    def download(self, url):
        if url is None:
            return
        # Request the url
        response = urllib.request.urlopen(url)
        # A status code other than 200 means something went wrong
        if response.getcode() != 200:
            return
        return response.read()

html_parser.py

# Web page parser class
import re
import urllib.parse

from bs4 import BeautifulSoup


class HtmpParser(object):
    # Parse the downloaded page
    def parse(self, new_url, html_content):
        if html_content is None:
            # Return a pair so the caller can still unpack the result
            return None, None
        soup = BeautifulSoup(html_content,'html.parser',from_encoding='utf-8')
        new_urls = self.get_new_urls(new_url,soup)
        new_datas = self.get_new_datas(new_url,soup)
        return new_urls, new_datas


    # Collect the new urls found on the page
    def get_new_urls(self, new_url, soup):
        new_urls = set()
        # Find the a tags on the page whose href contains /item
        links = soup.find_all('a',href=re.compile(r'/item'))
        for link in links:
            # Get the href attribute of the a tag
            url = link['href']
            # Join the urls so the crawled path becomes a full http://... address
            new_full_url = urllib.parse.urljoin(new_url,url)
            new_urls.add(new_full_url)
        return new_urls
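
The urljoin call above is what turns a site-relative href into a full address. A small sketch of its behavior follows; the hrefs are made up, and only the page URL comes from the article.

```python
from urllib.parse import urljoin

page_url = 'https://baike.baidu.com/item/Python/407313'

# Root-relative href: the path of the page is replaced
joined = urljoin(page_url, '/item/Guido')
print(joined)  # https://baike.baidu.com/item/Guido

# Href that is already absolute: returned unchanged
absolute = urljoin(page_url, 'https://example.com/x')
print(absolute)  # https://example.com/x
```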



    # Extract the data of the current page
    def get_new_datas(self, new_url, soup):
        new_datas = {}
        # Get the title
        title_node = soup.find('dd',class_='lemmaWgt-lemmaTitle-title').find('h1')
        new_datas['title'] = title_node.get_text()

        # Get the summary
        summary_node = soup.find('div',class_='lemma-summary')
        new_datas['summary'] = summary_node.get_text()

        new_datas['url'] = new_url

        return new_datas

html_outputer.py

class HtmlOutpter(object):
    # Constructor
    def __init__(self):
        self.datas = []

    # Collect the data of one page
    def collect_data(self, new_datas):
        if new_datas is None:
            return
        # If the data is not empty, append it to the datas list
        self.datas.append(new_datas)

    # Write the crawled data to the local disk
    def out_html(self):
        if self.datas is None:
            return
        # The with statement closes the file even if a write fails
        with open('C:\\Users\\liqiang\\Desktop\\實習\\python\\pythonCraw\\out.html', 'w', encoding='utf-8') as file:
            file.write("<html>")
            file.write("<head>")
            file.write("<title>Crawl results</title>")
            # Give the table visible borders
            file.write(r'''
            <style>
             table{width:100%;table-layout: fixed;word-break: break-all; word-wrap: break-word;}
             table td{border:1px solid black;width:300px}
            </style>
            ''')
            file.write("</head>")
            file.write("<body>")
            file.write("<table cellpadding='0' cellspacing='0'>")
            # Loop over datas and fill in the table
            for data in self.datas:
                file.write("<tr>")
                file.write('<td><a href="'+str(data['url'])+'">'+str(data['url'])+'</a></td>')
                file.write("<td>%s</td>" % data['title'])
                file.write("<td>%s</td>" % data['summary'])
                file.write("</tr>")
            file.write("</table>")
            file.write("</body>")
            file.write("</html>")
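
The outputer writes titles and summaries into the page as they are. If the scraped text contains characters such as < or &, the table can break, so escaping the values first may be worthwhile. A minimal sketch using the standard-library html.escape is shown below; render_row is a hypothetical helper, not part of the original code.

```python
import html


def render_row(data):
    """Return one <tr> with all values escaped for HTML."""
    url = html.escape(data['url'], quote=True)
    title = html.escape(data['title'])
    summary = html.escape(data['summary'])
    return ('<tr><td><a href="%s">%s</a></td><td>%s</td><td>%s</td></tr>'
            % (url, url, title, summary))


# Made-up title and summary that contain special characters
row = render_row({
    'url': 'https://baike.baidu.com/item/Python/407313',
    'title': 'Python <3',
    'summary': 'a & b',
})
print(row)
```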

Run the main block of spider.py. The extracted results are saved to the HTML file.

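Summary:

Python classes resemble Java classes and can inherit from object (in Python 3 this happens implicitly)

A bare return is the same as return None (None is similar to Java's null keyword)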
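
The second point is easy to check. The function names in this short sketch are made up for illustration; it also shows that a function with no return statement at all behaves the same way.

```python
def bare_return():
    return


def explicit_none():
    return None


def no_return():
    pass


results = [bare_return(), explicit_none(), no_return()]
print(all(r is None for r in results))  # True
```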
