Performance comparison of XPath, re, bs4, and other scraping parsers



Original article: https://sitoi.cn/posts/23470.html

Approach

Test page: http://baijiahao.baidu.com/s?id=1644707202199076031

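Each method extracts the same data (the article title and body) from the same page. Every extraction is repeated 500 times, and the summed times are then compared.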
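The repeat-and-sum idea can be sketched independently of any parser. The helper names `total_time` and `ratio` below are hypothetical, and the sketch uses `time.perf_counter()` rather than the `time.time()` calls in the benchmark itself, since it is a monotonic, higher-resolution clock.

```python
import time


def total_time(func, repeats=500):
    """Call func `repeats` times and return the summed elapsed time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        func()
        samples.append(time.perf_counter() - start)
    return sum(samples)


def ratio(candidate_total, baseline_total):
    """How many times slower the candidate is than the baseline."""
    return candidate_total / baseline_total


if __name__ == "__main__":
    text = "<title>demo</title>" * 1000
    baseline = total_time(lambda: text.find("</title>"))
    candidate = total_time(lambda: text.count("</title>"))
    print(f"candidate/baseline: {ratio(candidate, baseline)}")
```

Summing 500 samples smooths out the jitter of any single call, which matters when one extraction takes only a few microseconds.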

Test example

# -*- coding: utf-8 -*-
import re
import time

import scrapy
from bs4 import BeautifulSoup


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baijiahao.baidu.com/s?id=1644707202199076031']

    def parse(self, response):
        re_time_list = []
        xpath_time_list = []
        lxml_time_list = []
        bs4_lxml_time_list = []
        html5lib_time_list = []
        bs4_html5lib_time_list = []
        for _ in range(500):
            # re
            re_start_time = time.time()
            news_title = re.findall(pattern="<title>(.*?)</title>", string=response.text)[0]
            news_content = "".join(re.findall(pattern='<span class="bjh-p">(.*?)</span>', string=response.text))
            re_time_list.append(time.time() - re_start_time)
            # xpath
            xpath_start_time = time.time()
            news_title = response.xpath("//div[@class='article-title']/h2/text()").extract_first()
            news_content = response.xpath('string(//*[@id="article"])').extract_first()
            xpath_time_list.append(time.time() - xpath_start_time)
            # bs4 + html5lib, selection only (BeautifulSoup construction is not timed)
            soup = BeautifulSoup(response.text, "html5lib")
            html5lib_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            html5lib_time_list.append(time.time() - html5lib_start_time)
            # bs4 + html5lib, BeautifulSoup construction plus selection
            bs4_html5lib_start_time = time.time()
            soup = BeautifulSoup(response.text, "html5lib")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_html5lib_time_list.append(time.time() - bs4_html5lib_start_time)

            # bs4 + lxml, selection only (BeautifulSoup construction is not timed)
            soup = BeautifulSoup(response.text, "lxml")
            lxml_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            lxml_time_list.append(time.time() - lxml_start_time)

            # bs4 + lxml, BeautifulSoup construction plus selection
            bs4_lxml_start_time = time.time()
            soup = BeautifulSoup(response.text, "lxml")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_lxml_time_list.append(time.time() - bs4_lxml_start_time)
        re_result = sum(re_time_list)
        xpath_result = sum(xpath_time_list)
        lxml_result = sum(lxml_time_list)
        html5lib_result = sum(html5lib_time_list)
        bs4_lxml_result = sum(bs4_lxml_time_list)
        bs4_html5lib_result = sum(bs4_html5lib_time_list)

        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        print(f"re time: {re_result}")
        print(f"xpath time: {xpath_result}")
        print(f"lxml selection-only time: {lxml_result}")
        print(f"html5lib selection-only time: {html5lib_result}")
        print(f"bs4_lxml construction + selection time: {bs4_lxml_result}")
        print(f"bs4_html5lib construction + selection time: {bs4_html5lib_result}")
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
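        print(f"xpath/re :{xpath_result / re_result}")
        print(f"lxml/re :{lxml_result / re_result}")
        print(f"html5lib/re :{html5lib_result / re_result}")
        print(f"bs4_lxml/re :{bs4_lxml_result / re_result}")
        print(f"bs4_html5lib/re :{bs4_html5lib_result / re_result}")
        # The recorded output also lists these two ratios against xpath.
        print(f"lxml/xpath :{lxml_result / xpath_result}")
        print(f"html5lib/xpath :{html5lib_result / xpath_result}")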
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")

Test results:

Run 1

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.018010616302490234
xpath time: 0.19927382469177246
lxml selection-only time: 0.3410227298736572
html5lib selection-only time: 0.3842911720275879
bs4_lxml construction + selection time: 1.6482152938842773
bs4_html5lib construction + selection time: 6.744122505187988

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re :11.064242408196765
lxml/re :18.934539726245003
html5lib/re :21.336925154218847
bs4_lxml/re :91.51354213550078
bs4_html5lib/re :374.4526223822509
lxml/xpath :1.7113272673976896
html5lib/xpath :1.9284578525152096

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Run 2

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.023047208786010742
xpath time: 0.18992280960083008
lxml selection-only time: 0.3522317409515381
html5lib selection-only time: 0.418229341506958
bs4_lxml construction + selection time: 1.710503101348877
bs4_html5lib construction + selection time: 7.1153998374938965

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re :8.24059917034769
lxml/re :15.28305419636484
html5lib/re :18.14663742538819
bs4_lxml/re :74.21736476770769
bs4_html5lib/re :308.7315216154427
lxml/xpath :1.8546047296364272
html5lib/xpath :2.2021016979791463

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Run 3

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.014002561569213867
xpath time: 0.18992352485656738
lxml selection-only time: 0.3783881664276123
html5lib selection-only time: 0.39995455741882324
bs4_lxml construction + selection time: 1.751767873764038
bs4_html5lib construction + selection time: 7.1871068477630615

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re :13.563484360899695
lxml/re :27.022781835827757
html5lib/re :28.56295653062267
bs4_lxml/re :125.10338662716453
bs4_html5lib/re :513.2708620660298
lxml/xpath :1.9923185751389976
html5lib/xpath :2.1058716013241323

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Analysis of results:

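Ratios of the times averaged over the three runs. Each value is the time of the slower method divided by the time of the faster one; "lxml" and "html5lib" denote bs4 selection only, while "lxml(bs4)" and "html5lib(bs4)" include BeautifulSoup construction.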

                 re    xpath   lxml    html5lib   lxml(bs4)   html5lib(bs4)
re               1     10.52   19.46   21.84      92.82       382.25
xpath                  1       1.85    2.08       8.82        36.34
lxml                           1       1.12       4.77        19.64
html5lib                               1          4.25        17.50
lxml(bs4)                                         1           4.12
html5lib(bs4)                                                 1
  • xpath/re :10.52
  • lxml/re :19.46
  • html5lib/re :21.84
  • bs4_lxml/re :92.82
  • bs4_html5lib/re :382.25
  • lxml/xpath :1.85
  • html5lib/xpath :2.08
  • bs4_lxml/xpath :8.82
  • bs4_html5lib/xpath :36.34
  • html5lib/lxml :1.12
  • bs4_lxml/lxml :4.77
  • bs4_html5lib/lxml :19.64
  • bs4_lxml/html5lib :4.25
  • bs4_html5lib/html5lib :17.50
  • bs4_html5lib/bs4_lxml :4.12
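These ratios can be reproduced from the three runs. The sketch below copies the recorded totals (rounded to six decimals) and divides the summed times pairwise, which gives the same result as dividing the averages; the dictionary layout and the `slowdown` helper are illustrative names, not part of the original benchmark.

```python
# Total times in seconds (500 iterations each), copied from runs 1 to 3 and rounded.
runs = {
    "re": [0.018011, 0.023047, 0.014003],
    "xpath": [0.199274, 0.189923, 0.189924],
    "lxml": [0.341023, 0.352232, 0.378388],
    "html5lib": [0.384291, 0.418229, 0.399955],
    "bs4_lxml": [1.648215, 1.710503, 1.751768],
    "bs4_html5lib": [6.744123, 7.115400, 7.187107],
}

# Sum each method's time across the three runs.
totals = {name: sum(times) for name, times in runs.items()}
names = list(runs)  # ordered from fastest to slowest


def slowdown(slower, faster):
    """Ratio of summed times: how many times slower `slower` is than `faster`."""
    return totals[slower] / totals[faster]


# Print every pair in the same order as the list above.
for i, faster in enumerate(names):
    for slower in names[i + 1:]:
        print(f"{slower}/{faster} :{slowdown(slower, faster):.2f}")
```

Note that the mean of the three per-run ratios would differ slightly (for xpath/re it comes to about 10.96 rather than 10.52), so the figures here are ratios of averaged times, not averages of ratios.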

Comparison of the three extraction approaches

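               re          xpath           bs4
Installation   built-in    third-party     third-party
Syntax         regex       path matching   object-oriented
Ease of use    hard        fairly hard     easy
Performance    highest     moderate        lowest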
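One plausible reason for rating re as hard to use is that a regex matches raw text rather than document structure. The sketch below reuses the content pattern from the benchmark against three inputs; the second and third inputs are hypothetical markup variations made up for illustration, not taken from the test page.

```python
import re

# Same content pattern as in the benchmark spider.
PATTERN = '<span class="bjh-p">(.*?)</span>'

original = '<span class="bjh-p">first paragraph</span>'
# Hypothetical change: the site adds another attribute to the tag.
extra_attr = '<span class="bjh-p" data-id="1">first paragraph</span>'
# Hypothetical change: the text contains a line break.
multiline = '<span class="bjh-p">first\nparagraph</span>'

print(re.findall(PATTERN, original))    # ['first paragraph']
print(re.findall(PATTERN, extra_attr))  # []
print(re.findall(PATTERN, multiline))   # []
# With re.S, "." also matches newlines, so the multi-line case is found again.
print(re.findall(PATTERN, multiline, flags=re.S))  # ['first\nparagraph']
```

A selector such as `span.bjh-p` in a tree-based parser would be unaffected by either variation, which is part of the maintenance cost that the speed figures do not capture.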

Conclusion

Performance: re > xpath > bs4

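  • re is about 10 times as fast as xpath

    Although re far outperforms both xpath and bs4, it is much harder to use than either, and the resulting code is much harder to maintain later on.

  • xpath is about 1.8 times as fast as bs4

    Looking at extraction alone, xpath is about 1.8 times as fast as bs4 with the lxml backend (about 2.1 times with html5lib). In practice, bs4 also has to convert the response into a BeautifulSoup object first; once that step is counted, xpath is about 8.8 times as fast with lxml and about 36 times as fast with html5lib. For deeply nested pages and large volumes, xpath is therefore far more efficient than bs4.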
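To put the conversion cost in per-page terms, the following sketch divides the summed times from the three runs by the 1500 iterations they cover. It assumes that subtracting the selection-only time from the construction-plus-selection time is a fair estimate of the construction cost; the two were timed as separate steps in the same loop, so this is an approximation. The names `per_page_ms` and `construction_share` are hypothetical.

```python
ITERATIONS = 500  # iterations per run
RUNS = 3          # number of recorded runs

# Times summed over the three runs, in seconds (rounded from the results above).
selection_only = {"lxml": 1.071643, "html5lib": 1.202475}
with_construction = {"lxml": 5.110486, "html5lib": 21.046629}


def per_page_ms(total_seconds):
    """Average cost of one iteration in milliseconds."""
    return total_seconds / (ITERATIONS * RUNS) * 1000


def construction_share(backend):
    """Estimated fraction of the full bs4 time spent building the tree."""
    full = with_construction[backend]
    return (full - selection_only[backend]) / full


for backend in selection_only:
    full_ms = per_page_ms(with_construction[backend])
    share = construction_share(backend)
    print(f"{backend}: {full_ms:.2f} ms per page, {share:.0%} building the tree")
# lxml: 3.41 ms per page, 79% building the tree
# html5lib: 14.03 ms per page, 94% building the tree
```

Under this estimate, most of the bs4 cost is parsing rather than selecting, so reusing one BeautifulSoup object for all selections on a page matters more than the choice of selector.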

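Overall, xpath combined with distributed crawling via scrapy-redis comfortably meets performance requirements, so xpath is the approach I recommend adopting.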

