Spider (6): Multithreading & Scrapy

Reposted article, 2020-04-07 17:58

Day05 Review
1. json module
  1. json.loads()
    JSON (object, array) -> Python (dict, list)
  2. json.dumps()
    Python (dict, list, tuple) -> JSON (object, array)
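
The two conversions can be tried with a minimal sketch like the one below. The sample data is made up for illustration; only json.loads() and json.dumps() come from the notes.

```python
import json

# A JSON object containing an array (hypothetical sample data)
json_text = '{"name": "spider", "pages": [1, 2, 3]}'

# loads(): JSON text -> Python objects (object -> dict, array -> list)
data = json.loads(json_text)

# dumps(): Python objects -> JSON text (a tuple is written as a JSON array)
encoded = json.dumps({"point": (1, 2)})

# The tuple does not survive the round trip: it comes back as a list
decoded = json.loads(encoded)
```

Note that JSON has no tuple type, so a value that goes out as a tuple is read back as a list.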
2. Dynamically loaded pages (Ajax)
  1. F12 -> Query String Parameters
  2. params={the query parameters listed under Query String Parameters}
  3. URL: the GET address captured in F12
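
To see what the params dict turns into, here is a small sketch using only the standard library. The endpoint and the parameter names are hypothetical placeholders for whatever F12 shows on the real site; with requests, passing the same dict as params= should produce an equivalent URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, standing in for what F12 shows
base_url = "http://example.com/api/list"
params = {"type": "11", "start": "0", "limit": "20"}

# Append the encoded query string to the base address
full_url = base_url + "?" + urlencode(params)

# Parsing the URL back recovers the same parameters
recovered = {k: v[0] for k, v in parse_qs(urlsplit(full_url).query).items()}
```

Paging through an Ajax interface then usually comes down to changing one of these values (for example start) on each request.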
3. selenium + PhantomJS
  1. PhantomJS: a headless browser (pages are loaded in memory)
  2. Usage steps
    1. Import the module
      from selenium import webdriver
    2. Create the browser object
      driver = webdriver.PhantomJS(executable_path='')
    3. Send the request and load the page
      driver.get(url)
    4. Locate the node
      text = driver.find_element_by_class_name("")
    5. Type text
      text.send_keys("")
    6. Click
      button = driver.find_element_by_id("")
      button.click()
    7. Quit
      driver.quit()
  3. Common methods
    1. driver.get(url)
    2. driver.page_source
    3. driver.page_source.find("string")
        -1: the string was not found
    4. driver.find_element_by_id("")
    5. driver.find_element_by_name("")
    6. driver.find_element_by_class_name("")
    7. driver.find_element_by_xpath("")
    8. element.send_keys("")
    9. element.click()
    10. element.text
    11. driver.quit()
4. selenium + chromedriver
  1. Download the chromedriver release that matches the installed Chrome version
  2. Enabling headless mode
    option = webdriver.ChromeOptions()
    option.set_headless()
    option.add_argument('--window-size=1920,3000')

    driver = webdriver.Chrome(options=option)
    driver.get(url)
*********************************
Day06 Notes

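I. Case study: scraping JD.com products

See: 01_京東商品抓取(執行JS腳本).py
1. Targets
    Product name, price, number of reviews, seller name
2. XPath matching each product node
    //div[@id="J_goodsList"]//li
3. About the "next page" button
    Next page (clickable)     : class is pn-next
    Next page (not clickable) : class is pn-next disabled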
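
One way to use the two class values is to stop paging once the disabled variant shows up in the page source, reusing the page_source.find() check from the Day05 review. The sketch below is an assumption about how that could look: the function name has_next_page and the HTML fragments are made up, and the fragments stand in for driver.page_source.

```python
def has_next_page(page_source):
    """Return True while the 'next page' button is still clickable."""
    # str.find() returns -1 when the substring is absent, so -1 here
    # means the disabled variant of the button is not on the page.
    return page_source.find("pn-next disabled") == -1


# Made-up fragments standing in for driver.page_source
middle_page = '<a class="pn-next" href="javascript:;">next</a>'
last_page = '<a class="pn-next disabled" href="javascript:;">next</a>'
```

In a selenium loop, the crawler would parse the current page, then click the pn-next element only while has_next_page(driver.page_source) is true.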

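II. Multithreaded crawlers

1. Process
    1. An application currently running on the system
    2. One CPU core can execute only one process at a time; the other processes are not running
    3. N CPU cores can execute N tasks at the same time
2. Thread
    1. A unit of execution inside a process; one process can contain multiple threads
    2. Threads share the memory space of their process (in CPython only one thread executes Python code at a time; the others wait)
3. GIL: Global Interpreter Lock
    The execution pass. There is only one: the thread holding it runs, the rest wait
    The GIL is released while a thread waits on I/O, which is why threads still help I/O-bound work
4. Use cases
    1. Multiprocessing: CPU-bound work
    2. Multithreading: I/O-bound work
      Crawlers: network I/O
      Writing files: local disk I/O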
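
Why threads pay off for I/O-bound work can be shown with a rough timing sketch. Here time.sleep() stands in for waiting on the network, and all function names are hypothetical; the exact timings depend on the machine.

```python
import threading
import time


def fake_download(delay):
    # time.sleep stands in for waiting on the network; the GIL is
    # released while a thread sleeps, so the waits can overlap.
    time.sleep(delay)


def run_sequential(n, delay):
    start = time.perf_counter()
    for _ in range(n):
        fake_download(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(delay,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential = run_sequential(5, 0.1)   # the five waits add up
threaded = run_threaded(5, 0.1)       # the five waits overlap
```

The same experiment with a CPU-heavy function instead of sleep() would show little or no gain from threads, which is the case where multiprocessing is the better fit.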
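5. Multithreading case study: Budejie (百思不得姐)
    1. URL: http://www.budejie.com/1
    2. Target: the text of each joke
    3. XPath expression
      //div[@class="j-r-list-c-desc"]/a/text()
    4. Key points
      1. Queue (from queue import Queue)
        put()
        get()
        Queue.empty(): whether the queue is empty
        Queue.join(): blocks until task_done() has been called for every item put on the queue
      2. Threads (import threading)
        threading.Thread(target=......)
    5. Implementation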
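
The original implementation is not included in these notes. What follows is only a minimal sketch of the queue-plus-threads pattern listed above, under stated assumptions: fetch_page is a hypothetical stub that replaces the real download and XPath step, and three worker threads is an arbitrary choice.

```python
import threading
from queue import Queue, Empty

url_queue = Queue()
results = []
results_lock = threading.Lock()


def fetch_page(url):
    # Hypothetical stand-in: a real crawler would download the page
    # and apply the XPath expression here.
    return "content of " + url


def worker():
    while True:
        try:
            # get_nowait() avoids blocking forever if another thread
            # takes the last URL between an empty() check and get()
            url = url_queue.get_nowait()
        except Empty:
            break
        text = fetch_page(url)
        with results_lock:
            results.append(text)
        url_queue.task_done()


# Enqueue pages 1..5 of the site
for page in range(1, 6):
    url_queue.put("http://www.budejie.com/" + str(page))

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

url_queue.join()      # returns once task_done() was called for every URL
for t in threads:
    t.join()
```

All URLs are queued before the workers start, so a worker can safely exit as soon as the queue is empty. If pages were added while the workers run, a blocking get() with a sentinel value would be the safer design.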

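III. Parsing with BeautifulSoup

1. Definition: an HTML/XML parsing library; the 'lxml' parser used in these notes requires the lxml package

2. Install: python -m pip install beautifulsoup4

Import: from bs4 import BeautifulSoup

3. Workflow

    1. Import the module: from bs4 import BeautifulSoup
    2. Create the parser object
        soup = BeautifulSoup(html,'lxml')
    3. Find nodes
        r_list = soup.find_all("div",attrs={"class":"test"})

4. Sample code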

from bs4 import BeautifulSoup

html = '<div class="fengyun">雄霸</div>'
# Create the parser object
soup = BeautifulSoup(html,'lxml')
r_list = soup.find_all("div",attrs={"class":"fengyun"})
# Use get_text() or the .string attribute to read a node's text
for r in r_list:
    print(r.get_text())
    print(r.string)

03_bs4示例代碼1.py

from bs4 import BeautifulSoup

html = '''
<div class="test1">雄霸</div>
<div class="test1">幽若</div>
<div class="test2">
  <span>第二夢</span>
</div>
'''
# Text of every div node whose class is test1
soup = BeautifulSoup(html,'lxml')
r_list = soup.find_all("div",
                       attrs={"class":"test1"})
for r in r_list:
    print(r.get_text(),r.string)
# Text of the span node inside the div whose class is test2
r_list = soup.find_all("div",
                       attrs={"class":"test2"})
for r in r_list:
    print(r.span.string)

04_bs4示例代碼2.py

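5. Parsers supported by BeautifulSoup

    1. lxml: soup = BeautifulSoup(html,'lxml')
      Fast, tolerant of malformed documents
    2. html.parser: Python standard library
      Moderate speed and tolerance
    3. xml
      Fast; the XML parser supported by BeautifulSoup (requires lxml)

6. Node selectors

Select a node and read its content
    node.tag_name.string

7. find_all(): returns a list

r_list = soup.find_all("tag name",attrs={"attribute name":"attribute value"})

Regular expressions or XPath are usually preferable, since they tend to be faster.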

import requests
from bs4 import BeautifulSoup

url = "https://www.qiushibaike.com/"
headers = {"User-Agent":"Mozilla/5.0"}

# Fetch the page source
res = requests.get(url,headers=headers)
res.encoding = "utf-8"
html = res.text
# Create the parser object and parse
soup = BeautifulSoup(html,'lxml')
r_list = soup.find_all("div",
                       attrs={"class":"content"})
# Print the text of each post and count the posts
i = 0
for r in r_list:
    print(r.span.get_text().strip())
    i += 1
    print("*" * 30)

# Number of posts printed
print(i)

05_糗事百科首頁bs4.py

See also: Using the BeautifulSoup library in Python

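IV. The Scrapy framework

1. Definition

An asynchronous crawling framework that is highly configurable and extensible; one of the most widely used crawling frameworks in Python

2. Installation (Ubuntu)

  1. Install the dependencies
      1. sudo apt-get install python3-dev
      2. sudo apt-get install python3-pip
      3. sudo apt-get install libxml2-dev
      4. sudo apt-get install libxslt1-dev
      5. sudo apt-get install zlib1g-dev
      6. sudo apt-get install libffi-dev
      7. sudo apt-get install libssl-dev
  2. Install Scrapy
      sudo pip3 install Scrapy           #### note the capital S
  3. Verify (Python interactive mode)
      >>> import scrapy                  #### note the lowercase s
  4. Fixing the warning shown when creating a project
      Create a project: scrapy startproject AAA
        Warning: "Warning : .... Cannot import OpenType ...."
        Cause: the installed pyasn1 is too old; upgrading it fixes the warning
        sudo pip3 install pyasn1 --upgrade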

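V. The five Scrapy components

  1. Engine: the core of the framework; it passes data between the other components
  2. Scheduler: receives requests (URLs) from the engine and queues them
  3. Downloader: fetches the page source and returns the response to the spider
  4. Spider: parses responses, extracts data and follow-up URLs
  5. Item Pipeline: processes the extracted data
  Two kinds of middleware sit between these components:
     Downloader Middlewares (between the engine and the downloader)
     Spider Middlewares (between the engine and the spider)

VI. The detailed Scrapy crawl flow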
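
The flow is usually described like this: the engine takes the start URLs from the spider and hands them to the scheduler; the scheduler queues them and gives them back one at a time; the engine sends each request to the downloader; the response goes to the spider's parse(); extracted items go to the item pipeline and follow-up URLs go back to the scheduler, until the queue is empty. The toy model below mimics that loop in plain Python. It does not use or reproduce Scrapy's own classes, and every name and fake page in it is hypothetical.

```python
from collections import deque

# Fake "web": URL -> (page text, links found on the page)
FAKE_PAGES = {
    "http://example.com/1": ("page one", ["http://example.com/2"]),
    "http://example.com/2": ("page two", []),
}


class Scheduler:
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        # Skip URLs that were already queued once
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def downloader(url):
    # Stand-in for a real HTTP download
    return {"url": url, "body": FAKE_PAGES[url]}


def spider_parse(response):
    text, links = response["body"]
    yield {"item": {"url": response["url"], "text": text}}
    for link in links:
        yield {"request": link}


def pipeline(item, storage):
    storage.append(item)


def engine(start_urls):
    scheduler, storage = Scheduler(), []
    for url in start_urls:
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()
        if url is None:
            break
        response = downloader(url)
        for result in spider_parse(response):
            if "item" in result:
                pipeline(result["item"], storage)
            else:
                scheduler.enqueue(result["request"])
    return storage


items = engine(["http://example.com/1"])
```

The real framework runs these steps asynchronously and routes every hand-off through the middlewares; the model keeps only the order in which the components talk to each other.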

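VII. Steps for building a Scrapy project

  1. Create the project
      scrapy startproject <project name>
  2. Define the targets (items.py)
  3. Create the spider
      Change into the spiders folder and run:
        scrapy genspider <spider name> "<domain>"
        e.g. scrapy genspider baiduspider www.baidu.com
  4. Process the data (pipelines.py)
  5. Configure the project (settings.py)
  6. Run the spider
        scrapy crawl <spider name>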

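VIII. Scrapy project structure

Baidu
 ├── Baidu                  : project package
 │   ├── __init__.py
 │   ├── items.py           : defines the structure of the scraped data
 │   ├── middlewares.py     : downloader and spider middlewares
 │   ├── pipelines.py       : item pipelines, process the data
 │   ├── settings.py        : project-wide settings
 │   └── spiders            : folder holding the spiders
 │       ├── baiduspider.py : the spider
 │       └── __init__.py
 │
 └── scrapy.cfg             : basic project configuration, normally left unchanged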

IX. Configuration files in detail

1. settings.py

    # User agent (set your own)
    USER_AGENT = 'Baidu (+http://www.yourdomain.com)'

    # Whether to obey robots.txt; set to False here
    ROBOTSTXT_OBEY = False

    # Maximum number of concurrent requests; the default is 16 (leave it unless you have a specific need)
    CONCURRENT_REQUESTS = 6

    # Download delay in seconds
    DOWNLOAD_DELAY = 1

    # Default request headers
    DEFAULT_REQUEST_HEADERS = {
    #   'User-Agent':'Mozilla/5.0',
        'Accept':     'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }

    # Spider middlewares (rarely used)
    SPIDER_MIDDLEWARES = {
       'Baidu.middlewares.BaiduSpiderMiddleware': 543,
    }

    # Downloader middlewares
    DOWNLOADER_MIDDLEWARES = {
       'Baidu.middlewares.BaiduDownloaderMiddleware': 543,
    }

    # Item pipelines, process the data (important)
    ITEM_PIPELINES = {
        'Baidu.pipelines.BaiduPipelineMySQL': 300,  # 300 is the order value (customarily 0-1000); pipelines with lower values run first
        'Baidu.pipelines.BaiduPipelineMongo': 200,
    }
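
The effect of the two order values can be illustrated with a small sketch. The class names come from the settings above, but their bodies are placeholders (no database is touched), and the loop only models the ordering; it is not how Scrapy loads pipelines internally. The classes are used directly as dict keys to keep the example self-contained, whereas the real setting uses dotted path strings.

```python
class BaiduPipelineMongo:
    # Same method signature Scrapy expects from a pipeline class
    def process_item(self, item, spider):
        item["seen_by"].append("Mongo")   # placeholder for a MongoDB insert
        return item                       # pass the item to the next pipeline


class BaiduPipelineMySQL:
    def process_item(self, item, spider):
        item["seen_by"].append("MySQL")   # placeholder for a MySQL insert
        return item


# Order values copied from the settings above
ITEM_PIPELINES = {
    BaiduPipelineMySQL: 300,
    BaiduPipelineMongo: 200,
}

# Model of the ordering only: the lower value goes first
item = {"title": "demo", "seen_by": []}
for pipeline_cls, _order in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1]):
    item = pipeline_cls().process_item(item, spider=None)
```

With these values the item reaches the Mongo pipeline (200) before the MySQL pipeline (300). Each process_item() has to return the item, otherwise the next pipeline has nothing to work with.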

X. Project: fetch the Baidu home page source and save it to 百度.html

1. scrapy startproject Baidu
2. cd Baidu/Baidu
3. subl items.py (nothing to change in this step)
4. cd spiders
5. scrapy genspider baidu "www.baidu.com"
6. subl baidu.py
    # name            : the spider name
    # allowed_domains : check carefully
    # start_urls      : check carefully

    def parse(self,response):
        with open("百度.html","w",encoding="utf-8") as f:
            f.write(response.text)
7. cd ../
8. subl settings.py
    1. Set ROBOTSTXT_OBEY to False
    2. Add a User-Agent
      DEFAULT_REQUEST_HEADERS = {
          'User-Agent':'Mozilla/5.0',
          ... ...
      }
9. cd spiders, then run scrapy crawl baidu

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
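    # Spider name, used when running the spider
    name = 'baidu'
    # Domains the spider is allowed to crawl
    allowed_domains = ['www.baidu.com']
    # URLs to start crawling from
    start_urls = ['http://www.baidu.com/']

    # parse() is the default callback for start_urls, so keep this name
    def parse(self, response):
        with open("百度.html","w",encoding="utf-8") as f:
            f.write(response.text)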

baidu.py

# -*- coding: utf-8 -*-

# Scrapy settings for Baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Baidu'

SPIDER_MODULES = ['Baidu.spiders']
NEWSPIDER_MODULE = 'Baidu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Baidu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'User-Agent':'Mozilla/5.0',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Baidu.middlewares.BaiduSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Baidu.middlewares.BaiduDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'Baidu.pipelines.BaiduPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py
