Scrapy

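Scrapy is a very comprehensive crawling framework, arguably the Django of the crawling world. It ships with quite a few components, including items for structuring data, pipelines for persistence, and spiders for crawling.

First, as with Django, install it with pip.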

Linux
    pip3 install scrapy

Windows
    a. pip3 install wheel
    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. Change into the download directory and run pip3 install Twisted-xxxxx.whl

    d. pip3 install scrapy  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
    e. pip3 install pywin32  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

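Creating your first Scrapy program

Open a shell.

Create a Scrapy project:

scrapy startproject xxx (project name; xianglong in this walkthrough)

cd xianglong
scrapy genspider chouti chouti.com (the domain given here shows up in the generated spider's allowed_domains and start_urls)
Run the spider (with log output):
scrapy crawl chouti
Run it without log output:
scrapy crawl chouti --nolog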

import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        print(response.text)

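Here parse is a callback: Scrapy wraps the downloaded result in a response object and passes it to parse.

To extract data from the response, use Scrapy's built-in selectors. Pulling in bs4 as well would leave the project with an awkward mix of tools.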

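import scrapy
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

from ..items import XianglongItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/',]

    def parse(self, response):
        """
        Called automatically once a start URL has been downloaded; response wraps everything related to the HTTP response.
        :param response:
        :return:
        """
        # HtmlXPathSelector is deprecated in later Scrapy releases, where response.xpath(...) can be called directly.
        hxs = HtmlXPathSelector(response=response)

        # Find the news entries in the downloaded page.
        # //  searches all descendants; div[@id='content-list'] is the div whose id is content-list
        # /   searches direct children only; div[@class='item'] is a div whose class attribute is item
        items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
        for item in items:
            href = item.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()
            text = item.xpath('.//div[@class="part1"]//a[1]/text()').extract_first()
            item = XianglongItem(title=text, href=href)
            yield item

        # Follow the pagination links with the same callback.
        pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
        for page_url in pages:
            page_url = "https://dig.chouti.com" + page_url
            yield Request(url=page_url, callback=self.parse)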
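The difference between // (all descendants) and / (direct children only) described in the comments can be tried without running a spider. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and is not what Scrapy uses internally. The PAGE markup and the first_link helper are hypothetical stand-ins, not the real chouti.com page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page structure the spider expects.
PAGE = """
<html><body>
  <div id="content-list">
    <div class="item"><div class="part1"><a href="/link/1">First</a></div></div>
    <div class="item"><div class="part1"><a href="/link/2">Second</a></div></div>
    <div class="other"><div class="item"><a href="/nested">Nested</a></div></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# ".//" searches all descendants; the following "/" steps to direct children only.
direct = root.findall(".//div[@id='content-list']/div[@class='item']")

# Using "//" for the second step also matches the item nested one level deeper.
anywhere = root.findall(".//div[@id='content-list']//div[@class='item']")


def first_link(node):
    """Return (text, href) of the first <a> below node, or None if there is none."""
    a = node.find(".//a")
    return (a.text, a.get("href")) if a is not None else None


links = [first_link(n) for n in direct]
print(len(direct), len(anywhere), links)
```

ElementTree cannot select attributes or text nodes with /@href or /text(), which is why the helper reads them from the element instead. Scrapy's selectors accept the full expressions used in the spider.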

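If parse yields an Item object, Scrapy hands it to pipelines.py for processing.

To enable this, the pipeline has to be registered in the settings file.

item/pipelines
Configuration: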
ITEM_PIPELINES = {
	'xianglong.pipelines.XianglongPipeline': 300,
}
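The number attached to each pipeline sets its place in the processing order: Scrapy passes items through lower values first, and values are conventionally chosen in the 0 to 1000 range. As a small sketch of what that means with more than one pipeline, assume a second, hypothetical CleanPipeline (it is not part of the original project):

```python
# Hypothetical settings fragment: CleanPipeline and its priority are
# illustrative additions, not part of the original project.
ITEM_PIPELINES = {
    'xianglong.pipelines.XianglongPipeline': 300,
    'xianglong.pipelines.CleanPipeline': 100,
}

# Lower numbers run first, so sorting by value gives the execution order.
order = [path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)
```

Here CleanPipeline would see each item before XianglongPipeline does, regardless of the order in which the two entries are written in the dict.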

items.py mainly defines the structure of the scraped data.

import scrapy


class XianglongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()

The persistence component: pipelines.py

class XianglongPipeline(object):

    def process_item(self, item, spider):
        self.f.write(item['href']+'\n')
        self.f.flush()

        return item

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        self.f = open('url.log','w')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        self.f.close()

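Persistence means working with a file or a database, so we can open the file handle or database connection once when the spider starts, use it for every item, and close it when the spider finishes.

Once the current page has been processed and we have the links to the following pages, we want the crawler to keep going.

This can be set up as follows: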
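The open, process, close lifecycle can be exercised outside Scrapy by calling the three methods by hand. The sketch below makes several assumptions: JsonLinesPipeline, run_by_hand, and the path argument are hypothetical names introduced for illustration, and the items are plain dicts rather than XianglongItem objects. Inside a real project Scrapy creates the pipeline and calls these methods itself, so a constructor argument like path would need to be supplied some other way.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON line."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Open the file once, when the spider starts.
        self.f = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Close the file once, when the spider finishes.
        self.f.close()


def run_by_hand(items, path):
    """Stand-in for the calls Scrapy makes: open, process each item, close."""
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)


demo_path = os.path.join(tempfile.mkdtemp(), 'items.jl')
run_by_hand(
    [{'title': 'a', 'href': '/link/1'}, {'title': 'b', 'href': '/link/2'}],
    demo_path,
)
with open(demo_path, encoding='utf-8') as fh:
    lines = fh.read().splitlines()
print(lines)
```

Opening the file in open_spider rather than in process_item avoids reopening it for every item, which is the point the paragraph above makes.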

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from ..items import XianglongItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/',]

    def parse(self, response):
        """
        Called automatically once a start URL has been downloaded; response wraps everything related to the HTTP response.
        :param response:
        :return:
        """
        # Build the selector first; the pagination query below depends on it.
        hxs = HtmlXPathSelector(response=response)

        # Collect the pagination links and request each one with the same callback.
        pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
        for page_url in pages:
            page_url = "https://dig.chouti.com" + page_url
            yield Request(url=page_url, callback=self.parse)

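Whenever a Request object is yielded, Scrapy schedules it and, once the page has been downloaded, runs the callback set on that request.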
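To see why yielding requests keeps the crawl going, and why pages that link to each other do not loop forever, here is a toy model in plain Python. It is only a sketch under stated assumptions and not Scrapy's engine: ToyRequest, FAKE_SITE, and crawl are hypothetical names, the example.com URLs are made up, and the seen set stands in for the duplicate-request filtering that Scrapy applies by default. urljoin from the standard library replaces the string concatenation used in the spider.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> relative pagination links found on that page.
FAKE_SITE = {
    'https://example.com/': ['/page/2', '/page/3'],
    'https://example.com/page/2': ['/page/3'],
    'https://example.com/page/3': ['/page/2'],
}


class ToyRequest:
    """Minimal stand-in for a request: a URL plus the callback to run on it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(page_url, links):
    # Yield one follow-up request per pagination link, as the spider does.
    for href in links:
        yield ToyRequest(urljoin(page_url, href), callback=parse)


def crawl(start_url):
    """Toy scheduler: run callbacks, queue yielded requests, skip URLs already seen."""
    queue = deque([ToyRequest(start_url, callback=parse)])
    seen = {start_url}
    visited = []
    while queue:
        request = queue.popleft()
        visited.append(request.url)
        links = FAKE_SITE.get(request.url, [])
        for new_request in request.callback(request.url, links):
            if new_request.url not in seen:
                seen.add(new_request.url)
                queue.append(new_request)
    return visited


visited = crawl('https://example.com/')
print(visited)
```

In a real spider, response.urljoin(page_url) plays the role urljoin plays here, resolving the link against the URL of the page it was found on.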
