Scraping image data with a Python crawler

Reposted article, 2018-06-27 14:44. Tags: image scraping / downloading images with a Python crawler

The previous post covered scraping text; this one explains in detail how to scrape images.

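First, the project layout: this is a Scrapy project named kgc, with the spider, items, pipelines, and settings files discussed below.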

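As usual, I'll walk through the code file by file. (This post covers a single page only; to crawl multiple pages, just uncomment the relevant lines.)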

kgcspider.py

# -*- coding: utf-8 -*-
import scrapy
from kgc.items import KgcItem


class KgcspiderSpider(scrapy.Spider):
    name = 'kgcspider'
    # allowed_domains takes bare domain names, not full URLs:
    # allowed_domains = ['www.kgc.cn']
    start_urls = ['http://www.kgc.cn/list/230-1-6-9-9-0.shtml']

    def parse(self, response):
        # print(response.body.decode())
        # Each selector returns one list per field, in page order.
        titles = response.css('a.yui3-u.course-title-a.ellipsis::text').extract()
        prices = response.css('div.right.align-right>span::text').extract()
        persons = response.css('span.course-pepo::text').extract()
        image_urls = response.css('a.kgc-w>img::attr("src")').extract()
        # Pair the fields up position by position, one item per course.
        for title, price, person, image_url in zip(titles, prices, persons, image_urls):
            item = KgcItem()
            item['title'] = title
            item['price'] = price
            item['persons'] = person
            # ImagesPipeline expects image_urls to be a list of URLs.
            item['image_urls'] = [image_url]
            yield item
        # Uncomment to follow the "next page" link and crawl every page:
        # next_url = response.css('li.next>a::attr("href")').extract_first()
        #
        # if next_url is not None:
        #     yield response.follow(next_url, self.parse)

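Explanation: the text-scraping code from the previous post is unchanged. The new part is image_urls, the list of image URLs. It is iterated together with the other fields, and each URL is stored in the item as a one-element list, because ImagesPipeline expects image_urls to be a list.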
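The pairing step can be tried outside Scrapy. The following is a minimal sketch with plain dicts instead of KgcItem; the helper name build_items and the sample values are made up for illustration. It also assumes the src attributes might be relative, and resolves them against the page URL because ImagesPipeline needs absolute URLs.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.kgc.cn/list/230-1-6-9-9-0.shtml"


def build_items(page_url, titles, prices, persons, image_srcs):
    """Pair up the parallel lists and return one dict per course."""
    items = []
    # zip() stops at the shortest list, so if one selector matches fewer
    # elements, extra entries are dropped and fields may pair up wrongly.
    for title, price, person, src in zip(titles, prices, persons, image_srcs):
        items.append({
            "title": title,
            "price": price,
            "persons": person,
            # A one-element list of absolute URLs, as ImagesPipeline expects.
            "image_urls": [urljoin(page_url, src)],
        })
    return items


# Made-up sample values, not real data from the site.
sample = build_items(
    PAGE_URL,
    ["Course A", "Course B"],
    ["Free", "99"],
    ["120", "45"],
    ["/img/a.jpg", "http://example.com/b.jpg"],
)
for entry in sample:
    print(entry)
```

If the four lists can differ in length on a real page, selecting each course container first and extracting the fields inside it is usually a safer design than zipping page-wide lists.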

items.py

import scrapy


class KgcItem(scrapy.Item):
    # Text fields
    title = scrapy.Field()
    price = scrapy.Field()
    persons = scrapy.Field()

    # Image fields used by ImagesPipeline
    image_urls = scrapy.Field()  # input: URLs to download
    images = scrapy.Field()      # output: results of the downloads

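Explanation: the item class gains two fields for images. image_urls holds the URLs to download, and images is filled in by the image pipeline with information about each downloaded image, such as its storage path (path) and source URL (url). Later you can look an image up by its path.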
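To illustrate looking an image up by its path, here is a minimal sketch. It assumes that after the image pipeline has run, images is a list of dicts with url, path, and checksum keys, where path is relative to the image storage directory. The helper local_image_paths and the sample item are hypothetical, and the file name and checksum are invented.

```python
import os

# Assumed to match the IMAGES_STORE value in this project's settings.
IMAGES_STORE = "/home/yzhl/kgcimages"


def local_image_paths(item, images_store=IMAGES_STORE):
    """Return the on-disk path of every downloaded image of an item."""
    return [
        os.path.join(images_store, info["path"])
        for info in item.get("images", [])
    ]


# Hypothetical item as it might look after the image pipeline has run;
# the file name and checksum below are invented for illustration.
example_item = {
    "title": "Course A",
    "image_urls": ["http://www.kgc.cn/img/a.jpg"],
    "images": [{
        "url": "http://www.kgc.cn/img/a.jpg",
        "path": "full/0a1b2c.jpg",
        "checksum": "d41d8cd9",
    }],
}

paths = local_image_paths(example_item)
print(paths)
```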

pipelines.py

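class KgcPipeline(object):
    def open_spider(self, spider):
        # Runs automatically when the spider starts.
        self.file = open("/home/yzhl/IdeaProjects/kgc/kgc.csv", "w", encoding='utf8')

    def process_item(self, item, spider):
        # Runs once for every item the spider yields.
        # image_urls is a list, so join it into a string before concatenating.
        line = item["title"] + "," + item["price"] + ',' + item["persons"] + ',' + ';'.join(item["image_urls"]) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Runs when the spider finishes, to close the file.
        self.file.close()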

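Explanation: a CSV file can hold the text fields, but it is no longer suitable once images are involved. This pipeline only writes out the text of each item the spider yields, so downloading the images themselves requires extra configuration in the settings file.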
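Because the pipeline builds each row by joining strings with commas, a title that itself contains a comma would corrupt the file. A minimal sketch of the same idea using the standard csv module follows. The class name CsvExportPipeline, the header row, and the path parameter are assumptions made for illustration, not part of the original project.

```python
import csv


class CsvExportPipeline:
    """Hypothetical variant of KgcPipeline that writes rows with csv.writer."""

    def __init__(self, path="kgc.csv"):  # the path parameter is an assumption
        self.path = path

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings.
        self.file = open(self.path, "w", encoding="utf8", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["title", "price", "persons", "image_url"])

    def process_item(self, item, spider):
        # image_urls is a list; join it so the cell holds plain text.
        self.writer.writerow([
            item["title"],
            item["price"],
            item["persons"],
            ";".join(item["image_urls"]),
        ])
        return item

    def close_spider(self, spider):
        self.file.close()
```

The csv writer quotes any field containing a comma, so the columns stay aligned when the file is read back.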

settings.py

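ITEM_PIPELINES = {
    'kgc.pipelines.KgcPipeline': 300,
    # Lower numbers run first, so images are downloaded before the CSV is written.
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Directory where downloaded images are saved
IMAGES_STORE = '/home/yzhl/kgcimages'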

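Explanation: add Scrapy's built-in image pipeline, ImagesPipeline, to ITEM_PIPELINES, and set IMAGES_STORE in the same settings file (not in the item class). IMAGES_STORE is the path of the folder where images are stored, so each downloaded image is saved under that directory. Note that ImagesPipeline requires the Pillow library to be installed.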
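To find a downloaded image on disk, it helps to know how files are named. As far as I know, the default ImagesPipeline names each file after the SHA-1 hash of its URL and saves it as a .jpg under a full/ subdirectory of IMAGES_STORE. The sketch below re-implements that naming for illustration only; it is not Scrapy's own code, the helper expected_image_path is hypothetical, and the behavior may differ between Scrapy versions.

```python
import hashlib
import os

IMAGES_STORE = "/home/yzhl/kgcimages"


def expected_image_path(url, images_store=IMAGES_STORE):
    """Guess where the default image pipeline would save the image for a URL."""
    # Assumption: the file name is the SHA-1 hex digest of the URL.
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(images_store, "full", image_guid + ".jpg")


# The URL here is a made-up example, not a real image on the site.
print(expected_image_path("http://www.kgc.cn/img/a.jpg"))
```

In practice, reading the path value from the item's images field is more reliable than recomputing it, since a customized pipeline can override the naming.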

Getting the path: cd into the target directory and run pwd to print its absolute path.
