Using Python's Scrapy Framework to Crawl an Entire Site's Images and Save Them Locally (Meizitu)


You can clone the full source code from GitHub.

GitHub: https://github.com/williamzxl/Scrapy_CrawlMeiziTu

Scrapy official documentation (Chinese translation): http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Walking through the documentation's tutorial once is basically enough to learn the framework.

 

Step1:

Before you can start crawling, you must create a new Scrapy project. Go to the directory where you want to keep the code and run:

scrapy startproject CrawlMeiziTu

This command creates a CrawlMeiziTu directory with the following contents:

CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            __init__.py
            ...
Next, enter the project directory and generate a spider:

cd CrawlMeiziTu
scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html

After this, the project directory contains the generated spider file Meizitu.py:

CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            Meizitu.py
            __init__.py
            ...
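For reference, the generated Meizitu.py starts out as a near-empty skeleton (roughly Scrapy's default "basic" template; the exact output can vary by Scrapy version):

# -*- coding: utf-8 -*-
import scrapy

class MeizituSpider(scrapy.Spider):
    name = 'Meizitu'
    allowed_domains = ['www.meizitu.com']
    start_urls = ['http://www.meizitu.com/a/list_1_1.html']

    def parse(self, response):
        # parsing logic goes here (filled in at Step5)
        pass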
The files we will mainly edit are settings.py, items.py, pipelines.py, and spiders/Meizitu.py, plus a small main.py launcher.


main.py was added afterwards; it holds just two lines, purely to make running the spider convenient:

from scrapy import cmdline

cmdline.execute("scrapy crawl Meizitu".split())
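Run it with python main.py from the project's top-level directory (the one containing scrapy.cfg); it does exactly the same thing as typing scrapy crawl Meizitu there.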

Step2: Edit settings.py as shown below:
BOT_NAME = 'CrawlMeiziTu'

SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'
ITEM_PIPELINES = {
   'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}
IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True

The key settings are USER_AGENT, the download path (IMAGES_STORE), and the download delay (DOWNLOAD_DELAY).
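One caveat: ROBOTSTXT_OBEY = True makes Scrapy honor the site's robots.txt, so some requests may be filtered out. If pages are unexpectedly skipped, look for "Forbidden by robots.txt" lines in the crawl log and reconsider this setting.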

Step3: Edit items.py.
Items store the information the spider scrapes. Since we are crawling Meizitu, we capture each image set's title, page link, tags, and so on:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title: used as the folder name
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # links to the images themselves
    src = scrapy.Field()
    # alt: used as the image file name
    alt = scrapy.Field()
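A Scrapy Item behaves much like a dict restricted to its declared fields. A minimal sketch of how the spider will fill one in later (the values are made-up placeholders):

from CrawlMeiziTu.items import CrawlmeizituItem

item = CrawlmeizituItem()
item['title'] = ['some-image-set']            # folder/bookkeeping name
item['src'] = ['http://example.com/a/1.jpg']  # image links to download
item['alt'] = ['picture']                     # base name for the saved files
print(item['src'])   # -> ['http://example.com/a/1.jpg']
# item['oops'] = 1 would raise KeyError: the field was never declared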

Step4: Edit pipelines.py
The pipeline processes the information collected in the items: for example, it builds the download folder and image file names from the title, and downloads each image from its link.
# -*- coding: utf-8 -*-
import os
import requests
from CrawlMeiziTu.settings import IMAGES_STORE

class CrawlmeizituPipeline(object):

    def process_item(self, item, spider):
        fold_name = "".join(item['title'])  # kept for reference; per-title folders were abandoned
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            # the image requests need this cookie, otherwise the downloaded images are unviewable
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # all images are saved under a single folder
        dir_path = IMAGES_STORE
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)
        if len(item['src']) == 0:
            # record items without image links so they can be inspected later
            with open('..//check.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")

        for jpg_url, name, num in zip(item['src'], item['alt'], range(100)):
            file_name = name + str(num)
            file_path = '{}//{}.jpg'.format(dir_path, file_name)
            images.append(file_path)
            # skip images that were already downloaded
            if os.path.exists(file_path):
                continue

            with open(file_path, 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)

        return item
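As a design note: Scrapy also ships a built-in ImagesPipeline (it requires the Pillow library) that downloads through the crawler itself, so DOWNLOAD_DELAY and USER_AGENT from settings.py apply automatically instead of going through a separate requests session. A minimal sketch, assuming the item fields are renamed to the names the pipeline expects by default (the item class below is hypothetical):

# settings.py -- swap in the built-in pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'D://pic2'

# items.py -- ImagesPipeline looks for these field names by default
import scrapy

class MeizituImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # list of image URLs to download
    images = scrapy.Field()      # filled in by the pipeline with download results

Reproducing the custom file naming and the Cookie header above would take a subclass that overrides get_media_requests and file_path, which is why a hand-rolled pipeline is a reasonable choice here.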

Step5: Edit the main Meizitu spider.
This is the most important part:
# -*- coding: utf-8 -*-
import scrapy
from CrawlMeiziTu.items import CrawlmeizituItem
#from CrawlMeiziTu.items import CrawlmeizituItemPage
import time
class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    #allowed_domains = ["meizitu.com/"]

    start_urls = []
    last_url = []
    # url.txt records every listing page queued so far, so the crawl resumes
    # from the last URL written there. Seed it with the first listing page,
    # e.g. a single line: http://www.meizitu.com/a/list_1_1.html
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))


    def parse(self, response):
        selector = scrapy.Selector(response)

        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        if '下一頁' in next_pages_text:  # '下一頁' is the site's "next page" link text
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            # record the next listing page so the crawl can resume from url.txt
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)
            yield request

        # collect the link of every image-set page on this listing page
        links = selector.xpath('//h3[@class="tit"]/a/@href').extract()
        for link in links:
            request = scrapy.http.Request(link, callback=self.parse_item)
            time.sleep(1)
            yield request


    # scrape the details of each image-set page
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)

        image_title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        # the image markup comes in two layouts, so try both selectors
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        item['title'] = image_title
        item['url'] = image_url
        item['tags'] = image_tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
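One thing worth knowing about the time.sleep() calls: Scrapy runs on a single-threaded reactor, so sleeping inside a callback stalls every in-flight request, not just the current one. The idiomatic throttle is the DOWNLOAD_DELAY setting from Step2 (optionally with RANDOMIZE_DOWNLOAD_DELAY = True), after which the sleep calls can simply be removed.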
