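I have been learning Python recently and got hooked on its crawler frameworks, so I read a lot of code that experienced developers have posted online. After repeated testing and tweaking, I finally got it working!
Original article: https://blog.csdn.net/ljm_9615/article/details/76694188
My environment is Windows 10 with Python 3.6, and I use PyCharm as the IDE.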
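1. Create the project
In cmd, change into the directory where you want the project to live and run: scrapy startproject doubanmovie
Open the project in PyCharm. The directory layout is as follows:
# spiders folder: write your own spiders here
# items: define the containers that hold the scraped data
# pipelines: process the scraped data
# settings: configure the project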
2. Write the code
Define the data object in items.py so the scraped data is easy to handle and manage. The code is as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # Movie title
    name = scrapy.Field()
    # Movie information
    info = scrapy.Field()
    # Rating
    rating = scrapy.Field()
    # Number of ratings
    num = scrapy.Field()
    # Classic quote
    quote = scrapy.Field()
    # Movie poster image
    img_url = scrapy.Field()
    # Rank number
    id_num = scrapy.Field()
Create my_spider.py under the spiders folder. The complete code is as follows:

import scrapy

from doubanmovie.items import MovieItem


class DoubanMovie(scrapy.Spider):
    # Unique identifier of the spider
    name = 'doubanMovie'
    # Domains the spider may crawl (the attribute name must be plural)
    allowed_domains = ['movie.douban.com']
    # Start page
    start_urls = ['https://movie.douban.com/top250']

    # Quick check that the page can be fetched at all:
    # def parse(self, response):
    #     print(response.body)

    def parse(self, response):
        selector = scrapy.Selector(response)
        # One node per movie
        movies = selector.xpath('//div[@class="item"]')

        for movie in movies:
            # Create a fresh item per movie so yielded items do not share state
            item = MovieItem()

            # Titles of the movie in the various languages
            titles = movie.xpath('.//span[@class="title"]/text()').extract()
            # Join the Chinese and English titles into one string
            name = ''
            for title in titles:
                name += title.strip()
            item['name'] = name

            # Movie information fragments
            infos = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            # Join the fragments into one string
            fullInfo = ''
            for info in infos:
                fullInfo += info.strip()
            item['info'] = fullInfo

            # Rating
            item['rating'] = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0].strip()
            # Number of ratings; [:-3] drops the trailing three-character suffix
            item['num'] = movie.xpath('.//div[@class="star"]/span[last()]/text()').extract()[0].strip()[:-3]

            # Classic quote; some movies have none
            quote = movie.xpath('.//span[@class="inq"]/text()').extract()
            if quote:
                quote = quote[0].strip()
            else:
                quote = 'null'
            item['quote'] = quote

            # Poster image URL
            item['img_url'] = movie.xpath('.//img/@src').extract()[0]
            # Rank number
            item['id_num'] = movie.xpath('.//em/text()').extract()[0]
            yield item

        # Follow the "next page" link, if there is one
        next_page = selector.xpath('//span[@class="next"]/a/@href').extract()
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0]
            yield scrapy.Request(url, callback=self.parse)
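Two small steps in the spider are easy to try out without running a crawl: building the next-page URL and stripping the suffix from the rating count. The sketch below uses only the standard library. The helper names build_next_url and extract_vote_count are made up for illustration, and the sample href and count text are assumptions about what the page returns, not captured output.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://movie.douban.com/top250'


def build_next_url(href, base=BASE_URL):
    """Resolve a relative "next page" href against the list URL."""
    # urljoin keeps the base path when the href is only a query string
    return urljoin(base, href)


def extract_vote_count(text):
    """Return the leading digits of a string such as '12345人评价'."""
    match = re.search(r'\d+', text)
    return match.group(0) if match else ''


# Assumed sample values, shaped like the ones the spider handles
next_url = build_next_url('?start=25&filter=')
votes = extract_vote_count(' 12345人评价 ')
print(next_url)
print(votes)
```

Compared with plain string concatenation and the [:-3] slice, this variant still works if the href becomes an absolute URL or the suffix changes length, at the cost of two extra imports.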
Process the data in pipelines.py
Write the scraped movie information to a JSON file

import json


class DoubanmoviePipeline(object):
    def __init__(self):
        # Open the output file
        self.file = open('data.json', 'w', encoding='utf-8')

    # Called for every item the spider yields
    def process_item(self, item, spider):
        # Serialize the item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write it to the file
        self.file.write(line)
        # Return the item so later pipelines can process it
        return item

    # Called when the spider is opened
    def open_spider(self, spider):
        pass

    # Called when the spider is closed
    def close_spider(self, spider):
        self.file.close()
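Because the pipeline writes one JSON object per line, data.json as a whole is not a single JSON document, so it has to be read back line by line. Here is a minimal sketch of that round trip using an in-memory buffer instead of a real file. The function names and the sample record are made up for illustration.

```python
import io
import json


def dump_items(items, fp):
    """Write each item as one JSON object per line, keeping non-ASCII text readable."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(fp):
    """Read the line-delimited JSON back into a list of dicts."""
    return [json.loads(line) for line in fp if line.strip()]


# Made-up sample record, only to exercise the round trip
sample = [{"name": "示例电影 Sample Movie", "rating": "9.0", "quote": "null"}]

buffer = io.StringIO()
dump_items(sample, buffer)
buffer.seek(0)
restored = load_items(buffer)
print(restored)
```

With ensure_ascii=False the Chinese titles are stored as-is rather than as \uXXXX escapes, which is why the file is opened with encoding='utf-8'.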
Download the images and save them locally

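import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# Note: ImagesPipeline requires the Pillow package to be installed
class ImagePipeline(ImagesPipeline):
    # Schedule a download for the item's image URL
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['img_url'])

    # Called once all image requests for the item have finished
    def item_completed(self, results, item, info):
        # Keep the storage paths of the downloads that succeeded
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # Replace the URL with the list of local paths
        item['img_url'] = image_paths
        return item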
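The list comprehension in item_completed works because results is a list of (success, info) pairs, one per requested image. The sketch below mimics that shape with plain Python objects so the filtering can be tried in isolation. The sample URLs and paths are invented, and a RuntimeError stands in for the failure object Scrapy would pass.

```python
def successful_paths(results):
    """Collect the storage path of every download that succeeded."""
    # info is only indexed when ok is True, so failures are skipped safely
    return [info["path"] for ok, info in results if ok]


# Invented sample data in the (success, info) shape described above
results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/a.jpg"}),
    (False, RuntimeError("download failed")),
]

paths = successful_paths(results)
print(paths)
```

An empty list here is what triggers the DropItem branch in the pipeline.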
Write the scraped information to a database. I used MySQL for testing.

import pymysql

from doubanmovie import settings


class DBPipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            port=3306,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8',
            use_unicode=True)
        # Inserts, deletes, queries and updates all go through the cursor
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                """insert into doubanmovie(name, info, rating, num, quote, img_url, id_num)
                values (%s, %s, %s, %s, %s, %s, %s)""",
                (item['name'],
                 item['info'],
                 item['rating'],
                 item['num'],
                 item['quote'],
                 item['img_url'],
                 item['id_num']))
            # Commit the statement
            self.connect.commit()
        except Exception as error:
            # Log the error instead of stopping the crawl
            spider.logger.error(error)
        return item
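The pipeline assumes that a doubanmovie table with those seven columns already exists. The original post does not show its schema, so the sketch below is only a guess at one that would fit. It uses the standard-library sqlite3 module as a stand-in for MySQL so it can run anywhere. The column types, the save_item helper and the sample row are all assumptions, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# Assumed schema: the column names come from the insert statement,
# the types are a guess
CREATE_TABLE = """
CREATE TABLE doubanmovie (
    id_num  INTEGER PRIMARY KEY,
    name    TEXT,
    info    TEXT,
    rating  TEXT,
    num     TEXT,
    quote   TEXT,
    img_url TEXT
)
"""

INSERT_ROW = """
INSERT INTO doubanmovie (name, info, rating, num, quote, img_url, id_num)
VALUES (?, ?, ?, ?, ?, ?, ?)
"""


def save_item(conn, item):
    """Insert one item using a parameterized statement, then commit."""
    conn.execute(INSERT_ROW, (
        item["name"], item["info"], item["rating"], item["num"],
        item["quote"], item["img_url"], int(item["id_num"]),
    ))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_TABLE)

# Made-up sample item, only to exercise the insert
item = {
    "name": "Sample Movie", "info": "sample info", "rating": "9.0",
    "num": "12345", "quote": "null",
    "img_url": "https://example.com/a.jpg", "id_num": "1",
}
save_item(conn, item)
rows = conn.execute("SELECT id_num, name, rating FROM doubanmovie").fetchall()
print(rows)
```

Passing the values as a separate tuple, as the pipeline also does, lets the driver handle quoting, so titles or quotes containing apostrophes do not break the statement.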
Add the following to settings.py:

# Avoid being blocked with a 403 Forbidden error
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'

# If the crawl then fails with "Forbidden by robots.txt", change ROBOTSTXT_OBEY
# to False so that Scrapy ignores robots.txt and the images download normally

# The number is the priority: the lower the value, the earlier the pipeline runs
ITEM_PIPELINES = {
    'doubanmovie.pipelines.DoubanmoviePipeline': 1,
    'doubanmovie.pipelines.ImagePipeline': 100,
    'doubanmovie.pipelines.DBPipeline': 10,
}

# Custom image storage path
IMAGES_STORE = 'E:\\img\\'

# Database configuration
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'douban'
MYSQL_USER = 'root'
MYSQL_PASSWD = '0000'
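Since lower numbers run first, the order in which an item passes through the three pipelines can be read off by sorting the dictionary by value. A small sketch, with execution_order as a made-up helper name:

```python
ITEM_PIPELINES = {
    'doubanmovie.pipelines.DoubanmoviePipeline': 1,
    'doubanmovie.pipelines.ImagePipeline': 100,
    'doubanmovie.pipelines.DBPipeline': 10,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by ascending priority value."""
    return sorted(pipelines, key=pipelines.get)


order = execution_order(ITEM_PIPELINES)
for path in order:
    print(ITEM_PIPELINES[path], path)
```

With these values the image pipeline runs last, so the JSON file and the database row both receive the original image URL, and img_url is replaced by the local paths only afterwards.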
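3. Errors
Most errors come from a missing module or file, and a quick web search usually turns up the fix!
Some errors came from mixing code from different authors, which introduced logic errors. The error messages make these easy to fix!
Complete project: https://github.com/theSixthDay/doubanmovie.git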