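Installing Scrapy:
1. First, activate your virtual environment.
2. Install from the Douban PyPI mirror (hosted in China), which is much faster from there:
pip install -i https://pypi.douban.com/simple/ scrapy
3. A special-case failure: the install errored out because a C++ compiler was missing. I fixed it by installing VS2015.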
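To confirm that the steps above worked, a quick check can be run with the interpreter of the virtual environment. This is a minimal sketch using only the standard library; the helper names in_virtualenv and is_installed are made up for illustration and are not part of Scrapy or pip.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def is_installed(package_name):
    """Return True when the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


print("virtual environment active:", in_virtualenv())
print("scrapy installed:", is_installed("scrapy"))
```

If the second line reports False, the package most likely went into a different interpreter than the one running the check.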
Basic commands:
scrapy --help
Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory
We will look at each command when we actually need it.
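Creating a project:
This can only be done from the command line: PyCharm does not integrate Scrapy the way it does Django.
Command:
# Note: first cd into the directory where the project should be created
scrapy startproject projectname
No spider template is generated by default; you have to create one with a separate command.
Directory tree: (main is a file I added myself later)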
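The contents of main are not shown here. One common use for such a file is to start a spider from inside the IDE instead of typing the command in a shell, so the following is only a guess at what it might look like. The function names crawl_argv and run_spider are hypothetical, and the sketch assumes the file sits in the project root next to scrapy.cfg.

```python
import os
import sys


def crawl_argv(spider_name):
    """Build the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name):
    """Run a spider in-process, as if `scrapy crawl` had been typed in a shell."""
    # Imported lazily so this module can be loaded without Scrapy installed.
    from scrapy.cmdline import execute

    # Make the project root importable (assumes this file is next to scrapy.cfg).
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(crawl_argv(spider_name))
```

Calling run_spider("jobbole") at the bottom of the file and running it from PyCharm would then let you set breakpoints inside the spider.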
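Creating a spider template:
Just as you would create an app in Django, here you create a spider.
Command:
# Note: this must be run inside the project directory
# Create a spider named jobbole whose root address is blog.jobbole.com; it crawls Jobbole (伯樂在線)
scrapy genspider jobbole blog.jobbole.com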
Generated file:
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    # Name of the spider
    name = "jobbole"
    # Domains the spider is allowed to crawl
    allowed_domains = ["blog.jobbole.com"]
    # URLs the crawl starts from
    start_urls = ['http://blog.jobbole.com']

    # Callback that handles each downloaded response
    def parse(self, response):
        # Use XPath to parse the response and extract data
        # //*[@id="post-110769"]/div[1]/h1
        re_selector = response.xpath('//*[@id="post-110769"]/div[1]/h1/text()')
        re2_selector = response.xpath('/html/body/div[3]/div[1]/h1/text()')
        re3_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

        pass
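The three XPath expressions in parse are different routes to the same h1 element. To see the difference between anchoring on an id and anchoring on a class without running a crawl, here is a small sketch that swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step, and paths must be relative). The sample page is a made-up, heavily simplified stand-in, not the real markup of the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an article page.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="post-110769" class="post">
      <div class="entry-header">
        <h1>Sample article title</h1>
      </div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE_PAGE)

# Anchored on the post id: only matches this one article.
by_id = root.find(".//*[@id='post-110769']/div[1]/h1")

# Anchored on the header class: matches any article using this layout.
by_class = root.find(".//div[@class='entry-header']/h1")

print(by_id.text)
print(by_class.text)
```

Both lookups reach the same element here, but only the class-based one would keep working on a page whose post id differs, which is why that form is usually the more reusable choice in a spider.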
At this point, a spider project has been set up.