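1. Create a Scrapy project with the command: scrapy startproject scrapyspider (scrapyspider is the project name)
2. In the root directory of the new project, create a spider with the command: scrapy genspider myspider www.baidu.com (myspider is the spider name, www.baidu.com is the site to crawl)
3. Open the project in PyCharm. The spider template looks like this: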
    import scrapy


    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/all-posts/']

        def parse(self, response):
            pass
4. In the code above, parse is the function that processes the responses for the URLs in start_urls. It can be filled in as follows:
    # Module-level imports needed by this method:
    from urllib import parse

    from scrapy.http import Request

    def parse(self, response):
        # 1. Extract the article URLs from the list page and hand them to Scrapy to download;
        #    each downloaded page is then passed to parse_detail for field-level parsing.
        post_nodes = response.xpath("//div[@id='archive']/div[contains(@class,'floated-thumb')]/div[@class='post-thumb']/a")
        for post_node in post_nodes:
            image_url = post_node.xpath("img/@src").extract_first()
            url = post_node.xpath("@href").extract_first()
            yield Request(url=parse.urljoin(response.url, url),
                          meta={"front_image_url": parse.urljoin(response.url, image_url)},
                          callback=self.parse_detail)

        # 2. Extract the next-page URL and hand it to Scrapy to download.
        next_url = response.xpath("//a[@class='next page-numbers']/@href").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
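(1) Use Scrapy's Request to hand each article page to the callback parse_detail for processing, passing the cover image URL along with the request.
The added argument: meta={"front_image_url":parse.urljoin(response.url, image_url)}
(2) Issue a Request for the next list page and hand it to the callback parse, that is, this same function, which handles list pages.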
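Both requests pass their target through parse.urljoin, so links found on the page are resolved against response.url before being downloaded. A minimal sketch of that behavior using only the standard library follows; the page URL and the hrefs are made-up examples, not values taken from the site.

```python
from urllib import parse

# Hypothetical list-page URL, standing in for response.url.
base_url = "http://blog.jobbole.com/all-posts/"

# A relative href is resolved against the page it was found on.
relative = parse.urljoin(base_url, "page/2/")

# A root-relative href replaces the whole path.
root_relative = parse.urljoin(base_url, "/110287/")

# An absolute href is returned unchanged.
absolute = parse.urljoin(base_url, "http://example.com/a.jpg")

print(relative)
print(root_relative)
print(absolute)
```

Because absolute URLs pass through untouched, calling urljoin unconditionally is safe whether the site emits relative or absolute links.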
5. The detail-page parsing function, parse_detail
    # Module-level imports needed by this method
    # (JobBoleArticleItem comes from items.py; get_md5 is a helper that is not shown in this post):
    import datetime
    import re

    def parse_detail(self, response):
        article_item = JobBoleArticleItem()

        # Cover image URL passed in from parse() through Request.meta
        front_image_url = response.meta.get("front_image_url", "")
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first()
        # extract_first("") returns "" instead of None when nothing matches
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract_first("").replace('·', '').strip()

        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first("")
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0

        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first("")
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0

        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract_first("")
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0

        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # Note: an earlier version failed here because endswith was misspelled
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)

        article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        try:
            # strptime parses a string into a datetime (strftime goes the other way)
            create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
        except Exception:
            create_date = datetime.datetime.now().date()
        article_item["create_date"] = create_date
        # Scrapy's image pipeline expects a list of image URLs, so wrap the value in a list
        article_item["front_image_url"] = [front_image_url]
        article_item["praise_nums"] = praise_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags

        yield article_item
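(1) Parse the individual fields out of the response and validate each value.
(2) The front_image_url argument that was passed in manually is read back from the response like this: front_image_url = response.meta.get("front_image_url", "")
(3) Put the processed values into the item and yield the item.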
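The validation in step (1) comes down to two patterns: pull the first integer out of a text fragment and default to 0, and parse a date string with a fallback to today's date. The sketch below isolates both with the standard library so they can be tried outside a crawl. The helper names extract_number and parse_date, and the sample strings, are illustrative assumptions rather than part of the original spider.

```python
import datetime
import re


def extract_number(text):
    """Return the first integer found in text, or 0 if there is none."""
    match = re.match(r".*?(\d+).*", text or "")
    return int(match.group(1)) if match else 0


def parse_date(text, fmt="%Y/%m/%d"):
    """Parse text with strptime; fall back to today's date on failure."""
    try:
        return datetime.datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        return datetime.datetime.now().date()


# Sample fragments of the kind an XPath text() query might return.
print(extract_number(" 2 收藏"))
print(extract_number(" 收藏"))
print(parse_date("2017/03/18"))
```

Treating a missing node (None) the same as an empty string keeps the caller free of None checks.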
6. The item has to be defined by hand in items.py, as follows:
    class JobBoleArticleItem(scrapy.Item):
        title = scrapy.Field()
        create_date = scrapy.Field()
        url = scrapy.Field()
        url_object_id = scrapy.Field()
        front_image_url = scrapy.Field()
        front_image_path = scrapy.Field()
        praise_nums = scrapy.Field()
        comment_nums = scrapy.Field()
        fav_nums = scrapy.Field()
        tags = scrapy.Field()
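The url_object_id field is filled by a get_md5 helper that this post calls but never shows. One plausible implementation, assuming the intent is a fixed-length hash of the page URL, might look like the following; this is a guess at the helper, not the author's code.

```python
import hashlib


def get_md5(url):
    """Return the hexadecimal MD5 digest of a URL given as str or bytes."""
    # hashlib works on bytes, so encode text input first.
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


url_object_id = get_md5("http://blog.jobbole.com/all-posts/")
print(len(url_object_id))
```

The digest is always 32 hexadecimal characters, which makes it convenient as a fixed-width key column regardless of how long the original URL is.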
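7. To export the data (to a database or a local file) and to download images, define the file-download and data-storage pipelines in pipelines.py and register them in settings.py.
(1) Store the data in MySQL, using the Twisted framework that Scrapy is built on to make the writes asynchronous (reason: when crawling is fast, synchronous database writes hold the crawl back).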
    import MySQLdb
    import MySQLdb.cursors
    from twisted.enterprise import adbapi


    class JobBoleMysqlTwistedPipline(object):
        def __init__(self, dbpool):
            self.dbpool = dbpool

        # Class method that reads the database configuration from settings
        @classmethod
        def from_settings(cls, settings):
            dbparms = dict(
                host=settings["MYSQL_HOST"],
                db=settings["MYSQL_DBNAME"],
                user=settings["MYSQL_USER"],
                passwd=settings["MYSQL_PASSWORD"],
                charset="utf8",
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True
            )
            # The first argument is the DB-API module name and is case-sensitive
            dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
            return cls(dbpool)

        def process_item(self, item, spider):
            # Use Twisted to run the MySQL insert asynchronously
            self.dbpool.runInteraction(self.do_insert, item)
            # Return the item so that any later pipeline still receives it
            return item

        def do_insert(self, cursor, item):
            insert_sql = """
                insert into jobbole(title, create_date, url, url_object_id, front_image_url, comment_nums, fav_nums, praise_nums, tags)
                values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            """
            cursor.execute(insert_sql, (item["title"], item["create_date"], item["url"], item["url_object_id"],
                                        item["front_image_url"], item["comment_nums"], item["fav_nums"],
                                        item["praise_nums"], item["tags"]))
(2) Image download
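    from scrapy.pipelines.images import ImagesPipeline


    # Subclasses Scrapy's ImagesPipeline and overrides item_completed to record where the image was saved;
    # the download itself is done by Scrapy
    class JobBoleImagePipeline(ImagesPipeline):
        def item_completed(self, results, item, info):
            for ok, value in results:
                # value is a dict with a "path" key only when the download succeeded
                if ok:
                    item["front_image_path"] = value["path"]
            return item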
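Scrapy passes item_completed a list of (ok, value) two-tuples, one per requested image. When ok is True, value is a dict that includes a "path" key; when it is False, value describes the failure instead. The sketch below mimics that shape with plain Python data so the selection logic can be tried without running a crawl. The helper name first_downloaded_path and the sample results are made up for illustration, and the helper returns the first successful path only.

```python
def first_downloaded_path(results):
    """Return the 'path' of the first successful download, or None."""
    for ok, value in results:
        if ok:
            return value["path"]
    return None


# Hand-written stand-in for what the image pipeline would pass in.
sample_results = [
    (False, Exception("download failed")),
    (True, {"url": "http://example.com/a.jpg", "path": "full/abc.jpg"}),
]

print(first_downloaded_path(sample_results))
print(first_downloaded_path([]))
```

Checking ok before indexing into value matters because a failed download does not carry a "path" entry.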
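(3) Update the pipeline configuration in settings.py by adding the new pipelines to ITEM_PIPELINES. The smaller the number after a pipeline, the earlier it runs.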
    ITEM_PIPELINES = {
        # 'webspider.pipelines.JobBoleImagePipeline': 1,
        'webspider.pipelines.JobBoleMysqlTwistedPipline': 1,
    }
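The pipelines also depend on a few settings that this post does not list. from_settings reads MYSQL_HOST, MYSQL_DBNAME, MYSQL_USER and MYSQL_PASSWORD, and Scrapy's ImagesPipeline uses IMAGES_URLS_FIELD and IMAGES_STORE to know which item field holds the image URLs and where to save the files. A possible settings.py fragment is sketched below. Every value is a placeholder assumption, and enabling both pipelines with these priorities is one reasonable arrangement rather than the author's configuration.

```python
# settings.py (fragment); all values are placeholders.

# Read by JobBoleMysqlTwistedPipline.from_settings.
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "change-me"

# Used by Scrapy's ImagesPipeline and its subclasses.
IMAGES_URLS_FIELD = "front_image_url"  # item field holding a list of image URLs
IMAGES_STORE = "images"                # directory where downloaded images are saved

# Lower numbers run first, so the image pipeline can fill
# front_image_path before the item reaches the MySQL pipeline.
ITEM_PIPELINES = {
    "webspider.pipelines.JobBoleImagePipeline": 1,
    "webspider.pipelines.JobBoleMysqlTwistedPipline": 2,
}
```

Scrapy's image pipeline also requires the Pillow package to be installed.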
