Scrapy Basics: Crawling Data


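1. Create a Scrapy project with the command: scrapy startproject scrapyspider (scrapyspider is the project name)
2. In the project's root directory, create a spider with the command: scrapy genspider myspider www.baidu.com (myspider is the spider name, www.baidu.com is the domain to crawl)
3. Open the project in PyCharm. The generated spider template looks like this: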

    import scrapy

    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/all-posts/']

        def parse(self, response):
            pass

4. In the code above, parse is the function that handles the responses for the URLs in start_urls. A filled-in version:

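    # Requires: from urllib import parse; from scrapy.http import Request
    def parse(self, response):
        # 1. Extract each article URL from the list page and hand it to Scrapy to download; the response then goes to parse_detail for field extraction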
        post_nodes =  response.xpath("//div[@id='archive']/div[contains(@class,'floated-thumb')]/div[@class='post-thumb']/a")
        for post_node in post_nodes:
            image_url = post_node.xpath("img/@src").extract_first()
            url = post_node.xpath("@href").extract_first()
            yield Request(url=parse.urljoin(response.url, url), meta={"front_image_url":parse.urljoin(response.url, image_url)}, callback=self.parse_detail)
        # 2. Extract the URL of the next list page and hand it to Scrapy to download
        next_url = response.xpath("//a[@class='next page-numbers']/@href").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse) 

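(1) A Scrapy Request hands each article page to the callback parse_detail, and the cover image URL is passed along with the request
     through the extra argument: meta={"front_image_url":parse.urljoin(response.url, image_url)}
(2) The next list page is requested with parse as its callback, because parse is the function that handles list pages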
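
Both requests above depend on parse.urljoin to turn an href into a full URL. A minimal sketch of that behaviour follows; the page URL and article path are made-up values used only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical list-page URL, used only for illustration.
page_url = "http://blog.jobbole.com/all-posts/page/2/"

# A relative href is resolved against the page it was found on.
relative = urljoin(page_url, "/114690/")

# An href that is already absolute is returned unchanged.
absolute = urljoin(page_url, "http://blog.jobbole.com/114690/")

print(relative)
print(absolute)
```

Because both cases produce a usable URL, the spider does not need to check whether the site emits relative or absolute links. Scrapy's response object also provides response.urljoin(url), which resolves against response.url in the same way.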

5. The article page parsing function, parse_detail

    def parse_detail(self, response):
        article_item = JobBoleArticleItem()
        front_image_url = response.meta.get("front_image_url", "")
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first()
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract_first().replace('·','').strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first()
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first("")
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract_first("")
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # Filter out the comment-count entry ("評論" means "comments"); an earlier failure here was caused by misspelling endswith
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)
        article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        try:
            create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
        except Exception as e:
            create_date = datetime.datetime.now().date()
        article_item["create_date"] = create_date
        # Scrapy's image pipeline expects a list of URLs, so wrap the value in a list
        article_item["front_image_url"] = [front_image_url]
        article_item["praise_nums"] = praise_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags

        yield article_item

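  (1) Extract the individual fields from the response and check each value
  (2) The front_image_url argument that was attached to the request by hand is read back with: front_image_url = response.meta.get("front_image_url", "")
  (3) Put the cleaned values into the item and yield it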
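
The field cleaning in parse_detail repeats the same few patterns, and it calls a get_md5 helper that the article does not show. The sketch below is one possible set of helpers, not the author's code. It assumes get_md5 returns the hex MD5 digest of the URL, and extract_int and parse_date are hypothetical names introduced here.

```python
import datetime
import hashlib
import re


def get_md5(url):
    """Return the hex MD5 digest of a URL (assumed behaviour of the article's helper)."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


def extract_int(text, default=0):
    """Return the first run of digits in text as an int, or default if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


def parse_date(text, fmt="%Y/%m/%d"):
    """Parse a date string; fall back to today's date if it is missing or malformed."""
    try:
        return datetime.datetime.strptime(text.strip(), fmt).date()
    except (AttributeError, ValueError):
        return datetime.datetime.now().date()


print(get_md5("http://blog.jobbole.com/114690/"))
print(extract_int(" 2 收藏"))
print(extract_int(None))
print(parse_date("2018/05/20"))
```

With helpers like these, each of the praise, favourite and comment counts could be reduced to a single extract_int call, and a missing node (where extract_first returns None) would fall back to 0 instead of raising.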

6. The item has to be defined by hand in items.py, as follows:

    class JobBoleArticleItem(scrapy.Item):
        title = scrapy.Field()
        create_date = scrapy.Field()
        url = scrapy.Field()
        url_object_id = scrapy.Field()
        front_image_url = scrapy.Field()
        front_image_path = scrapy.Field()
        praise_nums = scrapy.Field()
        comment_nums = scrapy.Field()
        fav_nums = scrapy.Field()
        tags = scrapy.Field()

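7. Exporting the data (to a database or a local file) and downloading images requires defining the storage and download pipelines in pipelines.py and enabling them in settings.py
  (1) Store the data in MySQL asynchronously through the Twisted framework that Scrapy uses (reason: when crawling is fast, synchronous database writes would hold the crawl back)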

        import MySQLdb
        import MySQLdb.cursors
        from twisted.enterprise import adbapi

        class JobBoleMysqlTwistedPipline(object):

            def __init__(self, dbpool):
                self.dbpool = dbpool
            # Class method that reads the database configuration from settings
            @classmethod
            def from_settings(cls, settings):
                dbparms = dict(
                    host = settings["MYSQL_HOST"],
                    db = settings["MYSQL_DBNAME"],
                    user = settings["MYSQL_USER"],
                    passwd = settings["MYSQL_PASSWORD"],
                    charset = "utf8",
                    cursorclass = MySQLdb.cursors.DictCursor,
                    use_unicode = True
                )
                dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
                return cls(dbpool)
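            def process_item(self, item, spider):
                # Use Twisted to run the MySQL insert asynchronously
                self.dbpool.runInteraction(self.do_insert, item)
                # Return the item so that any later pipeline still receives it
                return item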
            def do_insert(self, cursor, item):
                insert_sql = """
                            insert into jobbole(title, create_date, url, url_object_id, front_image_url, comment_nums, fav_nums, praise_nums, tags)
                            values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                        """
                cursor.execute(insert_sql, (item["title"], item["create_date"], item["url"], item["url_object_id"],
                                                item["front_image_url"], item["comment_nums"], item["fav_nums"],
                                                item["praise_nums"],
                                                item["tags"]))
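
One detail worth checking in do_insert: parse_detail stores front_image_url as a list, while the insert statement binds it to a single column. The sketch below shows one way to build the statement and its parameters so that list values are flattened first. build_insert and COLUMNS are hypothetical names, and joining list values with a comma is an assumption about how the column should be stored.

```python
# Column order taken from the insert statement in do_insert.
COLUMNS = (
    "title", "create_date", "url", "url_object_id", "front_image_url",
    "comment_nums", "fav_nums", "praise_nums", "tags",
)


def build_insert(item, table="jobbole"):
    """Return (sql, params) for a parameterized insert of item into table."""
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = "insert into {}({}) values ({})".format(table, ", ".join(COLUMNS), placeholders)
    params = []
    for column in COLUMNS:
        value = item.get(column)
        # Flatten list values (such as front_image_url) into one string.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        params.append(value)
    return sql, tuple(params)


# Made-up item data, used only for illustration.
sample_item = {
    "title": "Example",
    "url": "http://example.com/post/1/",
    "front_image_url": ["http://example.com/a.jpg"],
    "tags": "python,scrapy",
}
sql, params = build_insert(sample_item)
print(sql)
print(params)
```

Inside do_insert the result could then be passed on with cursor.execute(sql, params), which keeps the values out of the SQL string itself.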

  (2) Image download

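    # Requires: from scrapy.pipelines.images import ImagesPipeline
    # Subclass Scrapy's ImagesPipeline and override item_completed to record the saved file path; Scrapy performs the download itself
    class JobBoleImagePipeline(ImagesPipeline):
        def item_completed(self, results, item, info):
            for ok, value in results:
                if ok:  # on failure, value is a Failure object rather than a dict
                    item["front_image_path"] = value["path"]
            # Return after the loop so the item is passed on even when results is empty
            return item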
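
The results argument of item_completed is a list of (success, value) pairs, one per requested image. On success the value is a dict that includes the stored path, and on failure it is an error object. The sketch below walks that structure with hand-written sample data. first_image_path is a hypothetical helper, and a plain Exception stands in for the Twisted Failure that Scrapy would pass.

```python
def first_image_path(results, default=""):
    """Return the stored path of the first successful download, or default."""
    for ok, value in results:
        if ok:
            return value["path"]
    return default


# Hand-written sample in the (success, value) shape; all values are made up.
sample_results = [
    (False, Exception("download failed")),
    (True, {"url": "http://example.com/a.jpg", "path": "full/abc123.jpg", "checksum": "d41d8c"}),
]

print(first_image_path(sample_results))
print(first_image_path([]))
```

Checking the success flag before reading value["path"] matters because a failed download would otherwise raise inside the pipeline and the item would be lost.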

  (3) Register the new pipelines under ITEM_PIPELINES in settings.py; the lower the number, the earlier the pipeline runs

    ITEM_PIPELINES = {
        # 'webspider.pipelines.JobBoleImagePipeline': 1,
        'webspider.pipelines.JobBoleMysqlTwistedPipline': 1,
    }
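
The pipelines above also read several other values from settings.py. A possible fragment is sketched below, with both pipelines enabled. Every connection value and the images directory are placeholders, not values from the article. IMAGES_URLS_FIELD and IMAGES_STORE are Scrapy's standard settings for the image pipeline, which also needs the Pillow library installed.

```python
# Placeholder connection values read by from_settings; replace them with your own.
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "change-me"

# Item field that holds the list of image URLs to download.
IMAGES_URLS_FIELD = "front_image_url"
# Directory where downloaded images are written (placeholder path).
IMAGES_STORE = "images"

# Lower numbers run first, so the image path is set before the row is written.
ITEM_PIPELINES = {
    "webspider.pipelines.JobBoleImagePipeline": 1,
    "webspider.pipelines.JobBoleMysqlTwistedPipline": 2,
}
```

Giving the two pipelines different numbers makes the execution order explicit instead of leaving it to chance when both share the same priority.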

 

