Crawling the Whole Lagou Site with Scrapy, and an Introduction to CrawlSpider

Reposted article. 2018-10-04 21:53. Tags: crawler / Scrapy / Python

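I. Creating the spider file from a template

Command: the spider is generated from the CrawlSpider template, which is selected with -t crawl, i.e. a command of the form scrapy genspider -t crawl <name> <domain>.

In the generated file, change http to https.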

II. CrawlSpider source code walkthrough

　　1. Official documentation:

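　　　　This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best fit for your particular site or project, but it is generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

　　　　Apart from the attributes inherited from Spider (which you must specify), this class supports one new attribute:

　　　　rules: a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one is used, according to the order in which they are defined in this attribute.

　　　　This spider also exposes an overridable method:

　　　　parse_start_url(response): this method is called for the start_urls responses. It allows the initial responses to be parsed and must return an Item object, a Request object, or an iterable containing any of them.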

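　　　　Crawling rules

　　　　class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

　　　　link_extractor is a Link Extractor object which defines how links are extracted from each crawled page.

　　　　callback is a callable or a string (in which case the method of the spider object with that name is used) to be called for each link extracted with the specified link_extractor. The callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

　　　　Warning

　　　　When writing crawl spider rules, avoid using parse as the callback, because CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work.

　　　　cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

　　　　follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

　　　　process_links is a callable, or a string (in which case the method of the spider object with that name is used), which is called for each list of links extracted from each response using the specified link_extractor. It is mainly used for filtering.

　　　　process_request is a callable, or a string (in which case the method of the spider object with that name is used), which is called with every request extracted by this rule, and must return a request or None (to filter out the request).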
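　　　　A minimal plain-Python sketch of three of the parameters above; it is not Scrapy's own code. default_follow mirrors the documented default for follow, while drop_logout_links and skip_pdf_requests are hypothetical process_links and process_request hooks. The SimpleNamespace objects stand in for Scrapy's link and request objects (only their url attribute is used), and the URLs are made up. Newer Scrapy releases also pass the response to process_request as a second argument, so the hook accepts it optionally.

```python
from types import SimpleNamespace


def default_follow(callback, follow=None):
    """Documented default: follow links only when the rule has no callback."""
    if follow is not None:
        return follow
    return callback is None


def drop_logout_links(links):
    """Hypothetical process_links hook: gets the list of links, returns those to keep."""
    return [link for link in links if "logout" not in link.url]


def skip_pdf_requests(request, response=None):
    """Hypothetical process_request hook: return the request, or None to drop it."""
    return None if request.url.endswith(".pdf") else request


# Stand-ins for Scrapy link/request objects; only the .url attribute is used.
links = [
    SimpleNamespace(url="https://example.com/item.php?id=1"),
    SimpleNamespace(url="https://example.com/logout"),
]
kept = drop_logout_links(links)
```

　　　　In a real spider these hooks would be passed to Rule either as callables or as the names of spider methods.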

　　　　CrawlSpider example

　　　　　　Let's now look at an example CrawlSpider with rules:

 
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = {}  # plain dict: a bare scrapy.Item() declares no fields, so item['id'] = ... would raise KeyError
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item 
           

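　　This spider starts crawling at the example.com home page, collecting category links and item links, and parses the latter with the parse_item method. For each item response, some data is extracted from the HTML using XPath and an item is filled with it.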

　　2. Source code analysis:

　　　　　　CrawlSpider inherits from Spider:

　　　　　　In Spider, the start_requests() and make_requests_from_url() methods iterate over the URLs in start_urls, as follows:

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)

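　　　　With the plain Spider (basic) template you have to write parse() yourself to hold the crawling logic, whereas CrawlSpider already implements it, as shown below. parse() calls _parse_response() with parse_start_url() as the callback. If a callback is set, _parse_response() calls it (parse_start_url() returns an empty list by default) and passes the result through process_results(), which by default returns its argument unchanged. Neither does anything useful unless you override it to add your own logic. _parse_response() then checks whether follow is True and whether self._follow_links is True (True by default; configurable through the CRAWLSPIDER_FOLLOW_LINKS setting) and, if so, iterates over what _requests_to_follow() yields. That method returns nothing unless the response is an HtmlResponse. Otherwise it uses a set to de-duplicate the links extracted from the response, and walks the rules with enumerate() so that each rule's index is available when the request is built: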

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

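　　　　When the spider is instantiated, __init__() calls _compile_rules(), which resolves each rule's callback, process_links and process_request into callables (a string is looked up as a method on the spider). process_links is a hook set on the Rule; you can supply your own to rewrite or filter URLs (for example, when a site uses different domains per region for load balancing). _requests_to_follow() adds each extracted link to seen and calls _build_request(), whose request callback is _response_downloaded(); that method in turn hands the response to _parse_response().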

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
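
　　　　One consequence of get_method() is worth seeing in isolation: a callback given as a string that does not name a method on the spider silently resolves to None, so the rule ends up with no callback. A small sketch with a hypothetical ToySpider class (written for Python 3, so str replaces six.string_types):

```python
class ToySpider:
    """Hypothetical stand-in for a spider with a single callback method."""

    def parse_item(self, response):
        return {"url": response}


def get_method(spider, method):
    """Resolve a callable or a method name, as _compile_rules() does."""
    if callable(method):
        return method
    if isinstance(method, str):
        # A missing attribute yields None instead of raising an error.
        return getattr(spider, method, None)
    return None


spider = ToySpider()
resolved = get_method(spider, "parse_item")  # bound method of ToySpider
missing = get_method(spider, "parse_job")    # no such method, so None
```

　　　　It is therefore worth checking that the name passed as callback in a Rule matches the method name exactly.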

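　　　　Summary:

　　　　　　CrawlSpider inherits from Spider. The entry point is start_requests(), and the default response callback is parse(). parse() calls _parse_response(), which lets us hook in by overriding parse_start_url() and process_results(). After that the rules are applied: the response is handed to each rule's LinkExtractor, whose parameters allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=() (the last one narrows the regions of the page from which links are taken) and so on filter the URLs. _requests_to_follow() takes the resulting links and yields a Request for each one. When such a request has been downloaded, _response_downloaded() looks up the matching rule and passes the response, together with that rule's callback, to _parse_response().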

III. Code for crawling Lagou

　　1. Rules (allow takes regular expressions; it accepts either a tuple or a single string):

    rules = (
        Rule(LinkExtractor(allow=(r'zhaopin/.*',)), follow=True),
        Rule(LinkExtractor(allow=r'gongsi/j\d+\.html'), follow=True),
        # The callback name must match the spider method that parses job pages (parse_job in 3.2).
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'), callback='parse_job', follow=True),
    )
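
　　The allow patterns are applied as regular-expression searches against each URL, so they can be checked with the standard re module before running the crawler. In this sketch the pattern names are made up for illustration, and every URL except the job URL quoted in this article is hypothetical. Note that an unescaped . matches any character, so \.html is the stricter form.

```python
import re

# Pattern names are illustrative only; the patterns follow the rules in this section.
ALLOW_PATTERNS = {
    "listing": r"zhaopin/.*",
    "company": r"gongsi/j\d+\.html",
    "job": r"jobs/\d+\.html",
}


def matching_rules(url):
    """Return the names of all patterns found anywhere in the URL."""
    return [name for name, pattern in ALLOW_PATTERNS.items()
            if re.search(pattern, url)]


job_match = matching_rules("https://www.lagou.com/jobs/4923444.html")
listing_match = matching_rules("https://www.lagou.com/zhaopin/Python/")   # hypothetical URL
company_match = matching_rules("https://www.lagou.com/gongsi/j123.html")  # hypothetical URL
no_match = matching_rules("https://www.lagou.com/about.html")             # hypothetical URL
```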

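　　2. Debugging with scrapy shell to inspect the content (note: a User-Agent must be specified here, otherwise the status code is 200 but no data comes back; pass it with -s, e.g. scrapy shell -s "..." url)

　　　　For example: scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/60.0" https://www.lagou.com/jobs/4923444.html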

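　　3. Item design and item instantiation (request headers are required: set them in the spider class's custom_settings, or override start_requests()):

　　　　3.1 Item design and the functions that process each field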

import scrapy
from scrapy.loader import ItemLoader
# On newer Scrapy releases these processors live in itemloaders.processors.
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from w3lib.html import remove_tags


def replace_splash(value):
    # Remove the "/" separators that surround these values on the page
    return value.replace("/", "")


def handle_strip(value):
    return value.strip()


def handle_jobaddr(value):
    addr_list = value.split("\n")
    # "查看地图" ("View map") is the link text shown next to the address on the page
    addr_list = [item.strip() for item in addr_list if item.strip() != "查看地图"]
    return "".join(addr_list)


def leave_time(value):
    # Keep only the first whitespace-separated token of the publish-time text
    return value.split()[0]


class LagouJobItemLoader(ItemLoader):
    # Custom ItemLoader: take the first extracted value by default
    default_output_processor = TakeFirst()


class LagouJobItem(scrapy.Item):
    # A Lagou job posting
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    salary = scrapy.Field()
    tags = scrapy.Field(
        output_processor=Join(',')
    )
    job_city = scrapy.Field(
        input_processor=MapCompose(replace_splash),
    )
    work_years = scrapy.Field(
        input_processor=MapCompose(replace_splash),
    )
    degree_need = scrapy.Field(
        input_processor=MapCompose(replace_splash),
    )
    job_type = scrapy.Field()
    publish_time = scrapy.Field(
        input_processor=MapCompose(leave_time)
    )
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field(
        input_processor=MapCompose(remove_tags, handle_strip),
        output_processor=Join(',')
    )
    job_addr = scrapy.Field(
        input_processor=MapCompose(remove_tags, handle_jobaddr),
    )
    company_name = scrapy.Field(
        input_processor=MapCompose(handle_strip),
    )
    company_url = scrapy.Field()
    crawl_time = scrapy.Field()
    crawl_update_time = scrapy.Field()
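
　　　　To see what the input and output processors do to a field, here is a rough plain-Python sketch. map_compose, take_first and join are simplified stand-ins written for this illustration, not the Scrapy classes (the real MapCompose also flattens iterable results, for instance), and the sample strings are made up rather than taken from a real page.

```python
def replace_splash(value):
    return value.replace("/", "")


def handle_strip(value):
    return value.strip()


def leave_time(value):
    return value.split()[0]


def map_compose(*functions):
    """Simplified MapCompose: run every value through each function in turn."""
    def process(values):
        for func in functions:
            # Values for which a function returns None are dropped.
            values = [out for out in map(func, values) if out is not None]
        return values
    return process


def take_first(values):
    """Simplified TakeFirst: the first value that is neither None nor empty."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def join(separator):
    """Simplified Join: concatenate all values with the separator."""
    return lambda values: separator.join(values)


# Made-up sample inputs, shaped like lists of extracted text nodes.
job_city = take_first(map_compose(replace_splash, handle_strip)(["/上海 /"]))
publish_time = take_first(map_compose(leave_time)(["2天前  发布于拉勾网"]))
tags = join(",")(["Python", "爬虫"])
```

　　　　The input processor runs on each extracted value as it is added, and the output processor runs once when load_item() is called; that is why tags uses Join while most other fields rely on the loader's default TakeFirst.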


　　　　3.2 Instantiating the item (using an ItemLoader)

    # Both members below belong inside the spider class. They assume that datetime,
    # LagouJobItemLoader, LagouJobItem and the project helper get_md5 are imported
    # at module level.
    custom_settings = {
        "COOKIES_ENABLED": False,
        "DOWNLOAD_DELAY": 1,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': 'user_trace_token=20171015132411-12af3b52-3a51-466f-bfae-a98fc96b4f90; LGUID=20171015132412-13eaf40f-b169-11e7-960b-525400f775ce; SEARCH_ID=070e82cdbbc04cc8b97710c2c0159ce1; ab_test_random_num=0; X_HTTP_TOKEN=d1cf855aacf760c3965ee017e0d3eb96; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; PRE_UTM=; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DsXIrWUxpNGLE2g_bKzlUCXPTRJMHxfCs6L20RqgCpUq%26wd%3D%26eqid%3Dee53adaf00026e940000000559e354cc; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; index_location_city=%E5%85%A8%E5%9B%BD; TG-TRACK-CODE=index_hotjob; login=false; unick=""; _putrc=""; JSESSIONID=ABAAABAAAFCAAEG50060B788C4EED616EB9D1BF30380575; _gat=1; _ga=GA1.2.471681568.1508045060; LGSID=20171015203008-94e1afa5-b1a4-11e7-9788-525400f775ce; LGRID=20171015204552-c792b887-b1a6-11e7-9788-525400f775ce',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        }
    }

    def parse_job(self, response):
        # Fill a LagouJobItem from the job detail page
        item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
        item_loader.add_css("title", ".job-name::attr(title)")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("salary", ".job_request p span.salary::text")
        item_loader.add_xpath("job_city", "//dd[@class='job_request']/p/span[2]/text()")
        item_loader.add_xpath("work_years", "//dd[@class='job_request']/p/span[3]/text()")
        item_loader.add_xpath("degree_need", "//dd[@class='job_request']/p/span[4]/text()")
        item_loader.add_xpath("job_type", "//dd[@class='job_request']/p/span[5]/text()")
        item_loader.add_css("publish_time", ".job_request p.publish_time::text")
        item_loader.add_css("job_advantage", ".job-advantage p::text")
        item_loader.add_css("job_desc", ".job_bt div p")
        item_loader.add_css("job_addr", ".work_addr")
        item_loader.add_css("tags", ".position-label.clearfix li::text")
        item_loader.add_css("company_name", ".job_company dt a img::attr(alt)")
        item_loader.add_css("company_url", ".job_company dt a::attr(href)")
        item_loader.add_value("crawl_time", datetime.datetime.now())
        # item_loader.add_css("crawl_update_time", ".work_addr")
        lagou_item = item_loader.load_item()
        return lagou_item
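
　　　　parse_job calls get_md5, a project helper whose body is not shown in this article. A minimal sketch of what such a helper could look like, assuming it simply returns the hex MD5 digest of the URL so that url_object_id has a fixed length:

```python
import hashlib


def get_md5(url):
    """Assumed implementation: hex MD5 digest of a URL given as str or bytes."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


url_object_id = get_md5("https://www.lagou.com/jobs/4923444.html")
```

　　　　The digest is always 32 hexadecimal characters, which makes it convenient as a fixed-width key column.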


　　　　3.3 Debug output after processing

　　4. Writing the SQL statement (also kept in items.py, for easier management)

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_spider(title, url, url_object_id, tags,salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_addr, company_url, company_name, job_id,crawl_time)
            VALUES (%s, %s, %s, %s, %s, %s ,%s, %s, %s, %s, %s, %s, %s, %s, %s, %s,%s) ON DUPLICATE KEY UPDATE job_desc=VALUES(job_desc)
        """
        # Extract the numeric id from the URL with a regular expression
        job_id = extract_num(self["url"])
        params = (self["title"], self["url"], self['url_object_id'],self['tags'], self["salary"], self["job_city"], self["work_years"], self["degree_need"],
                  self["job_type"], self["publish_time"], self["job_advantage"], self["job_desc"], self["job_addr"],
                  self["company_url"],
                  self["company_name"], job_id,self['crawl_time'].strftime(SQL_DATETIME_FORMAT))

        return insert_sql, params
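
　　get_insert_sql relies on two names that are not defined in this article: extract_num and SQL_DATETIME_FORMAT. The sketch below shows one way they might be defined. Both definitions are assumptions: the format string is a common MySQL DATETIME layout, and extract_num here returns the first run of digits in the string.

```python
import datetime
import re

# Assumed value: a layout that MySQL DATETIME columns accept.
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def extract_num(text):
    """Assumed helper: first run of digits in the text as an int, or 0 if none."""
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else 0


job_id = extract_num("https://www.lagou.com/jobs/4923444.html")
crawl_time = datetime.datetime(2018, 10, 4, 21, 53).strftime(SQL_DATETIME_FORMAT)
```

　　With these in place, the 17 values in params line up one-to-one with the 17 columns and %s placeholders in the statement.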

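　　5. At this point the data can be crawled and saved.

　　6. Notes:

　　　If requests are too frequent, Lagou bans the IP (a common anti-crawling technique; an IP proxy pool gets around it). The page is then no longer returned normally, yet the status code is still 200 (properly it should be 403, which would let us monitor the crawl by status code). This does make crawling harder (for Lagou, you can check whether the URL contains forbidden, filter out such URLs, and then pause the crawler for a while or switch IP). However, the crawlers of search engines such as Baidu and Google also see a 200 status and will index these pages. When they crawl them, they find that the content of all these pages is identical, may treat that as malicious SEO, and lower the site's ranking, which is unfriendly to the site itself.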
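
　　　The check described above can be sketched as a small helper. The names looks_blocked and backoff_delay are made up for this illustration, the block-page URL in the example is hypothetical, and the exponential back-off is one possible way to "pause for a while", not something the original code is known to do.

```python
def looks_blocked(url, status=200, marker="forbidden"):
    """True for a 403, or when the final URL contains the marker (case-insensitive)."""
    return status == 403 or marker in url.lower()


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to pause before retry number `attempt`, doubling up to a cap."""
    return min(cap, base * 2 ** attempt)


normal = looks_blocked("https://www.lagou.com/jobs/4923444.html")
blocked = looks_blocked("https://www.lagou.com/forbidden.html")  # hypothetical block-page URL
```

　　　In a spider, a check like this would typically run on response.url before parsing, so that block pages are neither parsed nor stored.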
