I. Installation and Usage
The third-party library fake_useragent can be used to set a random User-Agent request header.
GitHub ---> https://github.com/hellysmile/fake-useragent
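Install ---> pip3 install fake-useragent
View the User-Agent list ---> http://fake-useragent.herokuapp.com/browsers/0.1.5
The key part is the version number at the end of the URL: after the library is updated, the old version number no longer shows the User-Agent list.
How do you find the current version number? Run pip3 list and check the installed version of fake-useragent.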
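The version lookup can also be done from Python instead of scanning the output of pip3 list. A minimal sketch, assuming Python 3.8+ (for importlib.metadata); the helper names installed_version and browsers_url are hypothetical, and the URL pattern is simply the one quoted above with the version number swapped in:

```python
from importlib import metadata


def installed_version(package="fake-useragent"):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def browsers_url(version):
    """Build the browser-list URL for a given fake-useragent version."""
    return "http://fake-useragent.herokuapp.com/browsers/" + version
```

When the package is installed, browsers_url(installed_version()) gives the URL that matches the local version.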
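Usage
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.ie)       # a random Internet Explorer User-Agent, any version
print(ua.firefox)  # a random Firefox User-Agent, any version
print(ua.chrome)   # a random Chrome User-Agent, any version
print(ua.random)   # a random User-Agent from any browser vendor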
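Outside of Scrapy, the generated string is typically passed as a request header. A minimal sketch using only the standard library's urllib; the helper names random_user_agent and build_request are hypothetical, and fake_useragent is imported lazily so that the request-building part works even where the library is not installed:

```python
import urllib.request


def random_user_agent():
    """Return a random User-Agent string from fake_useragent (imported lazily)."""
    from fake_useragent import UserAgent
    return UserAgent().random


def build_request(url, user_agent):
    """Create a urllib Request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

A call such as build_request("https://example.com", random_user_agent()) returns a request object that can then be passed to urllib.request.urlopen.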
II. Using It in a Scrapy Crawler Project
First, define a custom random-request-header class in middlewares.py.
Model it on the UserAgentMiddleware class in the Scrapy source: scrapy directory ---> downloadermiddlewares ---> useragent.py
useragent.py in the Scrapy source
"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
Defining the random request header class in middlewares.py
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    '''Randomly rotate the User-Agent; the structure largely follows the UserAgentMiddleware class in Scrapy's useragent.py'''

    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # Read RANDOM_UA_TYPE from settings; it defaults to "random" and can be overridden there
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):  # the signature must match the built-in middleware
        def get_ua():
            # Look up the attribute named by RANDOM_UA_TYPE, e.g. ua.random or ua.chrome
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
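Two details in process_request are easy to miss: getattr(self.ua, self.ua_type) turns the configured string into an attribute lookup, and setdefault only fills in the header when none is present. A minimal sketch of both, using a plain dict in place of Scrapy's request headers and a hypothetical StubUserAgent class in place of fake_useragent.UserAgent (both are stand-ins for illustration, not the real objects):

```python
class StubUserAgent:
    """Hypothetical stand-in for fake_useragent.UserAgent with fixed values."""
    chrome = "stub-chrome-agent"
    firefox = "stub-firefox-agent"
    random = "stub-random-agent"


def apply_user_agent(headers, ua, ua_type="random"):
    """Mimic the middleware: look up ua.<ua_type> and set it only if absent."""
    headers.setdefault("User-Agent", getattr(ua, ua_type))
    return headers


empty = apply_user_agent({}, StubUserAgent(), "firefox")
preset = apply_user_agent({"User-Agent": "spider-defined"}, StubUserAgent(), "chrome")
print(empty)   # header filled from the stub's firefox attribute
print(preset)  # existing header is left untouched
```

Because setdefault leaves an existing value alone, a User-Agent that a spider has already put on the request would take precedence over the random one.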
Configuration in settings
DOWNLOADER_MIDDLEWARES = {
    # Register the RandomUserAgentMiddlware class defined in middlewares.py
    'ArticleSpider.middlewares.RandomUserAgentMiddlware': 543,
    # Set Scrapy's default User-Agent middleware to None so that it is not called
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
RANDOM_UA_TYPE = "random"  # or name a specific browser: firefox, chrome, ...
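Setting the built-in middleware to None works because Scrapy merges DOWNLOADER_MIDDLEWARES over its built-in defaults, drops entries whose value is None, and orders the rest by number. A simplified sketch of that merge, not Scrapy's actual implementation; the base dict is a one-entry excerpt with an assumed default order of 500 for the built-in middleware, and effective_middlewares is a hypothetical helper:

```python
def effective_middlewares(base, custom):
    """Merge custom over base, drop disabled (None) entries, sort by order."""
    merged = {**base, **custom}
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


# Assumed excerpt of Scrapy's built-in defaults (illustrative only)
base = {"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500}
custom = {
    "ArticleSpider.middlewares.RandomUserAgentMiddlware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
print(effective_middlewares(base, custom))
```

In this sketch only the custom middleware remains in the resulting list, since the built-in entry has been overridden with None.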
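PS: Once this is configured, remove any User-Agent previously defined in the spider. Subsequent crawls will automatically carry a randomly generated User-Agent, so it no longer needs to be set in each spider.
Reposted from https://www.cnblogs.com/qingchengzi/p/9633616.html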