Method 1: settings.py already contains default request headers
USER_AGENT = 'scrapy_runklist (+http://www.yourdomain.com)'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_runklist.middlewares.ScrapyRunklistDownloaderMiddleware': 543,
}
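- These entries are commented out by default. If you uncomment them, remember to change the values,
- otherwise the target site can easily identify the crawler and block it.
- To check which headers are actually sent, print them in the downloader middleware's process_request: print("request", request.headers)
- The output shows the values configured above.
- However, this is a global setting shared by every spider in the project. How do we customize the headers for an individual spider?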
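The header check described above can be sketched as a tiny downloader middleware. This is only an illustration: the class name HeaderPrintMiddleware is made up here, and the stand-in request at the bottom (a SimpleNamespace with a plain headers dict) is an assumption used so the snippet runs without Scrapy. In a real project Scrapy passes its own Request object.

```python
from types import SimpleNamespace


class HeaderPrintMiddleware:
    """Hypothetical downloader middleware that prints outgoing headers."""

    def process_request(self, request, spider):
        # Same check as in the notes above: show what is about to be sent.
        print("request", request.headers)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# Stand-in for a Scrapy request, only for trying the class outside a crawl.
fake_request = SimpleNamespace(headers={"Accept-Language": "en"})
result = HeaderPrintMiddleware().process_request(fake_request, spider=None)
```

To try it in a project, the class would be registered in DOWNLOADER_MIDDLEWARES in the same way as the entry above. Middlewares with a higher number see the request later, so they also see headers that earlier middlewares have already filled in.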
Method 2: adding your own request headers
- Add a custom_settings attribute directly to the spider class in the spider file.
custom_settings = {
    'LOG_LEVEL': 'DEBUG',
    'LOG_FILE': '5688_log_%s.txt' % time.time(),  # log file name; needs "import time" at the top of the spider file
    # request headers added for this spider
    'DEFAULT_REQUEST_HEADERS': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    },
}
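- Requests from this spider now use the headers we configured ourselves.
- The header entries in settings.py do not need to be commented out; for this spider they no longer take effect.
- Other header fields can be added to the same dictionary.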
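As a sketch of adding further header fields, the spider-level dictionary can be built from a shared base plus per-spider overrides. The names BASE_HEADERS, SPIDER_HEADERS and build_headers are hypothetical, and the Referer value is a placeholder. As far as I can tell, a DEFAULT_REQUEST_HEADERS given in custom_settings replaces the project-level dictionary instead of merging with it, so spelling out the full dictionary avoids relying on either behaviour.

```python
# Project-wide defaults, copied from the settings.py example (assumed shared).
BASE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Per-spider additions; the Referer value is only a placeholder.
SPIDER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    ),
    "Referer": "https://example.com/",
}


def build_headers(base, overrides):
    """Return a new dict: base headers with the overrides applied on top."""
    merged = dict(base)
    merged.update(overrides)
    return merged


custom_settings = {
    "DEFAULT_REQUEST_HEADERS": build_headers(BASE_HEADERS, SPIDER_HEADERS),
}
```

In a real spider, custom_settings is a class attribute, so the last assignment would sit inside the spider class body.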
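Method 3: adding a random request header (User-Agent)
- Step 1: add a list of User-Agent strings to settings.py. The ones below were collected from lists published by others.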
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
- Still in settings.py, configure DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # 'lagou.middlewares.LagouDownloaderMiddleware': 543,
    # replace "lagou" with the name of your own project
    'lagou.middlewares.RandomUserAgentMiddleware': 400,
    # disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
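- Step 2: in middlewares.py, import the USER_AGENT_LIST list from the settings module
import random

from lagou.settings import USER_AGENT_LIST

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # pick one User-Agent at random for each outgoing request
        rand_use = random.choice(USER_AGENT_LIST)
        if rand_use:
            # setdefault only sets the header if it is not already present
            request.headers.setdefault('User-Agent', rand_use)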
- That is all that is needed.
- Running the spider confirms that a User-Agent is picked at random for each request.
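A variation on the middleware above, offered as a sketch and not as the original author's code: it reads USER_AGENT_LIST through crawler.settings in from_crawler instead of importing the settings module, and it assigns the header directly so that a User-Agent set earlier (for example through DEFAULT_REQUEST_HEADERS) does not take precedence. The class name SettingsRandomUserAgentMiddleware and the fake crawler and request objects at the bottom are assumptions, added so the snippet runs without Scrapy.

```python
import random
from types import SimpleNamespace


class SettingsRandomUserAgentMiddleware:
    """Hypothetical middleware: picks a random User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler; settings.getlist
        # returns the USER_AGENT_LIST setting as a Python list.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            # Direct assignment overwrites any User-Agent set earlier.
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None


# Stand-ins for Scrapy's crawler and request, for a quick check only.
agents = ["ua-one", "ua-two", "ua-three"]
fake_crawler = SimpleNamespace(
    settings=SimpleNamespace(getlist=lambda name: agents)
)
middleware = SettingsRandomUserAgentMiddleware.from_crawler(fake_crawler)

seen = set()
for _ in range(50):
    fake_request = SimpleNamespace(headers={"User-Agent": "old"})
    middleware.process_request(fake_request, spider=None)
    seen.add(fake_request.headers["User-Agent"])
```

After the loop, seen holds the User-Agent values that were actually chosen, which is a quick way to confirm that the selection varies between requests.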