Python: Searching Baidu by Keyword and Scraping the Results

Reposted article, 2021-04-05 21:58. Tags: crawler / Scrapy

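Based on: "A hands-on guide to scraping and saving Baidu search results with Python" (Tencent Cloud, Cloud+ Community)

"How to simulate a Baidu search with Python" (FishC forum, Python discussion board)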

Goal: given a keyword, run a Baidu search for it and save the results, recording the title and the link of each hit.

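Approach:

First page: https://www.baidu.com/s?wd=* (replace * with the keyword)

Other pages: https://www.baidu.com/s?wd=*&pn=n (pn is the result offset, so the actual page number is n/10+1; pn=10 is page 2)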
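The two URL patterns above can be folded into one small helper. This is only a sketch: the function name `build_search_url` and the constant `BASE_URL` are made up for illustration, and it assumes ten results per page, as the n/10+1 rule implies.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"  # hypothetical constant, taken from the URLs above


def build_search_url(keyword, page=1):
    """Build the results URL for a 1-based page number (assumes 10 results per page)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    params = {"wd": keyword}
    if page > 1:
        # Page p starts at offset (p - 1) * 10, the inverse of n/10 + 1
        params["pn"] = (page - 1) * 10
    # urlencode percent-encodes the keyword (spaces, '&', non-ASCII characters)
    return BASE_URL + "?" + urlencode(params)
```

For example, `build_search_url("python", 3)` gives a URL ending in `wd=python&pn=20`.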

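1. Build the Baidu search URL from the keyword.

2. Have the spider fetch that URL.

3. Work out the XPath of each search result and record each result's title and URL.

4. Each search result matches the XPath //*[@class="t"]/a. Its title is the text content of that node, and its link is the node's href attribute.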

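# XPath of each search result
//*[@class="t"]/a

# XPath of each title, relative to the result node
.    # just a single dot

# XPath of each link, relative to the result node
./@href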
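To see what these three expressions select, here is a small sketch that runs the same path against a hand-written fragment. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, rather than Scrapy's selectors. The markup in `SAMPLE` and the function name `extract_results` are invented for illustration and are not real Baidu output.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for a results page (an assumption, not real markup)
SAMPLE = """
<div>
  <h3 class="t"><a href="http://example.com/1">Learn <em>Python</em> fast</a></h3>
  <h3 class="t"><a href="http://example.com/2">Scrapy basics</a></h3>
  <p class="other"><a href="http://example.com/3">ignored</a></p>
</div>
"""


def extract_results(markup):
    """Return (title, link) pairs for every <a> directly under a class="t" element."""
    root = ET.fromstring(markup.strip())
    results = []
    for a in root.findall(".//*[@class='t']/a"):
        # itertext() walks nested elements too, so text inside <em> is kept
        title = "".join(a.itertext()).strip()
        results.append((title, a.get("href")))
    return results
```

The anchor under `class="other"` is skipped because its parent does not match the `@class` predicate.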

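5. The extracted title still has to be cleaned up, because the page source wraps the matched keywords in <em> and </em> tags and those tags must be removed. ./text() returns only the anchor's direct text nodes and would skip the words inside <em>, so it cannot be used on its own here. Instead, call extract() to get the element's source and use a regular expression to pull the title out of it.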

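        eles=response.xpath('//*[@class="t"]/a')  # select every search result
        re_bd=re.compile(r'>(.*)</a>', re.S)  # captures everything between the opening tag and </a>
        for ele in eles:
            item=BdItem()  # a fresh item for each result
            name=ele.xpath('.').extract()  # source of the <a> element; extract() returns a list
            name=''.join(name).strip()  # join the list into a single string
            name=name.replace('<em>','').replace('</em>', '')  # remove the <em> and </em> tags
            match=re_bd.search(name)
            if match is None:  # skip anchors the pattern does not match
                continue
            item['name']=match.group(1)  # group(1) is the captured title; groups(1) would return a tuple
            item['link']=ele.xpath('./@href').extract()[0]  # the link is the href attribute
            yield item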
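The cleaning step can also be tried outside Scrapy. The sketch below applies the same idea (strip the <em> tags, then capture what sits between the opening tag and </a>) to a plain string. The function name `clean_title` and the sample markup are assumptions for illustration.

```python
import re

EM_TAG = re.compile(r"</?em>")                     # matches both <em> and </em>
ANCHOR_BODY = re.compile(r">(.*)</a>", re.DOTALL)  # text between the opening tag and </a>


def clean_title(anchor_html):
    """Return the anchor's visible title, or an empty string if nothing matches."""
    without_em = EM_TAG.sub("", anchor_html)
    match = ANCHOR_BODY.search(without_em)
    # group(1) returns the captured string; groups() would return a tuple of all groups
    return match.group(1).strip() if match else ""
```

The `if match` guard matters because `search()` returns `None` when the pattern is absent, and calling `.group()` on `None` raises an `AttributeError`.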

6. The complete code:

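import re
from urllib.parse import quote

import scrapy
from scrapy import Request
from BD.items import BdItem


class BdsSpider(scrapy.Spider):
    name = 'BDS'
    allowed_domains = ['www.baidu.com']
    # The keyword is read once, when the spider class is loaded
    key = input('Enter a keyword: ')
    # quote() percent-encodes spaces, '&' and non-ASCII characters in the keyword
    url = 'http://www.baidu.com/s?wd=' + quote(key)
    start_urls = [url]
    # Captures everything between the end of the opening <a ...> tag and </a>
    re_bd = re.compile(r'>(.*)</a>', re.S)

    def parse(self, response):
        for ele in response.xpath('//*[@class="t"]/a'):
            # Source of the <a> element, with the <em> highlight tags removed
            name = ''.join(ele.xpath('.').extract()).strip()
            name = name.replace('<em>', '').replace('</em>', '')
            match = self.re_bd.search(name)
            if match is None:
                continue
            item = BdItem()  # a fresh item for each result
            item['name'] = match.group(1)  # group(1), not groups(1), returns the string
            item['link'] = ele.xpath('./@href').extract()[0]
            yield item
        # Request page 2 (pn=10). When page 2 is parsed, this same request is
        # produced again and Scrapy's default duplicate filter drops it, so the
        # crawl ends after two pages.
        next_url = self.url + '&pn=10'
        yield Request(url=next_url, callback=self.parse)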
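The spider above hard-codes pn=10, so it never goes past the second page. One possible way to go further is to derive the next offset from the URL of the page just parsed. The helper below is a sketch under the same ten-results-per-page assumption. The name `next_page_url` and the `max_pages` limit are hypothetical and not part of the original spider.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str, max_pages: int = 3) -> Optional[str]:
    """Return the URL of the following results page, or None once max_pages is reached."""
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    offset = int(query.get("pn", ["0"])[0])  # a missing pn means the first page
    current_page = offset // 10 + 1
    if current_page >= max_pages:
        return None
    query["pn"] = [str(offset + 10)]
    # doseq=True writes the list values produced by parse_qs back as plain parameters
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

Inside `parse`, this could be used as `next_url = next_page_url(response.url)`, followed by `yield Request(next_url, callback=self.parse)` only when `next_url` is not `None`.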

7. Run

scrapy crawl BDS -O baidu.csv

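Other notes

Set a User-Agent in the project's settings.py; otherwise Baidu may identify the requests as coming from a crawler and reject them.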
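As a sketch, the setting could look like the excerpt below. `USER_AGENT` is the Scrapy setting name. The browser string is only an example value, not one prescribed by the original article, and any current browser's User-Agent string could take its place.

```python
# settings.py (excerpt)
# Example browser-like User-Agent; the exact string is an assumption for illustration
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/89.0.4389.114 Safari/537.36"
)
```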
