Python: Search Baidu by Keyword and Scrape the Results

This article is a repost. 2021-04-05 21:58 · Crawler / Scrapy

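Based on: 手把手教你用Python爬取百度搜索结果并保存 (a step-by-step guide to scraping and saving Baidu search results with Python), 云+社区, 腾讯云 (Tencent Cloud)

如何利用python模拟百度搜索 (how to simulate a Baidu search with Python), Python交流, 技术交流区, 鱼C论坛 (FishC forum)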

Given a keyword, run a Baidu search for it and save the results, recording the title and link of each result.

Approach:

First page: https://www.baidu.com/s?wd=* (replace * with the keyword)

Other pages: https://www.baidu.com/s?wd=*&pn=n (n is the result offset; the actual page number is n/10+1)
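As a rough illustration of the two URL patterns above, the sketch below builds the search URL for a given keyword and page. The helper name `build_search_url` and its 1-based `page` parameter are assumptions made for this example, not part of the original code. It also percent-encodes the keyword, which the patterns above leave implicit.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"


def build_search_url(keyword, page=1):
    """Return the search URL for `keyword`; `page` is 1-based (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    params = {"wd": keyword}
    if page > 1:
        # pn is the result offset: page = pn / 10 + 1, so pn = (page - 1) * 10
        params["pn"] = (page - 1) * 10
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("python"))
print(build_search_url("python", page=3))
```

The first page omits `pn` entirely, matching the first pattern; every later page adds the offset.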

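1. Build the Baidu search URL from the keyword.

2. Have the spider crawl that URL.

3. Work out the XPath of each result entry, and record each entry's title and URL.

4. Each result entry matches the XPath //*[@class="t"]/a. The title is the text content of that node, and the link is its href attribute.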

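# XPath of each search result entry
//*[@class="t"]/a

# XPath of each entry's title (relative to the entry)
.    # just a single dot

# XPath of each entry's link (relative to the entry)
./@href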
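To see these three expressions at work without running a crawl, here is a small sketch on a hand-written fragment. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Scrapy's selectors. The sample markup and its link values are invented for illustration; real Baidu pages are more complex and may not be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for a results page (not real Baidu markup)
SAMPLE = """
<div>
  <h3 class="t"><a href="http://www.baidu.com/link?url=abc">Learn <em>Python</em> basics</a></h3>
  <h3 class="t"><a href="http://www.baidu.com/link?url=def"><em>Python</em> tutorial</a></h3>
  <h3 class="other"><a href="http://example.com/">not a result</a></h3>
</div>
"""


def parse_results(markup):
    """Return (title, link) pairs for every <a> under an element with class "t"."""
    root = ET.fromstring(markup.strip())
    results = []
    # ElementTree counterpart of //*[@class="t"]/a
    for a in root.findall(".//*[@class='t']/a"):
        title = "".join(a.itertext()).strip()  # node text, including text inside <em>
        link = a.get("href")                   # counterpart of ./@href
        results.append((title, link))
    return results


for title, link in parse_results(SAMPLE):
    print(title, link)
```

The third anchor is skipped because its parent's class is not "t", which is the filtering the XPath predicate provides.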

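5. The extracted title needs cleaning up, because the page source wraps the matched keywords in <em> and </em> tags. The XPath function text() cannot be used directly here, since it returns only the node's own text fragments and misses the text inside those tags. Instead, use extract() to get the raw source of the node, remove the tags, and then pull the title out with a regular expression.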

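        eles=response.xpath('//*[@class="t"]/a')  # select every search result entry
        for ele in eles:
            name=ele.xpath('.').extract()    # raw source of the title node; extract() returns a list
            name=''.join(name).strip()        # join the list items into one string
            name=name.replace('<em>','').replace('</em>', '')  # remove the <em> and </em> tags
            re_bd=re.compile(r'>(.*)</a>')  # pattern capturing the text inside the <a> element
            item['name']=re_bd.search(name).group(1)  # group(1) is the captured title string; groups() would return a tuple
            item['link']=ele.xpath('./@href').extract()[0]  # the link can be extracted directly
            yield item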
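The cleaning step can be tried on its own with just the `re` module. This is a minimal sketch: the function name `extract_title` and the sample anchor are made up for the example. It also shows the difference between `Match.group(1)`, which returns the captured string, and `Match.groups()`, which returns a tuple of all captured groups.

```python
import re

RE_TITLE = re.compile(r'>(.*)</a>')  # text between the opening tag's '>' and '</a>'


def extract_title(anchor_html):
    """Strip <em> highlight tags from an <a> element's source and return its text."""
    cleaned = anchor_html.strip().replace('<em>', '').replace('</em>', '')
    match = RE_TITLE.search(cleaned)
    return match.group(1) if match else None  # None when the pattern does not fit


# Invented sample anchor, shaped like a highlighted search result title
sample = '<a href="http://www.baidu.com/link?url=abc">Learn <em>Python</em> basics</a>'
match = RE_TITLE.search(sample.replace('<em>', '').replace('</em>', ''))

print(match.group(1))   # the captured title as a str
print(match.groups())   # a tuple containing that same str
print(extract_title(sample))
```

Note that `.` does not match newlines by default, so a title whose source spans several lines would need the `re.DOTALL` flag.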

6. The complete code is as follows:

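import re

import scrapy
from scrapy import Request
from BD.items import BdItem


class BdsSpider(scrapy.Spider):
    name = 'BDS'
    allowed_domains = ['www.baidu.com']
    key = input('Enter a keyword: ')  # read once, when the class body is executed
    url = 'http://www.baidu.com/s?wd=' + key
    start_urls = [url]
    re_bd = re.compile(r'>(.*)</a>')  # captures the text inside the <a> element

    def parse(self, response):
        for ele in response.xpath('//*[@class="t"]/a'):
            # Raw source of the <a> node, with the <em> highlight tags removed
            name = ''.join(ele.xpath('.').extract()).strip()
            name = name.replace('<em>', '').replace('</em>', '')
            match = self.re_bd.search(name)
            if match is None:
                continue  # skip entries whose source does not fit the pattern
            item = BdItem()  # a fresh item per result, so yielded items stay independent
            item['name'] = match.group(1)  # group(1) is a string; groups() returns a tuple
            item['link'] = ele.xpath('./@href').extract()[0]
            yield item
        # Request the second page; Scrapy's duplicate filter drops repeats of this URL
        next_url = self.url + '&pn=10'
        yield Request(url=next_url)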
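The spider above always requests the second page (pn=10). One way to follow further pages, sketched here under assumptions, is to derive the next offset from the URL of the page just parsed. The helper `next_page_url` and its `max_pages` limit are hypothetical additions, not part of the original spider.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, max_pages=5):
    """Return the URL of the following results page, or None past `max_pages`.

    Hypothetical helper: assumes the URL carries a `wd` keyword parameter and an
    optional `pn` offset that advances in steps of 10.
    """
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)          # decodes percent-encoded values
    keyword = query["wd"][0]               # raises KeyError if wd is missing
    current_pn = int(query.get("pn", ["0"])[0])
    next_pn = current_pn + 10
    if next_pn // 10 + 1 > max_pages:      # page number = pn / 10 + 1
        return None
    new_query = urlencode({"wd": keyword, "pn": next_pn})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, ""))


print(next_page_url("http://www.baidu.com/s?wd=python"))
print(next_page_url("http://www.baidu.com/s?wd=python&pn=10", max_pages=2))
```

Inside parse, this could stand in for the fixed next_url: compute next_page_url(response.url) and yield a Request only when the result is not None. This assumes response.url still carries the wd and pn parameters.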

7. Run the spider:

scrapy crawl BDS -O baidu.csv

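Other notes

Set a User-Agent in the project settings (settings.py) so that Baidu does not identify the requests as coming from a crawler and reject them.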
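A minimal settings.py fragment for this might look as follows. `USER_AGENT` is a standard Scrapy setting; the browser string itself is only a placeholder assumption, so substitute one that matches a browser you actually use.

```python
# settings.py (fragment)
# Placeholder browser string chosen for illustration; any realistic value works.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/89.0.4389.114 Safari/537.36"
)
```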
