Information Gathering with Python, Google Hacking, and the Baidu Search Engine

Reposted article. 2020-03-23 13:38. Topic: SRC information gathering

Notes on the pitfalls I ran into while scraping Baidu result links with Python:

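1. Getting the domain URLs from the Baidu results page

After BeautifulSoup extracts the href attribute of each a tag, the link turns out to be a Baidu URL rather than the target site. Requesting it with requests follows the redirect by default, so the needed domain is not obtained directly.

The fix is to pass allow_redirects=False to requests.get so that the redirect is not followed (requests follows redirects by default).

Then read the final link from .headers['Location']: final_url = baidu_url.headers['Location']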
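The redirect handling in step 1 can be sketched as follows. The helper names location_of and resolve_baidu_link, and the set of status codes checked, are assumptions made for illustration and do not come from the original script; only requests.get with allow_redirects=False and the Location header are taken from the text.

```python
from typing import Mapping, Optional

# Status codes that carry a Location header pointing at the next URL.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def location_of(status_code: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect target, or None if the response is not a redirect."""
    if status_code in REDIRECT_CODES:
        return headers.get("Location")
    return None


def resolve_baidu_link(href: str, timeout: float = 10) -> Optional[str]:
    """Request a Baidu result link without following its redirect."""
    # Third-party dependency; imported here so location_of() stays dependency-free.
    import requests

    resp = requests.get(href, allow_redirects=False, timeout=timeout)
    return location_of(resp.status_code, resp.headers)
```

Using headers.get instead of headers['Location'] means a response without a redirect yields None rather than raising KeyError, which is convenient when some result links do not redirect.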

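2. Baidu returns inconsistent link types

After resolving the real links, I noticed some odd entries among them.

Visiting them showed that they were not domains matched by the site: query.

Then it occurred to me that they were most likely Baidu ads.

So links containing baidu.php? have to be identified and dropped. The check is a plain substring test, where b stands for the href taken from the search result:

a = "baidu.php?"

b = href

if a in b: skip this link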
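A minimal sketch of that filter is shown here. The function names and the sample links are made up for illustration; the only fact taken from the text is that ad links contain the substring baidu.php?.

```python
AD_MARKER = "baidu.php?"


def is_ad_link(href: str) -> bool:
    """True when the link looks like a Baidu ad according to the substring rule."""
    return AD_MARKER in href


def drop_ads(hrefs):
    """Keep only the links that do not match the ad marker, preserving order."""
    return [h for h in hrefs if not is_ad_link(h)]


# Illustrative input only; these are not real result links.
sample = [
    "http://www.baidu.com/link?url=abc",
    "http://www.baidu.com/baidu.php?url=xyz",
]
print(drop_ads(sample))
```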

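3. Getting past Baidu's security verification

As soon as the pn page parameter is added to the Baidu search URL, the Baidu security verification page appears (it shows up on the very first request, so it is not caused by request frequency).

However, running the same site: query by hand in a browser does not trigger the verification, so my guess was that some HTTP request headers or query parameters were missing.

After a good deal of trial and error with the request parameters, it turned out that two more are needed: "bs" and "rsv_jmp".

Without these two parameters the verification still appears and no data is returned.

With them added, the URLs are retrieved successfully.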
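The query string described above can be assembled with the standard library rather than by string concatenation. This is a sketch under stated assumptions: build_search_url is a hypothetical helper, the parameter values mirror the script at the end of this post (including rsv_jmp=fail and bs carrying the query without the site: prefix), and whether Baidu still accepts these parameters today is not something this sketch can guarantee.

```python
from urllib.parse import quote, urlencode


def build_search_url(domain: str, dork: str, page: int) -> str:
    """Build a Baidu search URL for result page `page` (0-based)."""
    query = f"{domain} {dork}"
    params = {
        "wd": "site:" + query,  # the actual search expression
        "pn": 10 * page,        # offset of the first result on this page
        "ie": "utf-8",
        "bs": query,            # extra parameter the author found necessary
        "rsv_jmp": "fail",      # extra parameter the author found necessary
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return "http://www.baidu.com/s?" + urlencode(params, quote_via=quote)


print(build_search_url("example.com", "inurl:login", 2))
```

Letting urlencode do the escaping avoids broken URLs when a search expression contains characters such as & or #.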

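4. Getting the total number of result pages and iterating over them

I could not find an interface that returns the total page count. Each response shows links to at most 10 pages, and reaching the later ones requires clicking through interactively.

I found no good solution online either, so in the end I settled on a while loop that walks a fixed first N pages.

When tagh3 has length 0, break out of the loop immediately.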
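The bounded loop with an early exit can be sketched independently of the HTTP details. Here fetch_page is a hypothetical callable standing in for "request page N and return its list of h3 results"; it is not part of the original script.

```python
from typing import Callable, Iterator, List


def iter_result_pages(
    fetch_page: Callable[[int], List[str]], max_pages: int = 20
) -> Iterator[List[str]]:
    """Yield the results of pages 0..max_pages-1, stopping at the first empty page."""
    page = 0
    while page < max_pages:
        results = fetch_page(page)
        if not results:  # same idea as len(tagh3) == 0 in the text
            break
        yield results
        page += 1


# Stand-in data source with two non-empty pages, for demonstration only.
fake_pages = {0: ["a", "b"], 1: ["c"]}
for results in iter_result_pages(lambda p: fake_pages.get(p, []), max_pages=5):
    print(results)
```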

5. Deduplicating the links' root domains

Create a set: lines_seen = set()

Before writing each URL, check whether it is already in the set.
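Note that a set of full URL strings only removes exact duplicates. If the goal is one entry per host, a sketch like the following keys the set on the hostname instead. This is an assumption about the intent, not the author's method, and it does not reduce subdomains to a registrable root domain (that would need a public suffix list).

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def unique_by_host(urls):
    """Keep the first URL seen for each hostname, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        host = host_of(url)
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept


# Illustrative URLs only.
urls = ["http://a.example.com/x", "https://A.example.com/y", "http://b.example.com/"]
print(unique_by_host(urls))
```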

The code:

Tip: url_all.txt holds your own Google hacking expressions, one per line.

import requests
from bs4 import BeautifulSoup

TARGET_HEADER = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
}

def write_to_file(url, path='result.txt'):
    # This helper is not shown in the original post; it is assumed here
    # to append one URL per line to an output file (file name is a guess).
    with open(path, 'a', encoding='utf-8') as f:
        f.write(url + '\n')

def main(word):
    with open('url_all.txt', 'r', encoding='utf-8') as f:
        # One search expression per line; blank lines are dropped
        url_list = [line.strip() for line in f if line.strip()]
    for siteurl in url_list:
        lines_seen = set()
        for i in range(20):  # walk the first 20 result pages
            url = ('http://www.baidu.com/s?wd=site:' + str(word) + '%20' + siteurl
                   + '&pn=' + str(10 * i) + '&ie=utf-8&bs=' + str(word) + '%20' + siteurl
                   + '&rsv_jmp=fail')
            print(url)
            # Pass proxies=... here to route the request through a proxy
            response = requests.get(url, headers=TARGET_HEADER, timeout=10)
            soup = BeautifulSoup(response.text, 'lxml')
            tagh3 = soup.find_all('h3')
            if len(tagh3) == 0:
                break  # no results on this page: stop paging
            print('Crawling page ' + str(i + 1))
            for h3 in tagh3:
                a = h3.find('a')
                href = a.get('href') if a else None
                if not href or 'baidu.php?' in href:
                    continue  # skip headings without a link and Baidu ads
                baidus_url = requests.get(url=href, headers=TARGET_HEADER,
                                          allow_redirects=False, timeout=10)
                real_url = baidus_url.headers.get('Location')  # the page's real address
                # Skip links that did not redirect or were already seen
                if real_url and real_url not in lines_seen:
                    print(real_url)
                    lines_seen.add(real_url)
                    write_to_file(real_url)

if __name__ == '__main__':
    word = input("Keyword: ")
    main(word)
