Baidu Search Results Crawler


1. Purpose

Use a crawler script to search Baidu for a keyword and collect the result link URLs and domain information.

The keyword can be combined with GHDB (Google Hacking Database) syntax,

e.g.  inurl:php?id=
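For illustration, a minimal sketch (Python 2, to match the script in section 3) of how such a dork ends up in the Baidu query URL; urllib.quote_plus is used here just for clarity, while the script below passes the keyword through plain string formatting:

#coding=utf-8
import urllib

keyword = 'inurl:php?id='
# URL-encode the dork and place it in the wd= parameter of the Baidu search URL
print 'https://www.baidu.com/s?ie=utf-8&wd=%s&pn=0' % urllib.quote_plus(keyword)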

 

2. Key Techniques

2.1 Use the threading & Queue modules for multi-threaded processing with a configurable thread count

2.2 Use the BeautifulSoup & re modules to match the href attributes of result links

2.3 Use the requests module to issue web requests and obtain the real address behind Baidu's redirect link (r.url); a minimal sketch of 2.2 and 2.3 follows this list

2.4 Baidu returns at most 76 result pages, so the pn parameter goes up to 750 (range(0, 760, 10))

2.5 Save the results to text files, with domains de-duplicated
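For reference, a minimal sketch of items 2.2 and 2.3 in isolation (Python 2, same approach as the full script in section 3; the result-page URL, headers, and the data-click filter are taken from that script, and the keyword "test" is just a placeholder):

#coding=utf-8
import re
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
# fetch the first result page for a placeholder keyword
page = requests.get('https://www.baidu.com/s?ie=utf-8&wd=test&pn=0', headers=headers)
soup = bs(page.content, 'lxml')

# result links carry a data-click attribute; class=None filters out ads/navigation
for a in soup.find_all(name='a', attrs={'data-click': re.compile('.'), 'class': None}):
    r = requests.get(a['href'], headers=headers, timeout=3)
    if r.status_code == 200:
        print r.url   # the real address behind Baidu's redirect link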

 

3. Crawler Script (Python 2)

 

#coding=utf-8

import requests
import re
import Queue
import threading
from bs4 import BeautifulSoup as bs
import os,sys,time

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}


class BaiduSpider(threading.Thread):
    def __init__(self,queue):
        threading.Thread.__init__(self)
        self._queue = queue
    def run(self):
        # pull search-result-page URLs off the shared queue until it is drained
        while not self._queue.empty():
            url = self._queue.get_nowait()
            try:
                #print url
                self.spider(url)
            except Exception,e:
                print e
                pass

    def spider(self,url):
    # 'self' must be the first parameter; without it: takes exactly 1 argument (2 given)
        r = requests.get(url=url,headers=headers)
        soup = bs(r.content,'lxml')
        # result links carry a data-click attribute; class=None filters out ads and navigation links
        urls = soup.find_all(name='a',attrs={'data-click':re.compile('.'),'class':None})
        for link in urls:
            #print link['href']
            # follow Baidu's redirect link to reach the real destination
            new_r = requests.get(url=link['href'],headers=headers,timeout=3)
            if new_r.status_code == 200:
                url_para = new_r.url                                # real URL after the redirect
                url_index_tmp = url_para.split('/')
                url_index = url_index_tmp[0]+'//'+url_index_tmp[2]  # scheme://domain
                print url_para+'\n'+url_index
                # full URLs: append every hit
                with open('url_para.txt','a+') as f1:
                    f1.write(url_para+'\n')
                # domains: only append if not already recorded (simple de-duplication)
                with open('url_index.txt','a+') as f2:
                    with open('url_index.txt', 'r') as f3:
                        if url_index not in f3.read():
                            f2.write(url_index+'\n')
            else:
                print 'no access',link['href']

def main(keyword):
    queue = Queue.Queue()
    # argv arrives in the console encoding; re-encode the keyword as UTF-8 for the query URL
    de_keyword = keyword.decode(sys.stdin.encoding).encode('utf-8')
    print keyword
    # baidu max pages 76 , so pn=750 max
    for i in range(0,760,10):
        #queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d'%(keyword,i))
        queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d'%(de_keyword,i))
    threads = []
    thread_count = 4
    for i in range(thread_count):
        threads.append(BaiduSpider(queue))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print 'Usage: %s keyword'%sys.argv[0]
        sys.exit(-1)
    else:
        main(sys.argv[1])    
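
To run the script (the filename baidu_spider.py below is just an assumed example), pass the keyword or dork as a single quoted argument, e.g. python baidu_spider.py "inurl:php?id=", so the shell does not interpret ? and =. Full result URLs accumulate in url_para.txt and de-duplicated domains in url_index.txt.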

Result screenshot (image omitted)

 

 

4. Points to Improve

4.1 Support for multiple search engines

4.2 Handling of multiple command-line parameters (see the sketch after this list)

4.3 Combining with payloads
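As one possible starting point for 4.1 and 4.2, a hypothetical argument-parsing sketch (Python 2.7 argparse; the option names --engine, --threads and --pages are assumptions, not part of the original script):

#coding=utf-8
import argparse

parser = argparse.ArgumentParser(description='search engine result crawler')
parser.add_argument('keyword', help='search keyword, may include GHDB syntax')
parser.add_argument('--engine', default='baidu', choices=['baidu', 'bing'],
                    help='search engine to crawl (only baidu is implemented above)')
parser.add_argument('--threads', type=int, default=4, help='number of worker threads')
parser.add_argument('--pages', type=int, default=76, help='number of result pages to fetch')
args = parser.parse_args()
# the queue in main() would then be filled according to args.engine and args.pages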

 

5. References

5.1 ADO, ichunqiu course: Python Security Tool Development and Application

5.2 https://github.com/sharpdeep/CrawlerBaidu/blob/master/CrawlerBaidu.py

 

