Scraping Text from Baidu Wenku


 

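  Graduation is coming up and the school wants an internship report. Writing one myself is out of the question, so of course I'm going to copy one. Will Baidu Wenku let you copy for free? No: you have to register as a member and pay before you can copy any of its articles. For us broke students, a bit of technical skill is all we have left, so I wrote a crawler in Python. It prints the document text to the terminal, where you can copy and paste it directly. Shared here for my fellow students to enjoy!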

# Baidu Wenku text scraper


import requests
import re
import json
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Mobile Safari/537.36"
}  # pretend to be a mobile browser


def get_num(url):
    response = requests.get(url, headers=headers).text
    # print(response)
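    result = re.search(
        r'&md5sum=(.*)&sign=(.*)&rtcs_flag=(.*)&rtcs_ver=(.*?)".*rsign":"(.*?)",', response, re.M | re.I)  # find the request parameters
    if result is None:  # page layout changed or the URL is not a document page
        raise ValueError("could not find the reader parameters in the page")
    # print(result.group(1),result.group(2),result.group(3),result.group(4),result.group(5))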
    reader = {
        "md5sum": result.group(1),
        "sign": result.group(2),
        "rtcs_flag": result.group(3),
        "rtcs_ver": result.group(4),
        "width": 176,   
        "type": "org",
        "rsign": result.group(5)
    }

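    result_page = re.findall(
        r'merge":"(.*?)".*?"page":(.*?)}', response)  # (merge tag, page number) for every page
    # Request URL prefix; url[29:-5] is the document ID, assuming a URL of the
    # form https://wenku.baidu.com/view/<id>.html
    doc_url = "https://wkretype.bdimg.com/retype/merge/" + url[29:-5]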
    # Request the pages in batches of at most 10; the last batch may be smaller
    for start in range(0, len(result_page), 10):
        batch = result_page[start:start + 10]
        reader['pn'] = start + 1  # 1-based number of the first page in the batch
        reader['rn'] = len(batch)  # number of pages in the batch
        reader['callback'] = 'sf_edu_wenku_retype_doc_jsonp_%s_%s' % (
            reader['pn'], reader['rn'])
        reader['range'] = '_'.join(k for k, v in batch)
        get_page(doc_url, reader)


def get_page(url, data):
    response = requests.get(url, headers=headers, params=data)
    # print("response.status_code:",response.status_code,"\n","response.url:",response.url,"\n","response.headers:",response.headers,"\n","response.cookies:",response.cookies)
    response = response.text
    response = response.encode(
        'utf-8').decode('unicode_escape')  # turn \uXXXX escapes into readable Chinese text
    response = re.sub(r',"no_blank":true', '', response)  # strip the flag so each text field ends with "}
    result = re.findall(r'c":"(.*?)"}', response)  # collect the text fragments
    result = '\n'.join(result)
    print(result)

if __name__ == '__main__':
    url = ""  # paste the URL of the document you want to scrape here
    get_num(url)
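
The batching step in get_num can be checked without touching the network. The sketch below is only an illustration: `build_batches` is a hypothetical helper that is not part of the script above, and the tags are made-up stand-ins for the "merge" values found in the page. It assumes, as the script does, that pages are requested at most 10 at a time and that `pn` is the 1-based number of the first page in each batch.

```python
def build_batches(page_tags, batch_size=10):
    """Split page tags into request-parameter dicts of at most batch_size pages."""
    batches = []
    for start in range(0, len(page_tags), batch_size):
        chunk = page_tags[start:start + batch_size]
        pn, rn = start + 1, len(chunk)  # first page number, pages in this batch
        batches.append({
            "pn": pn,
            "rn": rn,
            "range": "_".join(chunk),
            "callback": "sf_edu_wenku_retype_doc_jsonp_%s_%s" % (pn, rn),
        })
    return batches


# Hypothetical tags standing in for the values scraped from a 23-page document
tags = ["t%d" % i for i in range(1, 24)]
for batch in build_batches(tags):
    print(batch["pn"], batch["rn"], batch["callback"])
```

Stepping through the list with a stride of 10 means the short final batch and the empty-document case need no special handling.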

  [Figure: screenshot of the scraped text printed in the terminal]
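
The cleaning steps in get_page can also be tried offline on a string. This is a rough sketch, not a description of the real response: `extract_text` is a hypothetical wrapper around the same three steps, and the sample payload is invented for illustration, so the actual format returned by the server may differ.

```python
import re


def extract_text(payload):
    """Apply the same cleaning steps as get_page to a raw response string."""
    # Turn literal \uXXXX escape sequences into the characters they encode
    decoded = payload.encode("utf-8").decode("unicode_escape")
    # Drop the flag so that every text field is directly followed by "}
    cleaned = re.sub(r',"no_blank":true', "", decoded)
    # Collect every "c" text field and join the fragments line by line
    return "\n".join(re.findall(r'c":"(.*?)"}', cleaned))


# Hypothetical payload, made up for illustration only
sample = (r'callback([{"c":"\u5b9e\u4e60","no_blank":true},'
          r'{"c":"\u62a5\u544a"}])')
print(extract_text(sample))
```

The regular expression only matches a field when it is the last key in its object, which is why the `no_blank` flag has to be removed first.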

