Learning web scraping: crawling a novel from Biquge with requests + XPath


I've been into web scraping for a while now, so it's time to actually build something. Heh heh.

Note: this article assumes basic Python, some familiarity with Requests and XPath syntax, and regular expressions.

1. About Requests and XPath

Requests

Requests is an HTTP library written in Python on top of urllib, released under the Apache2 License.
If you have ever used urllib directly, you will have noticed how clumsy it is; Requests is far more convenient and saves a great deal of work (once you have used requests, you will hardly want to go back to urllib). In short, requests is the simplest, most usable HTTP library in Python, and the recommended choice for scraping.
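As a minimal sketch of the API (no request is actually sent here, so it runs offline; the URL is just the site this article scrapes):

```python
import requests

# Build a prepared GET request and inspect it -- this shows exactly what
# requests would send, without touching the network.
session = requests.Session()
req = requests.Request("GET", "https://www.biquyun.com/",
                       headers={"User-Agent": "Mozilla/5.0"})
prepared = session.prepare_request(req)
print(prepared.method)                  # GET
print(prepared.url)                     # https://www.biquyun.com/
print(prepared.headers["User-Agent"])   # Mozilla/5.0
```

In a real fetch you would call `session.send(prepared)`, or more simply `requests.get(url, headers=...)` as the scripts below do.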

XPath

XPath, the XML Path Language, is a language for addressing parts of an XML document.
XPath is based on XML's tree structure and provides the ability to locate nodes within that tree. It was originally proposed as a common syntax model shared between XPointer and XSL, but developers quickly adopted it as a small query language in its own right.
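A small offline sketch of XPath selection with lxml, using an inline HTML fragment shaped like the chapter list this article scrapes (the `<dd><a>` entries and hrefs are made up for illustration):

```python
from lxml import etree

# Parse an HTML fragment and select nodes with XPath expressions.
html = etree.HTML("""
<html><body><dl>
  <dd><a href="/15_15566/1.html">Chapter 1</a></dd>
  <dd><a href="/15_15566/2.html">Chapter 2</a></dd>
</dl></body></html>
""")
hrefs = html.xpath('//dd/a/@href')    # href attribute of every <a> under a <dd>
titles = html.xpath('//dd/a/text()')  # the link texts
print(hrefs)   # ['/15_15566/1.html', '/15_15566/2.html']
print(titles)  # ['Chapter 1', 'Chapter 2']
```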

 

2. The code

# regex + requests + xpath
from lxml import etree
import requests
import re
import warnings
import time

# requests is called with verify=False below, so silence the InsecureRequestWarning
warnings.filterwarnings("ignore")
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

def get_urls(URL):
    # Fetch the table of contents and collect every chapter link
    Html = requests.get(URL, headers=headers, verify=False)
    Html.encoding = 'gbk'  # the site serves GBK without declaring it in the headers
    HTML = etree.HTML(Html.text)
    results = HTML.xpath('//dd/a/@href')
    return results

def get_items(result):
    # Fetch one chapter page and pull the title and body out with a regex
    url = 'https://www.biquyun.com' + str(result)
    html = requests.get(url, headers=headers, verify=False)
    html.encoding = 'gbk'
    pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)
    title, body = re.findall(pattern, html.text)[0]
    items = '\n' * 2 + title + '\n' * 2 + body
    # strip the HTML indentation entities and line-break tags
    items = items.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '').replace('<br />', '')
    return items

def save_to_file(items):
    with open("xiaoshuo1.txt", 'a', encoding='utf-8') as file:
        file.write(items)

def main(URL):
    results = get_urls(URL)
    for ii, result in enumerate(results, 1):
        items = get_items(result)
        save_to_file(items)
        print('{} of {}'.format(ii, len(results)))
#        time.sleep(1)  # optional: throttle requests to be polite to the server

if __name__ == '__main__':
    start_1 = time.time()
    URL = 'https://www.biquyun.com/15_15566/'
    main(URL)
    print('Done!')
    end_1 = time.time()
    print('Scrape time 1:', end_1 - start_1)
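The regular expression used in `get_items` can be checked offline against a small made-up fragment that mimics the site's chapter-page markup:

```python
import re

# Group 1 captures the chapter title inside <h1>; group 2 captures everything
# inside the content <div>. re.S lets '.' match newlines as well.
pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)

# An invented stand-in for a real chapter page.
sample = ('<div class="bookname"><h1>Chapter 1</h1></div>'
          '<div id="content">&nbsp;&nbsp;&nbsp;&nbsp;First line.<br /><br />Second line.</div>')

title, body = re.findall(pattern, sample)[0]
body = body.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '').replace('<br />', '')
print(title)  # Chapter 1
print(body)   # First line.Second line.
```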
        

Run result: (screenshot not preserved in this copy)

#requests+xpath
from lxml import etree
import requests
import warnings
import time

# requests is called with verify=False below, so silence the InsecureRequestWarning
warnings.filterwarnings("ignore")
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

def get_urls(URL):
    # Fetch the table of contents and collect every chapter link
    Html = requests.get(URL, headers=headers, verify=False)
    Html.encoding = 'gbk'  # the site serves GBK without declaring it in the headers
    HTML = etree.HTML(Html.text)
    results = HTML.xpath('//dd/a/@href')
    return results

def get_items(result):
    # Fetch one chapter page and pull the title and body out with XPath
    url = 'https://www.biquyun.com' + str(result)
    html = requests.get(url, headers=headers, verify=False)
    html.encoding = 'gbk'
    html = etree.HTML(html.text)
    resultstitle = html.xpath('//*[@class="bookname"]/h1/text()')
    resultsbody = html.xpath('//*[@id="content"]/text()')
    # join the text nodes, then strip the indentation (\xa0) and extra blank lines
    body = ''.join(resultsbody).replace('\xa0' * 4, '').replace('\r\n\r\n', '\n\n')
    items = resultstitle[0] + '\n' * 2 + body + '\n' * 2
    return items

def save_to_file(items):
    with open("xiaoshuo2.txt", 'a', encoding='utf-8') as file:
        file.write(items)

def main(URL):
    results = get_urls(URL)
    for ii, result in enumerate(results, 1):
        items = get_items(result)
        save_to_file(items)
        print('{} of {}'.format(ii, len(results)))
#        time.sleep(1)  # optional: throttle requests to be polite to the server

if __name__ == '__main__':
    start_2 = time.time()
    URL = 'https://www.biquyun.com/15_15566/'
    main(URL)
    print('Done!')
    end_2 = time.time()
    print('Scrape time 2:', end_2 - start_2)

Run result: (screenshot not preserved in this copy)

ps: actual scraping speed depends on your machine and network. Also, regex matching can occasionally take a very long time, so XPath is the recommended approach.
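You can compare the two extraction strategies yourself with the standard `timeit` module. The fragment below is invented and tiny; real pages are much larger, which is where a backtracking regex can get expensive:

```python
import re
import timeit
from lxml import etree

# An invented stand-in for a chapter page.
sample = ('<div class="bookname"><h1>Chapter 1</h1></div>'
          '<div id="content">line one<br /><br />line two</div>')

pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)

def with_regex():
    # regex extraction, as in the first script
    return re.findall(pattern, sample)[0]

def with_xpath():
    # parse + XPath extraction, as in the second script
    tree = etree.HTML(sample)
    return tree.xpath('//h1/text()')[0], tree.xpath('//*[@id="content"]/text()')

t_re = timeit.timeit(with_regex, number=1000)
t_xp = timeit.timeit(with_xpath, number=1000)
print('regex: %.4fs  xpath: %.4fs' % (t_re, t_xp))
```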

 

Pitfalls hit while writing the scraper:

1. Garbled Chinese text in the scraped page

Solution: find out which encoding the page actually uses, then set it on the response before reading `response.text`:

print(response.encoding)  # the encoding requests guessed from the HTTP headers
print(requests.utils.get_encodings_from_content(response.text)[0])  # the charset declared inside the page itself
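The mojibake happens because requests falls back to ISO-8859-1 (latin-1) when the `Content-Type` header declares no charset, while this site actually sends GBK. A minimal offline sketch of the failure and the fix (the sample text is arbitrary):

```python
# Simulate a GBK page decoded with the wrong codec, and recover it.
raw = '爬蟲入坑'.encode('gbk')                    # the bytes a GBK page actually sends
garbled = raw.decode('latin-1')                   # what response.text looks like by default
fixed = garbled.encode('latin-1').decode('gbk')   # round-trip back to the real text
print(garbled)  # mojibake
print(fixed)    # 爬蟲入坑
```

With requests, the clean fix is to set `response.encoding = 'gbk'` before reading `response.text`, as the scripts above do, or `response.encoding = response.apparent_encoding` to let requests guess from the page bytes.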


