Python Crawler, Part 1: Scraping Content Across Paginated Pages

Reposted article, 2017-08-29 15:40, python crawler

　　　　　　　　　　　　　　　　　　　　　--chenjianwen

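A change in thinking: I have been working on crawlers lately and find them a lot of fun, but I keep running into annoyances along the way, site pagination being one of them. Until now, whenever I met a paginated site I would look up its total page count, define a range(), and use a for loop to build each page URL by concatenation. Over time that approach started to feel crude. It bothered me for a while, and I searched all over without finding a suitable alternative. Fortunately, I got it working today.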

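That said, a couple of days later I learned about multiprocessing (http://www.cnblogs.com/chenjw-note/articles/7454218.html), and after that this method felt too slow. Grabbing the total page count is faster after all...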
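To illustrate why a known page count helps: once the last page number is known, every page URL can be generated up front with range() and handed to a pool of workers, whereas following next-page links forces each request to wait for the previous one. The sketch below is written for Python 3 and rests on several assumptions. The `index_N.html` URL layout is only inferred from the last-page link (`index_100.html`) that appears in the example later in this post, `build_page_urls` and `crawl_all` are hypothetical names, the fetch function is supplied by the caller, and a thread pool from `concurrent.futures` stands in for the multiprocessing setup of the linked post.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed URL layout: page 1 is Index.html, pages 2..N are index_N.html.
FIRST_PAGE = "http://www.szwj72.cn/article/hsyy/Index.html"
PAGE_PATTERN = "http://www.szwj72.cn/Article/hsyy/index_{}.html"


def build_page_urls(last_page):
    """Return the URLs of pages 1..last_page under the assumed layout."""
    return [FIRST_PAGE] + [PAGE_PATTERN.format(n) for n in range(2, last_page + 1)]


def crawl_all(urls, fetch, workers=8):
    """Call fetch(url) for every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

A call such as `crawl_all(build_page_urls(100), fetch)` would then process all pages, where `fetch` is whatever function downloads and parses a single page.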

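Approach: scrape the home page and match the link to the next page --> scrape the required information from that next page --> extract the following next-page link from it (a while loop handles this part) --> finally, run a for loop inside the while loop to scrape the content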
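The control flow described above can be sketched without touching the network. In this Python 3 sketch, `PAGES` is a made-up in-memory site, and `fetch`, `NEXT_RE`, `ITEM_RE`, and `crawl` are hypothetical names chosen for illustration; the markup is invented and does not match the real site. Unlike the flow above, the sketch also collects the items on the starting page. The while loop follows next-page links until a page has none, and the inner for loop collects the items on each page.

```python
import re

# Hypothetical in-memory "site": URL -> HTML. The markup is invented for this sketch.
PAGES = {
    "/list/1": '<h2><a href="/a/1">First</a></h2><a class="next" href="/list/2">next</a>',
    "/list/2": '<h2><a href="/a/2">Second</a></h2><h2><a href="/a/3">Third</a></h2>'
               '<a class="next" href="/list/3">next</a>',
    "/list/3": '<h2><a href="/a/4">Fourth</a></h2>',  # last page: no next link
}

NEXT_RE = re.compile(r'<a class="next" href="(.*?)">')
ITEM_RE = re.compile(r'<h2><a href="(.*?)">(.*?)</a></h2>')


def fetch(url):
    """Stand-in for an HTTP request; returns the HTML of one page."""
    return PAGES[url]


def crawl(start_url):
    """Follow next-page links from start_url and collect (href, title) pairs."""
    results = []
    url = start_url
    while url is not None:                          # one iteration per page
        html = fetch(url)
        for href, title in ITEM_RE.findall(html):   # items on this page
            results.append((href, title))
        match = NEXT_RE.search(html)
        url = match.group(1) if match else None     # stop when there is no next link
    return results
```

Because the loop condition depends only on whether a next link was found, the crawler never needs to know the total number of pages.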

1. Use a while loop to walk through the pages

2. Run a for loop inside the while loop to get the details

3. A concrete example with comments, url: http://www.szwj72.cn/article/hsyy/Index.html

#!/usr/bin/env python
# _*_ coding: utf-8 _*_

import urllib2
import re
## Work around Python 2.7 encoding problems
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

url = 'http://www.szwj72.cn/article/hsyy/Index.html'
send_headers = {
    'Host': 'www.szwj72.cn',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Connection': 'keep-alive'
}

# Download a page and convert it from GBK to UTF-8
def getHtml(page_url):
    req = urllib2.Request(page_url, headers=send_headers)
    return urllib2.urlopen(req).read().decode('gbk', 'ignore').encode('utf-8')

# Match the pager paragraph that contains the next-page link.
# The literal link text must match the page source exactly.
html = getHtml(url)
reg = re.compile(r'<p class="(.*?)</a> <a href="//(www.szwj72.cn/.*?.html">下一頁)</a> <a href="//www.szwj72.cn/Article/hsyy/index_100.html">尾頁</a></p>')

# Extract the article links (href, title, link text) from a page's HTML
def getAtUrl(html):
    reg = re.compile(r'<h2><a href="(.*?)" title="(.*?)">(.*?)</a></h2>')
    return re.findall(reg, html)

while True:
    pager = re.findall(reg, html)
    # Stop once the page has no next-page link, i.e. the pages are exhausted
    if not pager:
        break
    # Cut the next-page address out of the match
    next_page = 'http://' + pager[0][1].split('<a')[-1].split('href="//')[1].split('">下一頁')[0]   # this match-and-split chain feels pretty clumsy, haha
    # Request the next page once; its HTML yields both the article links
    # and, on the next iteration, the address of the page after it
    html = getHtml(next_page)
    for i in getAtUrl(html):
        print i[0], i[1]
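The chained split() calls in the loop above can be avoided by matching only the anchor whose text is the next-page label and resolving its protocol-relative href with urljoin. The following is a Python 3 sketch under assumptions: `find_next_page` is a hypothetical helper, the sample pager markup is modeled on the regular expression above rather than copied from the live site, and the `label` argument has to be whatever text the page really uses.

```python
import re
from urllib.parse import urljoin


def find_next_page(html, base_url, label="下一頁"):
    """Return the absolute URL of the link whose text is `label`, or None."""
    # [^"<>]* keeps the match inside a single href attribute, so the pattern
    # cannot start at an earlier anchor and run across several tags.
    pattern = re.compile(r'<a href="([^"<>]*)">' + re.escape(label) + r'</a>')
    match = pattern.search(html)
    if match is None:
        return None  # no next-page link: this is the last page
    # urljoin resolves protocol-relative ("//host/path") and relative hrefs.
    return urljoin(base_url, match.group(1))


if __name__ == "__main__":
    # Sample pager markup, assumed for illustration only.
    sample = ('<p class="pages"><a href="//www.szwj72.cn/Article/hsyy/Index.html">首頁</a> '
              '<a href="//www.szwj72.cn/Article/hsyy/index_2.html">下一頁</a> '
              '<a href="//www.szwj72.cn/Article/hsyy/index_100.html">尾頁</a></p>')
    print(find_next_page(sample, "http://www.szwj72.cn/article/hsyy/Index.html"))
    # prints http://www.szwj72.cn/Article/hsyy/index_2.html
```

Returning None on the last page gives the calling loop an explicit stop condition, and the pattern no longer depends on the hard-coded last-page address.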
