Python: crawling the URLs and the CSS and JS file addresses that a single web page needs to load

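From studying Python crawlers, I learned that you can find the content you need (titles, images, articles, and so on) by matching it with regular expressions. I come at Python crawling from a testing angle: I want to collect the CSS, JS, and URL addresses that a page needs when it is visited, request each of those addresses, and judge from the response status codes whether all of them can be accessed successfully.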

Code

'''
Created on 2017-08-02

@author: Lebb
'''

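import sys
import urllib2
import re

# Python 2 workaround that makes utf-8 the default string encoding.
reload(sys)
sys.setdefaultencoding('utf-8')

url = "https://www.szrtc.cn/"
http = "http"
# Headers was never defined in the original listing; this User-Agent value is an assumed placeholder.
Headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, headers=Headers)
responsecode = None
errorcount = 0
itemurl = url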



def getResponse():
    # Fetch the page and return its HTML; on a failed request, print the error and return None.
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError as he:
        print(he.code)
    except urllib2.URLError as ue:
        print(ue.reason)
    else:
        return response.read().decode('utf-8')



def getUrl():
    # Collect every address the page references: CSS, images, links, JS, and onclick targets.
    html = getResponse()
    if html is None:
        # The page itself could not be fetched, so there is nothing to scan.
        return []
    patterncss = '<link href="(.*?)"'
    patternjs = '<script src="(.*?)"'
    patternimg = '<img src="(.*?)"'
    patternpage = '<a.*?href="(.*?)"'
    patternonclick = "openQuestion.*?'(.*?)'"
    href = re.compile(patterncss, re.S).findall(html)
    href += re.compile(patternimg, re.S).findall(html)
    href += re.compile(patternpage, re.S).findall(html)
    href += re.compile(patternjs, re.S).findall(html)
    href += re.compile(patternonclick, re.S).findall(html)
    return href



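def reasonCode():
    # Request every collected address and count those that do not answer with 200.
    global errorcount
    itemurl = getUrl()
    for item1 in itemurl:
        # Absolute addresses are requested as-is; relative ones are appended to the base url.
        if http in item1:
            sendurl = item1
        else:
            sendurl = url + item1
        try:
            print(sendurl)
            responseurl = urllib2.urlopen(sendurl, timeout=8)
        except urllib2.HTTPError as he:
            responsecode = he.code
            errorcount += 1
        except urllib2.URLError as ue:
            responsecode = ue.reason
            errorcount += 1
        else:
            responsecode = responseurl.getcode()
            if responsecode != 200:
                errorcount += 1
        print(responsecode)
    print(errorcount)

# The original listing never calls reasonCode(); this call is added so the script actually runs.
reasonCode()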
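
The regular expressions in the script above only match when href or src is the first attribute of the tag, and appending a relative address to the base url can produce a doubled slash or mishandle protocol-relative addresses such as //cdn.example.com/x.png. As a rough sketch of one alternative, and not the author's method, the following Python 3 code collects the same tags with the standard-library html.parser module and resolves each address with urllib.parse.urljoin. The names ResourceCollector and collect_urls and the sample HTML are made up for illustration, and the openQuestion onclick targets from the original are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect href/src values from link, a, script and img tags."""

    # Tag name -> attribute that holds the address.
    URL_ATTRS = {"link": "href", "a": "href", "script": "src", "img": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # urljoin keeps absolute addresses unchanged and resolves
                # relative ones against the base address.
                self.urls.append(urljoin(self.base_url, value))


def collect_urls(html, base_url):
    """Return the resolved addresses found in html, in document order."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical page fragment, used only to exercise the sketch.
sample = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script type="text/javascript" src="js/app.js"></script>
</head><body>
<a class="nav" href="https://example.com/about">About</a>
<img alt="logo" src="//cdn.example.com/logo.png">
</body></html>"""

for address in collect_urls(sample, "https://www.szrtc.cn/"):
    print(address)
```

Because the parser reads attributes by name, the order of attributes inside a tag no longer matters, and every returned address is already absolute, so it could be passed straight to the status-code check.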

Result of running the crawler script:
(screenshot of the run)

Error screenshot:

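In fact, this request works when it is pasted into a browser. When Python's urllib2 sends it, however, the request carries Chinese parameters that have not been encoded, and the result is a 400 error.
I tried adding utf-8 in the code, but it had no effect and the error persisted.
I am noting this problem down for now and will look for another solution later.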
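
One approach that may help, sketched here for Python 3 and not verified against the failing request, is to percent-encode the non-ASCII characters in the path and query before sending the request. A browser does this automatically, which would explain why the same address works there. The function name encode_url and the example.com address are hypothetical; in Python 2 the corresponding function is urllib.quote, applied to a UTF-8 encoded byte string.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query of url."""
    parts = urlsplit(url)
    # The safe argument lists characters that must stay literal so that the
    # URL structure (slashes, '=', '&', existing %XX escapes) is preserved.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Hypothetical address with a Chinese query parameter.
print(encode_url("https://example.com/search?keyword=中文&page=1"))
# prints https://example.com/search?keyword=%E4%B8%AD%E6%96%87&page=1
```

quote encodes text as UTF-8 by default, so each Chinese character becomes three %XX escapes. The sketch leaves the host name untouched, so an address with a non-ASCII domain would need separate handling.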
