Python urllib2

Reposted article. 2012-03-14 14:04. Tags: python, urllib2

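When a URL contains Chinese characters, it must be URL-encoded before the HTTP request is sent; otherwise the request is likely to go wrong. In Python, urllib2.quote(URL) encodes and urllib2.unquote(URL) decodes, with one caveat: the URL string must not be a unicode object. Convert it to a byte string in a suitable encoding first, such as utf-8 or gb2312.

Python converts between encodings in this order: source encoding -> internal encoding -> target encoding. The internal encoding is unicode.

gb = "中國"  # byte string, originally gb2312-encoded
uni = unicode(gb, 'gb2312')  # decode gb2312 into the internal unicode form
utf = uni.encode('utf-8')  # encode unicode into the target utf-8 encoding

wxPython text boxes return unicode by default, so convert the text as follows before passing it to urllib.quote:

# URL holds the unicode text read from the wxPython text box
URL = URL.encode('utf-8')  # convert unicode to utf-8
URL = urllib2.quote(URL)  # URL-encode it for use in an HTTP request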
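As a rough illustration of the same encode-then-quote flow, here is a minimal sketch written for Python 3, where these functions live in urllib.parse rather than urllib2. That is an assumption about the reader's interpreter, since the article itself targets Python 2. The sample string 中国 and the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

text = "中国"  # illustrative sample; a Python 3 str is already unicode

# Encode to bytes in the target encoding first, then percent-encode the bytes.
utf8_quoted = quote(text.encode("utf-8"))
gb_quoted = quote(text.encode("gb2312"))

# Decoding reverses both steps; the encoding must match the one used above.
utf8_back = unquote(utf8_quoted, encoding="utf-8")
gb_back = unquote(gb_quoted, encoding="gb2312")

print(utf8_quoted)
print(gb_quoted)
```

The same text produces different percent-encoded forms depending on the byte encoding, so which one to send depends on what the server expects.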

For me, if urllib2 is not the most-used Python module, it is certainly in the top three. Here is the most common usage:
Python:

import urllib2
req = urllib2.Request("http://www.g.cn")

res = urllib2.urlopen( req )
html = res.read()
res.close()

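This fetches a page through urllib2. You can also call urllib2.urlopen("http://www.g.cn") directly; the advantage of opening through a Request object is that it is easy to add HTTP request headers to the Request.
Python: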
import urllib2
req = urllib2.Request("http://www.g.cn")
req.add_header( "Cookie" , "aaa=bbbb" ) # add_header makes it easy to add a request header
req.add_header( "Referer", "http://www.fayaa.com/code/new/" )
res = urllib2.urlopen( req )
html = res.read()
res.close()

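headers is initialized to {} and each header is stored under its name, so calling req.add_header( "Cookie" , "aaa=bbbb" ) twice in a row makes the later value overwrite the earlier one.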
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False):
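The overwrite behaviour can be checked without touching the network. The sketch below assumes Python 3, where Request lives in urllib.request; the second cookie value "ccc=dddd" is a made-up example, not something from the original article.

```python
from urllib.request import Request

req = Request("http://www.g.cn")
req.add_header("Cookie", "aaa=bbbb")
req.add_header("Cookie", "ccc=dddd")  # hypothetical second value; replaces the first
req.add_header("Referer", "http://www.fayaa.com/code/new/")

# Only one Cookie entry survives because headers are kept in a dict.
cookie_value = req.get_header("Cookie")
print(cookie_value)
print(req.header_items())
```

To send several cookies, one option is to join them into a single header value such as "aaa=bbbb; ccc=dddd".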

When res = urllib2.urlopen( req ) runs, this is the code behind it:

_opener = None
def urlopen(url, data=None):
    global _opener
    if _opener is None:
        _opener = build_opener()
    return _opener.open(url, data)
In _opener = build_opener(), _opener is a global variable. The first call assigns it the result of build_opener(); every later call reuses the value stored in that global.
def build_opener(*handlers):
    """Create an opener object from a list of handlers.

    The opener will use several default handlers, including support
    for HTTP and FTP.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, types.ClassType) or hasattr(obj, "__bases__")

    opener = OpenerDirector()
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = []
    for klass in default_classes:
        for check in handlers:
            if isclass(check):
                if issubclass(check, klass):
                    skip.append(klass)
            elif isinstance(check, klass):
                skip.append(klass)
    for klass in skip:
        default_classes.remove(klass)

    for klass in default_classes:
        opener.add_handler(klass())

    for h in handlers:
        if isclass(h):
            h = h()
        opener.add_handler(h)
    return opener

This shows the default handlers:
ProxyHandler: proxy server handling
UnknownHandler
HTTPHandler: HTTP protocol handling
HTTPDefaultErrorHandler
HTTPRedirectHandler: HTTP redirect handling
FTPHandler: FTP handling
FileHandler: file handling
HTTPErrorProcessor
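The docstring's rule that a subclass of a default handler replaces that default can be observed directly. This is a minimal sketch assuming Python 3's urllib.request, whose build_opener follows the same logic. QuietHTTPHandler is a hypothetical class invented for the demonstration, and opener.handlers is an implementation attribute of OpenerDirector in CPython rather than documented API.

```python
import urllib.request


class QuietHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical subclass; it changes nothing and only marks the replacement."""


# Passing the subclass makes build_opener skip the stock HTTPHandler.
opener = urllib.request.build_opener(QuietHTTPHandler)
handler_types = [type(h) for h in opener.handlers]

print(QuietHTTPHandler in handler_types)
print(urllib.request.HTTPHandler in handler_types)
```

The other defaults, such as HTTPRedirectHandler, stay in place because nothing passed in is a subclass of them.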

We can also add our own handlers:

import cookielib
import urllib2

cookie = cookielib.CookieJar()
# urllib2.HTTPCookieProcessor(cookie) is the handler that deals with cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

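Once it is added, the Set-Cookie headers of every response are recorded in the cookie object, and those cookies are attached to later requests.
urllib2.install_opener(opener) replaces the global _opener in urllib2 with the opener we built.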
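For readers on Python 3, the same setup can be sketched with http.cookiejar and urllib.request, which are the renamed successors of cookielib and urllib2. The snippet makes no network request, so it only shows the wiring; the variable names are illustrative.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)  # urlopen() now goes through this opener

# No request has been made yet, so the jar starts empty.
stored = len(cookie_jar)
has_cookie_handler = any(
    isinstance(h, urllib.request.HTTPCookieProcessor) for h in opener.handlers
)
print(stored, has_cookie_handler)
```

After a real response with a Set-Cookie header passes through this opener, iterating over cookie_jar should list the stored cookies.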

For example, the first request:

connect: (www.google.cn, 80)
send: 'GET /webhp?source=g_cn HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.cn\r\nConnection: close\r\nUser-Agent: Python-urllib/2.5\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private, max-age=0
header: Date: Sun, 21 Dec 2008 13:47:39 GMT
header: Expires: -1
header: Content-Type: text/html; charset=GB2312
header: Set-Cookie: PREF=ID=5d750b6ffc3d7d04:NW=1:TM=1229867259:LM=1229867259:S=XKoaKmsjYO_-CsHE; expires=Tue, 21-Dec-2010 13:47:39 GMT; path=/; domain=.google.cn
header: Server: gws
header: Transfer-Encoding: chunked
header: Connection: Close

The second request then carries that cookie (and any others that were stored):

Cookie: PREF=ID=5d750b6ffc3d7d04:NW=1:TM=1229867259:LM=1229867259:S=XKoaKmsjYO_-CsHE

connect: (www.google.cn, 80)
send: 'GET /webhp?source=g_cn HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.cn\r\nCookie: PREF=ID=5d750b6ffc3d7d04:NW=1:TM=1229867259:LM=1229867259:S=XKoaKmsjYO_-CsHE\r\nConnection: close\r\nUser-Agent: Python-urllib/2.5\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private, max-age=0
header: Date: Sun, 21 Dec 2008 13:47:41 GMT
header: Expires: -1
header: Content-Type: text/html; charset=GB2312
header: Server: gws
header: Transfer-Encoding: chunked
header: Connection: Close

To enable debugging in urllib, you can use:

>>> import httplib
>>> httplib.HTTPConnection.debuglevel = 1
>>> import urllib

This has no effect in urllib2, though, and I did not find a clean way to enable it there, because:

class AbstractHTTPHandler(BaseHandler):

    def __init__(self, debuglevel=0):
        self._debuglevel = debuglevel

which sets the debug level to 0 by default.

This is how I got it working:

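import cookielib
import urllib2

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
# handle_open["http"][0] is the first handler registered for http (normally HTTPHandler when no proxy is configured)
opener.handle_open["http"][0].set_http_debuglevel(1)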
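Since the handler constructor shown earlier accepts a debuglevel argument, another route is to build the HTTP handler yourself and pass it to build_opener. The sketch below assumes Python 3's urllib.request and is not the approach the article uses. It only checks the wiring and sends no request.

```python
import urllib.request

debug_handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(debug_handler)

# The default HTTPHandler is skipped because an instance was supplied,
# so the only HTTP handler in the opener is the debugging one.
http_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.HTTPHandler)
]
print(len(http_handlers), http_handlers[0] is debug_handler)
```

This avoids indexing into handle_open, whose first entry for http may be a proxy handler when a proxy is configured.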
