Web Crawling: Common Python Crawler Libraries


1. Common Libraries

1. requests: used for making HTTP requests.

  requests.get("url")

2. selenium: used for browser automation.

3. lxml: fast HTML/XML parsing.

4. beautifulsoup: HTML parsing.

5. pyquery: a web page parsing library, reputedly easier to use than BeautifulSoup, with syntax very close to jQuery (see the short sketch after this list).

6. pymysql: a storage library, for working with MySQL databases.

7. pymongo: for working with the MongoDB database.

8. redis: client for the Redis non-relational database.

9. jupyter: a notebook that runs in the browser.
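A minimal sketch of the jQuery-like selector syntax claimed above; the HTML snippet is made up purely for illustration:

  from pyquery import PyQuery as pq

  doc = pq('<div><p class="title">Hello</p><p>World</p></div>')
  print(doc('p.title').text())  # prints: Hello (CSS selectors, as in jQuery)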

2. What is urllib?

Python's built-in HTTP request library. It is part of the standard library, so nothing needs to be installed, and learning it makes the more convenient requests library easier to understand later. Its modules:

urllib.request: request module, simulates a browser sending a request

urllib.error: exception handling module

urllib.parse: URL parsing module, with utilities for splitting and joining URLs

urllib.robotparser: robots.txt parsing module (see the sketch right after this list)
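urllib.robotparser is not demonstrated anywhere else in these notes, so here is a minimal sketch; it assumes the target site serves a robots.txt at the usual path:

  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url('http://www.baidu.com/robots.txt')
  rp.read()  # fetch and parse robots.txt
  # can_fetch(useragent, url): may this user agent crawl this URL?
  print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))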

 

Differences between Python 2 and Python 3

Python2

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

 

Python3

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

Usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example 1:

  import urllib.request

  response = urllib.request.urlopen('http://www.baidu.com')
  print(response.read().decode('utf-8'))

 

Example 2:

  import urllib.request
  import urllib.parse

  data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
  response = urllib.request.urlopen('http://httpbin.org/post', data=data)
  print(response.read())

Note: passing data sends a POST request; omitting it sends a GET.

 

Example 3: timeout test

  import urllib.request

  response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
  print(response.read())  # completes normally within 1 second

  import socket
  import urllib.request
  import urllib.error

  try:
      response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
  except urllib.error.URLError as e:
      if isinstance(e.reason, socket.timeout):
          print('TIME OUT')

This prints TIME OUT, because 0.1 seconds is too short for the request to finish.

 

Response

Response type:

  import urllib.request

  response = urllib.request.urlopen('https://www.python.org')
  print(type(response))

Output: <class 'http.client.HTTPResponse'>

 

     

Status code and response headers:

  import urllib.request

  response = urllib.request.urlopen('http://www.python.org')
  print(response.status)               # 200 on success
  print(response.getheaders())         # all response headers
  print(response.getheader('Server'))  # a single response header

 

3. Request: allows adding headers

Declare a Request object, which can carry headers and other information, then open it with urlopen:

  import urllib.request

  request = urllib.request.Request('https://python.org')
  response = urllib.request.urlopen(request)
  print(response.read().decode('utf-8'))

 

 

Example:

  from urllib import request, parse

  url = 'http://httpbin.org/post'
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
      'Host': 'httpbin.org'
  }
  dict = {
      'name': 'Germey'
  }

  data = bytes(parse.urlencode(dict), encoding='utf8')
  req = request.Request(url=url, data=data, headers=headers, method='POST')
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))
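Headers can also be attached after the Request object has been constructed, using add_header(key, value):

  from urllib import request, parse

  url = 'http://httpbin.org/post'
  data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
  req = request.Request(url=url, data=data, method='POST')
  req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36')
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))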

 

 

4. Proxies

Handlers let urllib deal with more complex scenarios; ProxyHandler routes requests through a proxy:

  import urllib.request

  proxy_handler = urllib.request.ProxyHandler({
      'http': 'http://127.0.0.1:9743',   # assumes a proxy listening locally on port 9743
      'https': 'http://127.0.0.1:9743',
  })
  opener = urllib.request.build_opener(proxy_handler)
  response = opener.open('http://httpbin.org/get')
  print(response.read())
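If plain urllib.request.urlopen() calls should also go through the proxy, the opener can be installed as the global default; install_opener is a standard urllib.request facility not shown in the original notes:

  urllib.request.install_opener(opener)
  response = urllib.request.urlopen('http://httpbin.org/get')  # now proxied as well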

 

 

5. Cookies

Cookies are stored on the client to record the user's identity and keep login state:

  import http.cookiejar, urllib.request

  cookie = http.cookiejar.CookieJar()
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  for item in cookie:
      print(item.name + "=" + item.value)

 

First way to save cookies (Mozilla/Netscape file format):

  import http.cookiejar, urllib.request

  filename = 'cookie.txt'
  cookie = http.cookiejar.MozillaCookieJar(filename)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  cookie.save(ignore_discard=True, ignore_expires=True)

 

Second way to save cookies (LWP file format):

  import http.cookiejar, urllib.request

  filename = 'cookie.txt'
  cookie = http.cookiejar.LWPCookieJar(filename)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies back (use the CookieJar class that matches the saved format):

  import http.cookiejar, urllib.request

  cookie = http.cookiejar.LWPCookieJar()
  cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  print(response.read().decode('utf-8'))
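Symmetrically, a file written by MozillaCookieJar has to be loaded with MozillaCookieJar; a sketch of just the load step:

  import http.cookiejar

  cookie = http.cookiejar.MozillaCookieJar()
  cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)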

 

 

6. Exception Handling

Catching exceptions keeps the crawler running reliably.

Example 1:

  from urllib import request, error

  try:
      response = request.urlopen('http://cuiqingcai.com/index.htm')
  except error.URLError as e:
      print(e.reason)  # report why the URL failed

Example 2 (catch the more specific HTTPError first, then URLError):

  from urllib import request, error

  try:
      response = request.urlopen('http://cuiqingcai.com/index.htm')
  except error.HTTPError as e:
      print(e.reason, e.code, e.headers, sep='\n')
  except error.URLError as e:
      print(e.reason)
  else:
      print('Request Successfully')

 

 

7. URL Parsing

urllib.parse is mainly a utility module, useful for building and splitting the URLs a crawler works with.

urlparse: splits a URL into its components

  urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

  

Example 1:

  from urllib.parse import urlparse  # split a URL

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
  print(type(result), result)

Result:

  <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 2 (no scheme in the URL; the scheme argument supplies a default):

  from urllib.parse import urlparse

  result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
  print(result)

Result (without a leading //, everything before the query is treated as the path):

  ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

 

Example 3 (the scheme inside the URL takes priority over the scheme argument):

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
  print(result)

Result:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 4 (allow_fragments=False folds the fragment into the query):

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
  print(result)

Result:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

 

Example 5 (no query present, so the fragment is folded into the path):

  from urllib.parse import urlparse

  result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
  print(result)

Result:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

 

 

8. URL Joining

urlunparse: the reverse of urlparse

Example:

  from urllib.parse import urlunparse

  data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
  print(urlunparse(data))

Result:

  http://www.baidu.com/index.html;user?a=6#comment

 

urljoin: joins a base URL with a second (possibly relative) URL

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com', 'FAQ.html'))

Result:

  http://www.baidu.com/FAQ.html

Fields present in the second URL override the corresponding fields of the first.
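Two more calls illustrating the override rule, following standard urljoin behavior:

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
  # https://cuiqingcai.com/FAQ.html  (the second URL is complete, so it wins)
  print(urljoin('http://www.baidu.com/about.html', 'http://cuiqingcai.com/FAQ.html'))
  # http://cuiqingcai.com/FAQ.html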

 

urlencode: converts a dict into a GET query string

  from urllib.parse import urlencode

  params = {
      'name': 'gemey',
      'age': 22
  }
  base_url = 'http://www.baidu.com?'
  url = base_url + urlencode(params)
  print(url)

Result:

  http://www.baidu.com?name=gemey&age=22
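The reverse direction also exists in urllib.parse: parse_qs turns a query string back into a dict, with each value wrapped in a list:

  from urllib.parse import parse_qs

  print(parse_qs('name=gemey&age=22'))
  # {'name': ['gemey'], 'age': ['22']}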

 

 




Author: hoptop
Link: https://www.jianshu.com/p/cfbdacbeac6e
Source: Jianshu
Copyright belongs to the author. Contact the author for commercial reuse; credit the source for non-commercial reuse.

