Scraping Bilibili titles, subtitles, and URLs: an example (requests + re)

Reposted article · 2017-09-27 21:43 · 1795 · crawler / Bilibili / python crawl / crawl

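# coding: utf-8
__author__ = "zhoumi"

import re
import urllib  # not used yet; intended for the download step later on

import requests

'''
This script aims to build:
1. A dict mapping each top-level category to its link:
    dictinfo = {top-level category: link}
2. A dict mapping each second-level category to its link:
    dict2info = {second-level category: link}
3. A dict mapping each top-level category to its second-level categories:
    dict3info = {top-level category: [second-level categories]}
'''
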
# Fetch the page to be parsed.
# raise_for_status() raises an exception if the request was unsuccessful.
def getText(url):
    source = requests.get(url)
    source.raise_for_status()
    source.encoding = source.apparent_encoding
    return source.text

# Return a dict of top-level category names (keys) and their links (values):
# dictinfo = {name: link}
# e.g. 動畫: www.bilibili.donghua.com, ...
def getfirsttitle(source):
    text = re.findall(r'a class.*?div class', source)
    namelist = []
    htmllist = []
    dictinfo = {}
    for i in text:
        namelist.append(i.split("><em>")[1].split("</em>")[0])
        htmllist.append(i.split('href="//')[1].split('"><em>')[0])
    # Note: range(len - 1) means the final match is not stored.
    for i in range(len(namelist) - 1):
        dictinfo[namelist[i]] = htmllist[i]
    return dictinfo

# Return a dict of second-level category names (keys) and their links (values):
# dict2info = {name: link}
def getsecondtitle(source):
    text2 = re.findall(r'a href.*?<em></em></b></a></li>', source)
    name2list = []
    html2list = []
    dict2info = {}
    for i in text2:
        name2list.append(i.split('><b>')[1].split('<em>')[0])
        html2list.append(i.split('a href="//')[1].split('"><b>')[0])
    # Note: range(len - 1) means the final match is not stored.
    for i in range(len(name2list) - 1):
        dict2info[name2list[i]] = html2list[i]
    return dict2info

# Return a dict mapping each top-level category name to its second-level names:
# dict3info = {top-level name: [second-level names]}
def getfirst2second(source):
    text3 = re.findall(r'"m-i".*?</ul', source, re.S)
    dict3info = {}
    for i in text3:
        parts = i.split('><b>')
        # The first chunk holds the top-level title.
        title = parts[0].split('</em>')[0].split('<em>')[1]
        # Every remaining chunk starts with one second-level title.
        dict3info[title] = [k.split('<em>')[0] for k in parts[1:]]
    return dict3info


# ----------------------------------------------

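## Feed in the dict {second-level category name: urls2}; the plan is to use the urllib library
'''
url is the url2 stored in the dict_2_url2 dict.
This part aims to fetch the source video links and video titles from each
second-level category page and build the final dict {source_name: source_url}.

url = dict_2_urls.values()
'''
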
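def gettext(url):
    source = requests.get(url)
    source.raise_for_status()
    source.encoding = source.apparent_encoding
    return source.text

def download(source):
    # re.findall returns a list of matches, so capture the URL with a group
    # instead of calling split() on the list (which raises AttributeError).
    # The original pattern '<video> src=' also closed the tag before its attribute.
    html = re.findall(r'<video src="(blob:.*?)"></video>', source)
    # The actual download is still unimplemented; return the URLs found so far.
    return html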
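
To make the extraction step in getfirsttitle easier to follow, here is a minimal sketch that does the same job with regex capture groups instead of chained split() calls. The HTML fragment, the example.com links, and the name parse_first_titles are all made up for illustration; the real Bilibili markup may differ. Unlike the loops above, this version keeps every match.

```python
import re

# Hypothetical fragment shaped like what getfirsttitle expects; not real Bilibili HTML.
SAMPLE = (
    '<li><a class="nav" href="//www.example.com/douga"><em>Douga</em></a>'
    '<div class="sub"></div></li>'
    '<li><a class="nav" href="//www.example.com/music"><em>Music</em></a>'
    '<div class="sub"></div></li>'
)


def parse_first_titles(source):
    """Return {category name: link} using capture groups instead of split()."""
    # Group 1 is the link after 'href="//', group 2 is the text inside <em>...</em>.
    pattern = r'a class[^>]*?href="//([^"]+)"><em>(.*?)</em>'
    return {name: link for link, name in re.findall(pattern, source)}


first_titles = parse_first_titles(SAMPLE)
print(first_titles)
```

With capture groups, the regex that finds a match also pulls out the pieces, so there is no second pass of string splitting to keep in sync with the pattern.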

This is code I hacked together over the past two days; the function and variable names are poorly chosen.

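At first, after fetching the page with requests.get(url), I did not understand why the call source.raise_for_status() was needed. I later understood that it handles errors for the request sent to the URL (it raises an exception when the response carries an error status code), though I am still unclear on how that exception should be dealt with afterwards.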
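As a rough illustration of what raise_for_status() does, the sketch below builds a requests.Response by hand so that it runs without network access; in real use the object would come from requests.get(url). The helper name status_error is made up for this example.

```python
import requests


def status_error(status_code):
    """Return the HTTPError raised for a status code, or None if none is raised."""
    # Built by hand so the sketch runs offline; normally returned by requests.get(url).
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.HTTPError as err:
        return err
    return None


print(status_error(200))  # success status: nothing is raised
print(status_error(404))  # client error: an HTTPError is raised
```

Calling raise_for_status() right after the request means a failed fetch stops the script immediately instead of passing an error page on to the parsing functions.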

I have not dug into how source.encoding = source.apparent_encoding works either; that understanding will have to build up over time.
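A small illustration of why the encoding assignment matters, as I read the requests documentation: the text attribute is decoded with encoding, which requests guesses from the HTTP headers, while apparent_encoding is a guess made from the body bytes. The sketch below uses only the standard library and a made-up byte string; the ISO-8859-1 fallback stands in for a typical wrong guess and is an assumption, not something measured on Bilibili.

```python
# The same bytes decoded two ways: only the right codec recovers the text.
raw = "動畫".encode("utf-8")        # bytes as a server might send them

wrong = raw.decode("iso-8859-1")    # a header-based fallback guess: garbled output
right = raw.decode("utf-8")         # a content-based guess: the original text

print(repr(wrong))
print(right)
```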

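As a third-party library, requests offers a lot of convenience, but after a few days of study I feel it is not ideal for beginners: the lower-level mechanics are the real fundamentals, so I think I need to study the urllib module properly.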

Next, I plan to try the urllib module for handling the downloaded text, for example the urlretrieve function and urllib.request.urlopen.
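A minimal sketch of those two urllib calls follows. To keep it runnable without network access it points at a temporary local file through a file:// URL; the file names and contents are made up, and an http(s) URL would be the assumed real use.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<title>demo</title>", encoding="utf-8")
    url = page.as_uri()  # file:// stand-in for a real http(s) URL

    # urlopen returns a response object; read() yields raw bytes.
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8")

    # urlretrieve copies the resource straight into a local file.
    target = str(Path(tmp) / "copy.html")
    saved_path, _headers = urllib.request.urlretrieve(url, target)
    saved = Path(saved_path).read_text(encoding="utf-8")

print(html)
```

urlopen suits cases where the content is parsed in memory, while urlretrieve is handier when the goal is simply to save a file to disk.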

I also ran into a problem: when I tried to use the video links in the dictionary to download Bilibili videos, the output looked like this:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5}{\x93\x1bE\xb2\xef\xdf8\xe2|\x87^\xb1\xc1\x8c\x03\xeb9\x9a\x97\xf1\x0c\x07\x0c\xdcC\x1cX\xd8\xc5

My source code is:

import urllib.request
import urllib.parse

def gettext(url):
    source = urllib.request.urlopen(url, timeout=30)
    # read() returns the raw response body as bytes, not decoded text
    return source.read()

url = 'https://www.bilibili.com/video/av11138658/'
text = gettext(url)
print(text)

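I could not figure it out and ended up attributing it to Bilibili encrypting its videos; as a beginner less than a month in, I am not yet able to solve this. (The leading bytes \x1f\x8b\x08 are the gzip header, though, so the more likely explanation is that the response body is gzip-compressed rather than encrypted, and urllib does not decompress it automatically.)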
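The printed bytes start with \x1f\x8b, which is the gzip magic number, so one thing worth trying is decompressing the body before decoding it. The sketch below is an assumption about the cause, not a verified fix for Bilibili: it builds a gzip-compressed page locally as a stand-in for source.read(), and the helper name decode_body and the UTF-8 default are made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def decode_body(body, encoding="utf-8"):
    """Decompress the body if it looks like gzip, then decode it to str."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return body.decode(encoding)


# Stand-in for source.read(): a gzip-compressed page built locally.
compressed = gzip.compress("<title>demo</title>".encode("utf-8"))
print(compressed[:3])
print(decode_body(compressed))
```

Checking the magic number first keeps the helper safe for responses that arrive uncompressed.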
