Learning Python Web Crawlers with a Beginner: Example 3

Reposted article, 2017-10-12 21:00. Tags: web crawler / python / study notes

Example 3: Targeted crawler for stock data

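The program is structured as follows:

  1. Fetch the list of stock codes from a website (requests library, re library).

  2. Iterate over each stock and fetch its details from a stock information site.

  3. Collect the details in a dictionary and write it to a text file.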
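
As a small illustration of step 1, the regular expression used in the listing can be tried on a few link targets. The href values below are made up for the demonstration and are not taken from the real site; the helper name extract_codes is also hypothetical.

```python
import re

# Hypothetical href values, shaped like links on a stock list page
sample_hrefs = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center/index.html',
]

# 's', then 'h' or 'z', followed by six digits
STOCK_CODE = re.compile(r's[hz]\d{6}')


def extract_codes(hrefs):
    """Return the first stock code found in each href, skipping hrefs without one."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match:
            codes.append(match.group(0))
    return codes


print(extract_codes(sample_hrefs))  # ['sh600000', 'sz000001']
```

The third link has no code in it, so it is dropped without raising an exception.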

The code is as follows:

# Targeted crawler for stock data
"""
Created on Thu Oct 12 16:12:48 2017

@author: DONG LONG RUI
"""
import re

import requests
from bs4 import BeautifulSoup


def getHTMLText(url, code='utf-8'):
    """Return the page text decoded with `code` (default 'utf-8'), or '' on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # r.encoding = r.apparent_encoding
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ''


def getStockList(lst, stockURL):
    """Append every stock code found in the page's links to lst."""
    html = getHTMLText(stockURL, 'GB2312')
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        # Codes look like sh600000 or sz000001; links without one are skipped
        match = re.search(r's[hz]\d{6}', a.get('href', ''))
        if match:
            lst.append(match.group(0))


def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page and append its details to fpath, one dict per line."""
    for count, stock in enumerate(lst, start=1):
        html = getHTMLText(stockURL + stock + '.html')
        try:
            if html:
                infoDict = {}
                soup = BeautifulSoup(html, 'html.parser')
                stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

                name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
                # Split on whitespace; the first token is the stock name.
                # The key '股票名稱' means "stock name".
                infoDict['股票名稱'] = name.text.split()[0]

                # Each <dt> holds a field name and the matching <dd> its value
                keyList = stockInfo.find_all('dt')
                valueList = stockInfo.find_all('dd')
                for key, val in zip(keyList, valueList):
                    infoDict[key.text] = val.text

                with open(fpath, 'a', encoding='UTF-8') as f:
                    f.write(str(infoDict) + '\n')
        except (AttributeError, IndexError):
            # The page lacks the expected elements; skip this stock
            pass
        # '\r' returns the cursor to the start of the line, so the progress
        # figure is updated in place instead of printing a new line each time
        print('\rProgress: {:.2f}%'.format(count * 100 / len(lst)), end='')


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'C:/Users/DONG LONG RUI/.spyder-py3/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


if __name__ == '__main__':
    main()
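
Because the listing writes the str() of a dictionary on each line of the output file, the records can be read back with ast.literal_eval from the standard library. The sketch below assumes that one-dict-per-line format; the helper name load_stock_info and the sample record are hypothetical.

```python
import ast
import os
import tempfile


def load_stock_info(fpath):
    """Parse a file holding one str(dict) per line into a list of dicts."""
    records = []
    with open(fpath, encoding='UTF-8') as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval only accepts Python literals, unlike eval
                records.append(ast.literal_eval(line))
    return records


# Round trip with a made-up record, written the same way as in the listing
sample = {'name': 'DemoStock', 'open': '10.00'}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')
    with open(path, 'a', encoding='UTF-8') as f:
        f.write(str(sample) + '\n')
    loaded = load_stock_info(path)

print(loaded[0]['open'])  # 10.00
```

If the data needs to be consumed by other tools, writing each record with json.dumps instead of str() would be a common alternative, at the cost of changing the file format used here.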

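Because this requests-based crawler fetches the pages one at a time, it ran rather slowly for me; a Scrapy crawler would be worth trying as a follow-up.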
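
Short of moving to Scrapy, one intermediate option, which is not what the code in this post does, is to keep requests and overlap the network waits with a thread pool from the standard library. The sketch below only shows the pattern: fetch_page is a hypothetical stand-in that sleeps instead of calling requests.get, so it runs without network access, and the stock codes are sample values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(stock):
    """Hypothetical stand-in for a network request; sleeps to mimic latency."""
    time.sleep(0.1)
    return stock, '<html>page for {}</html>'.format(stock)


def fetch_all(stocks, max_workers=8):
    """Fetch all pages concurrently and return a {stock: html} dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input
        return dict(pool.map(fetch_page, stocks))


stocks = ['sh600000', 'sz000001', 'sz000002', 'sh600004']
start = time.perf_counter()
pages = fetch_all(stocks)
elapsed = time.perf_counter() - start
print(len(pages), 'pages in {:.2f}s'.format(elapsed))
```

With real requests the worker count should stay modest, since many parallel connections put more load on the target site.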

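It also occurred to me that both BeautifulSoup from bs4 and the re library can search HTML for target information, but the two are usually used together:

  First use BeautifulSoup to find the specific tags that hold the target information, then use a regular expression to match within the content of those tags.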
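
A minimal sketch of that combination: BeautifulSoup narrows the document to one tag, and a regular expression then pulls the values out of that tag's text. The HTML fragment, its class names and the numbers are invented for the example.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment; not copied from any real page
html = '''
<div class="nav">Index 3000.12</div>
<div class="quote">Open: 10.25 High: 10.80 Low: 10.10</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Step 1: BeautifulSoup locates the tag that holds the target information
quote = soup.find('div', attrs={'class': 'quote'})

# Step 2: the regular expression matches only inside that tag's text
prices = re.findall(r'\d+\.\d+', quote.text)
print(prices)  # ['10.25', '10.80', '10.10']
```

Running the same pattern over the whole page would also pick up 3000.12 from the first div, which is why the tag is located first.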
