Scraping Financial Reports from the Shenzhen Stock Exchange with Python

Reposted article, 2021-10-26 10:08. Tags: python / small Python projects

Before scraping, install the pandas and requests libraries. I installed them with pip:

pip install pandas
pip install requests

Once they are installed, we can start scraping.

Open the official website of the Shenzhen Stock Exchange:

　　http://www.szse.cn/

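Under 信息披露 -> 上市公司信息 -> 上市公司公告 (Information Disclosure -> Listed Company Information -> Listed Company Announcements) you can find the financial reports of every listed company, and you can add filters. This time I scraped all semi-annual reports published between 2020-01-01 and 2020-12-31. Press F12 to open the developer tools (I use Chrome), then enter the filter conditions.

Click 查詢 (Query) and the developer tools show the request the browser sends to the server; open it (it was marked with a red line in the original screenshot). The request carries a Payload with fields such as seDate, and after a few tries some patterns emerge: seDate holds the date range of the filter, and bigCategoryId holds the report category (annual report, quarterly report, and so on). Semi-annual reports are '010303'. Download any report and the developer tools show its download link; compare a few and you will see that they share a common prefix: 'http://disc.static.szse.cn/download/'.

From here, a simple program is all we need.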
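
The two observations above, the shape of the Payload and the shared download prefix, can be sketched as two small helpers. The names `build_payload` and `build_download_url` and the sample path are made up for illustration; the field names, the "fixed_disc" channel code and the page size of 30 are the values used in the request code in this article.

```python
import json

URL_HEAD = 'http://disc.static.szse.cn/download/'  # common prefix noted above


def build_payload(start_date, end_date, category_id, page_num, page_size=30):
    """Return the request body for one page of announcements (hypothetical helper)."""
    return {
        "seDate": [start_date, end_date],
        "channelCode": ["fixed_disc"],
        "bigCategoryId": [category_id],
        "pageSize": page_size,
        "pageNum": int(page_num),
    }


def build_download_url(attach_path):
    """Plain concatenation of prefix and path, as in the article's code."""
    return URL_HEAD + attach_path


payload = build_payload("2020-01-01", "2020-12-31", "010303", 1)
print(json.dumps(payload))
print(build_download_url('example/report.pdf'))  # made-up path, not a real report
```

Keeping the payload in one function makes it easy to try other date ranges or categories without touching the request code.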

import json
import random
import requests

# Scraping function; the argument is the page number to fetch
def get_pdf_address(pageNum):
    url = 'http://www.szse.cn/api/disc/announcement/annList?random=%s' % random.random()

    # Request headers copied from the browser request
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Content-Type': 'application/json',
        'Host': 'www.szse.cn',
        'Origin': 'http://www.szse.cn',
        'Proxy-Connection': 'close',
        'Referer': 'http://www.szse.cn/disclosure/listed/fixed/index.html',
        'User-Agent': 'Mozilla/5.0(Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/74.0.3729.169 Safari/537.36',
        'X-Request-Type': 'ajax',
        'X-Requested-With': 'XMLHttpRequest',
    }

    pagenum = int(pageNum)
    payload = {"seDate": ["2020-01-01", "2020-12-31"], "channelCode": ["fixed_disc"], "bigCategoryId": ["010303"], "pageSize": 30, "pageNum": pagenum}
    response = requests.post(url, headers=headers, data=json.dumps(payload))  # send the payload as JSON
    result = response.json()
    return result

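You can print the returned content to inspect it; I won't include a screenshot here. Next, the data needs tidying. I used pandas, which can write straight to Excel and is very convenient. Rows can be filled in through the DataFrame's .at[] accessor. That way the stock code, company name, download link and other fields end up organized in an Excel file. Here is the complete code.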

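'''
Scrape the download addresses of Shenzhen Stock Exchange financial reports.
Each page holds 30 reports.
Save manually every 10 pages to avoid being detected.
'''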
import requests
import time
import pandas as pd
import random
import os
import json

# Scraping function; the argument is the page number to fetch
def get_pdf_address(pageNum):
    url = 'http://www.szse.cn/api/disc/announcement/annList?random=%s' % random.random()

    # Request headers copied from the browser request
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Content-Type': 'application/json',
        'Host': 'www.szse.cn',
        'Origin': 'http://www.szse.cn',
        'Proxy-Connection': 'close',
        'Referer': 'http://www.szse.cn/disclosure/listed/fixed/index.html',
        'User-Agent': 'Mozilla/5.0(Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/74.0.3729.169 Safari/537.36',
        'X-Request-Type': 'ajax',
        'X-Requested-With': 'XMLHttpRequest',
    }

    pagenum = int(pageNum)
    payload = {"seDate": ["2020-01-01", "2020-12-31"], "channelCode": ["fixed_disc"], "bigCategoryId": ["010303"], "pageSize": 30, "pageNum": pagenum}
    response = requests.post(url, headers=headers, data=json.dumps(payload))  # send the payload as JSON
    result = response.json()
    return result
 
# Create a DataFrame to hold the scraped information
pdf_infor = pd.DataFrame(columns=['secCode', 'secName', 'url', 'title', 'publishTime'])

count = 0  # index of the next row to fill
# Common prefix of every download link
url_head = 'http://disc.static.szse.cn/download/'

# The start page is page_a; I scrape only 10 pages per run, so the end page is page_b = page_a + 10
page_a = 150
page_b = page_a + 10

path_xlsx = 'download_url_' + str(page_a) + '_' + str(page_b-1) + '.xlsx'  # name of the Excel file to save

for i in range(page_a, page_b):

    print("Scraping SZSE report download addresses, page {}".format(i))
    result = get_pdf_address(i)
    for item in result['data']:
        pdf_infor.at[count, 'secCode'] = item['secCode'][0]
        pdf_infor.at[count, 'secName'] = item['secName'][0]
        pdf_infor.at[count, 'url'] = url_head + item['attachPath']
        pdf_infor.at[count, 'title'] = item['title']
        pdf_infor.at[count, 'publishTime'] = item['publishTime']
        count += 1
    print('Page done')
    time.sleep(random.uniform(2, 3))  # throttle the request rate

# Take the first four-digit number in the title as the year
pdf_infor['Year'] = pdf_infor['title'].str.extract('([0-9]{4})', expand=False)
pdf_infor.to_excel(path_xlsx)  # save as Excel

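Change page_a by hand and rerun to collect all the URLs; you could of course write a loop that works through the pages ten at a time automatically. I ran the batches separately so that problems would be easier to isolate and debug. Collecting the URLs went smoothly: the server never rejected a request.

Open any of the Excel files and you will see it packed with URLs. Look closely, though, and you will find not only the full reports but also summaries (摘要) of annual reports, so a little cleanup is needed.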
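
The "ten pages at a time" loop mentioned above could be sketched as follows. This is only an illustration: `page_batches` and `batch_file_name` are hypothetical helpers, and the total of 25 pages is a made-up number; the real page count has to be read off the site.

```python
def page_batches(first_page, last_page, batch_size=10):
    """Yield (page_a, page_b) pairs covering first_page..last_page inclusive.

    page_b is exclusive, matching range(page_a, page_b) in the script above.
    """
    page_a = first_page
    while page_a <= last_page:
        page_b = min(page_a + batch_size, last_page + 1)
        yield page_a, page_b
        page_a = page_b


def batch_file_name(page_a, page_b):
    """Build the same file name pattern the script above uses."""
    return 'download_url_' + str(page_a) + '_' + str(page_b - 1) + '.xlsx'


batches = list(page_batches(1, 25))  # 25 is an assumed page count
for page_a, page_b in batches:
    print(page_a, page_b, batch_file_name(page_a, page_b))
```

Each pair can then drive one run of the scraping loop and one Excel file, which keeps the benefit of small batches that are easy to debug.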

# Remove the summary rows with pandas
# Note: after this step the leading zeros of the stock codes are gone
import pandas as pd

path_xlsx = '1_delete.xlsx'
pdf_infor = pd.read_excel('1.xlsx')

# Drop every row whose title contains '摘要' (summary)
for i in range(pdf_infor.shape[0]):
    title = pdf_infor.at[i, 'title']
    if title.find('摘要') != -1:
        pdf_infor = pdf_infor.drop(i)

pdf_infor.to_excel(path_xlsx)

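After this step all the leading zeros of the stock codes are gone. I fixed the file names in one pass after every report had been downloaded. There are many ways to handle this, and it is not a big problem.

The last step is downloading. The server occasionally rejects a request during downloads, so each download is wrapped in try/except: when one fails, record the index of that file and handle the failures separately once the whole run has finished. Enough talk, here is the code.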
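
As one of those many ways, the code could be padded at the moment the file name is built instead of renaming files afterwards. The sketch below assumes six-digit stock codes (the width the renaming script in this article pads to); `pad_stock_code`, `make_file_name` and the sample company name are hypothetical.

```python
def pad_stock_code(code, width=6):
    """Return the code as a string, left-padded with zeros to `width` digits."""
    return str(code).zfill(width)


def make_file_name(sec_code, sec_name, year):
    """Hypothetical helper: build '<code><name><year>.pdf' with a padded code."""
    clean_name = sec_name.replace("*", "")  # '*' is not allowed in Windows file names
    return "{}{}{}.pdf".format(pad_stock_code(sec_code), clean_name, year)


print(pad_stock_code(2))         # 000002
print(pad_stock_code("300750"))  # 300750
print(make_file_name(2, "*ST Example", 2020))
```

Codes that already have six digits pass through unchanged, so the helper is safe to apply to every row.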

'''
Read the data with pandas and download the files in bulk.
Files that fail to download are recorded in a txt file and downloaded separately later.
'''
import requests
import time
import pandas as pd
import random
import os

file_path = "pdf_1/"
os.makedirs(file_path, exist_ok=True)  # make sure the output directory exists

pdf_infor = pd.read_excel('1.xlsx')

headers = {'Upgrade-Insecure-Requests': '1',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}

# Download the file at f_url
def download(f_url):
    rs = requests.get(f_url, headers=headers)
    rs.raise_for_status()  # treat a rejected request (HTTP 4xx/5xx) as a failure
    return rs

for each in range(pdf_infor.shape[0]):
    Stkcd = pdf_infor.at[each, 'secCode']
    firm_name = pdf_infor.at[each, 'secName'].replace("*", "")
    Year = pdf_infor.at[each, 'Year']
    pdf_url = pdf_infor.at[each, 'url']

    file_name = "{}{}{}.pdf".format(Stkcd, firm_name, Year)
    file_full_name = os.path.join(file_path, file_name)
    print("Downloading the {} report of {} (stock code {})".format(Year, firm_name, Stkcd))

    try:
        time.sleep(random.uniform(4, 6))  # throttle the request rate
        rs = download(pdf_url)
        with open(file_full_name, "wb") as fp:
            for chunk in rs.iter_content(chunk_size=10240):
                if chunk:
                    fp.write(chunk)
        print("=================== Download complete ==========================")
    except Exception as e:
        with open('log.txt', "a") as f:
            f.write(str(each) + '\n')  # the URL could be written to the error log instead
        print(e)
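
The script above appends the row index of every failed download to log.txt, one per line. A minimal sketch of the "handle them separately afterwards" step could start by reading those indices back; `read_failed_indices` is a hypothetical helper, and the demo writes a throwaway log file so that it runs on its own.

```python
import os
import tempfile


def read_failed_indices(log_path):
    """Return the sorted, de-duplicated row indices recorded in the log file."""
    indices = set()
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if line.isdigit():  # skip blank or malformed lines
                indices.add(int(line))
    return sorted(indices)


# Demo with a temporary log file instead of the real log.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, 'log.txt')
    with open(demo_log, 'w') as f:
        f.write('12\n3\n\n12\n')
    failed = read_failed_indices(demo_log)

print(failed)  # [3, 12]
```

The returned indices can be fed back into the download loop in place of range(pdf_infor.shape[0]) to retry only the failed rows.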

Finally, take a look at the result. The only remaining problem is the stock codes, so I simply padded all of them. The code is as follows:

# Removing the summary URLs from the Excel file dropped the leading zeros of every stock code; restore them here
import os
import re

pdf_dir = 'test'  # directory holding the PDFs

# Pad the stock code with leading zeros to six digits
def fix_name(name):
    return name.zfill(6)

for file in os.listdir(pdf_dir):
    # The digits at the start of the name are the stock code; the year comes later
    match = re.match(r'\d+', file)
    if match is None:
        continue  # skip files that do not start with a stock code

    stock_num_wrong = match.group()
    rest_file = file[len(stock_num_wrong):]  # file name without the stock code

    file_name_new = fix_name(stock_num_wrong) + rest_file
    os.rename(os.path.join(pdf_dir, file), os.path.join(pdf_dir, file_name_new))
    print(rest_file)

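With the zeros padded, the job is done. There is plenty here that could be consolidated and improved, but I was too lazy; the data is there, so this will do!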
