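I. Topic Background
Why was this topic chosen? What is the data analysis expected to achieve?
As the internet enters the big-data era, people have more and more ways to obtain information, and financial information is closely tied to everyday life, which makes it especially important. I chose this topic to follow the direction of the fund market faster and better, and mainly to make it easier to keep track of fund developments.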
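II. Design of the Topic-Focused Web Crawler
1. Crawler name: Tiantian Fund (天天基金網) crawler and analysis
2. Content crawled and data characteristics: the crawler visits the Tiantian Fund website, extracts the relevant information (fund name, previous-day unit net value and previous-day cumulative net value), and saves it for visual analysis.
3. Design overview (approach and technical difficulties):
First, send an HTTP request for the page (the code uses HTMLSession from requests_html).
Next, parse the page content with etree from lxml and filter the data with XPath queries.
Finally, save the data to a file (an Excel workbook).
Technical difficulties: crawling the site and filtering the data.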
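The parse step in the overview above can be sketched as follows. This is only an illustration under stated assumptions, not the crawler's actual code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml's etree (for simple paths the two accept similar XPath-style expressions), and SAMPLE_HTML is a made-up, well-formed fragment whose column layout is assumed to mirror the fund list (name in column 5, unit net value in column 6, cumulative net value in column 7). The name extract_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real page is far larger and not guaranteed
# to be well-formed XML, which is why the crawler itself uses lxml.
SAMPLE_HTML = """
<table><tbody>
  <tr><td>1</td><td>2</td><td>3</td><td>000001</td>
      <td><nobr><a>Example Fund A</a><a>archive</a></nobr></td>
      <td>1.2340</td><td>3.4560</td></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>000002</td>
      <td><nobr><a>Example Fund B</a><a>archive</a></nobr></td>
      <td>0.9870</td><td>1.1230</td></tr>
</tbody></table>
"""


def extract_rows(markup):
    """Return (name, unit_value, cumulative_value) string tuples."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall('.//tbody/tr'):
        name = tr.find('td[5]/nobr/a[1]')   # first link in column 5
        unit = tr.find('td[6]')
        total = tr.find('td[7]')
        if name is None or unit is None or total is None:
            continue  # skip rows that do not have the expected layout
        rows.append((name.text, unit.text, total.text))
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Reading all three cells from the same row keeps each name paired with its own values, which is harder to guarantee when three independent column queries are zipped together afterwards.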
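III. Structural Analysis of the Target Pages
1. Structure and characteristics of the target pages
Data source: http://fund.eastmoney.com/fund.html
2. HTML page parsing
IV. Crawler Program Design
The crawler program consists of the following parts. Each part comes with source code and fairly detailed comments, followed by a screenshot of its output.
1. Data crawling and collection
# Pool of User-Agent strings; one is picked at random for each request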
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
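How such a pool is used for each request can be sketched as below. The sketch is illustrative only: it builds the request with the standard library's urllib.request instead of the requests_html session used in this report, it shortens the pool to two entries, and build_request is a hypothetical helper name. Constructing the Request object does not send anything over the network.

```python
import random
import urllib.request

# Shortened, illustrative pool; the report's full list works the same way.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
]


def build_request(url, user_agents=USER_AGENTS):
    """Build (but do not send) a request carrying a random User-Agent."""
    headers = {'User-Agent': random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


req = build_request('http://fund.eastmoney.com/fund.html')
# urllib stores header names capitalised, hence 'User-agent'.
print(req.get_header('User-agent'))
```

Choosing the header per request, rather than once at start-up, means consecutive requests are less likely to present the same browser signature.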
2. Data cleaning and processing
# Methods of the DFSpider class; the imports, the session object and
# parse_save_excel are defined in the complete program in part 5.
class DFSpider(object):

    def __init__(self):
        # Start URL
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """Request the start page and parse the response."""
        # Pick a random User-Agent for this request
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Send the request and read the response bytes
        content = session.get(self.start_url, headers=headers).content
        # Turn the bytes into an element tree that supports XPath queries
        first_page = etree.HTML(content)
        # Go on to fetch the second data page
        self.parse_next_url_response(first_page)

    def parse_next_url_response(self, first_page):
        """Request the second data page and parse the response."""
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Request the second page (self.next_url, not the start URL again)
        content = session.get(self.next_url, headers=headers).content
        second_page = etree.HTML(content)
        # Extract the data from both pages
        self.parse_response_data(first_page, second_page)

    def parse_response_data(self, first_page, second_page):
        """Extract names and net values from both pages and merge them."""
        # Fund names
        name_list_1 = first_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list_2 = second_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list = name_list_1 + name_list_2
        # Previous-day unit net values
        num_1_list_data_1 = first_page.xpath('//tbody/tr/td[6]/text()')
        num_1_list_data_2 = second_page.xpath('//tr/td[6]/span/text()')
        num_1_list = num_1_list_data_1 + num_1_list_data_2
        # Previous-day cumulative net values
        num_2_list_data_1 = first_page.xpath('//tbody/tr/td[7]/text()')
        num_2_list_data_2 = second_page.xpath('//tr/td[7]/text()')
        num_2_list = num_2_list_data_1 + num_2_list_data_2
        # Save the three lists
        self.for_parse_three_list(name_list, num_1_list, num_2_list)

    def for_parse_three_list(self, name_list, num_1_list, num_2_list):
        """
        Save every record, then start the analysis.
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        """
        for a, b, c in zip(name_list, num_1_list, num_2_list):
            # The dict key must match the worksheet name used by parse_save_excel
            dict_data = {'Fund data': [a, b, c]}
            self.parse_save_excel(dict_data)
            print(f'Fund {a}: saved')
        # All records are saved; generate the charts
        self.parse_random_data(name_list, num_1_list, num_2_list)

    def parse_random_data(self, name_list, num_1_list, num_2_list):
        """Randomly pick up to 15 records for analysis."""
        # Only indexes that are valid for all three lists may be drawn
        total = min(len(name_list), len(num_1_list), len(num_2_list))
        # Draw distinct indexes so that no fund appears twice in a chart
        index_list = random.sample(range(total), min(15, total))
        self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)
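The XPath queries return every value as a string, so the net values need to be converted before they can be compared or plotted. The sketch below shows one possible cleaning and sampling step in plain Python. It is not part of the original program: the helper names to_float, clean_rows and sample_rows are hypothetical, and the idea that a missing value appears as a placeholder such as '---' is an assumption about the page rather than something confirmed here.

```python
import random


def to_float(text):
    """Return the value as a float, or None if it is not numeric.

    Assumption: missing values show up as non-numeric text such as '---'.
    """
    try:
        return float(text.strip())
    except (AttributeError, ValueError):
        return None


def clean_rows(names, unit_values, total_values):
    """Keep only the records whose two net values are both numeric."""
    rows = []
    for name, unit, total in zip(names, unit_values, total_values):
        unit_f, total_f = to_float(unit), to_float(total)
        if unit_f is None or total_f is None:
            continue  # drop incomplete records
        rows.append((name.strip(), unit_f, total_f))
    return rows


def sample_rows(rows, k=15, seed=None):
    """Draw up to k distinct records; never more than are available."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows = clean_rows(
    ['Fund A ', 'Fund B', 'Fund C'],
    ['1.2340', '---', '0.9870'],
    ['3.4560', '2.0000', '1.1230'],
)
picked = sample_rows(rows, k=15, seed=0)
print(rows)
print(len(picked))
```

Sampling from the cleaned rows instead of from a fixed index range avoids an IndexError when fewer records were scraped than expected.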
4. Data analysis and visualization (e.g. bar chart, histogram, scatter plot, box plot, distribution plot)
    # Method of the DFSpider class; plt and np are imported in the complete program in part 5.
    def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list):
        """
        Draw four charts from the sampled records.
        :param index_list: indexes of the randomly chosen records
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        """
        title_list = [name_list[i] for i in index_list]         # fund names
        # Convert the scraped strings to numbers so the axes are numeric
        qy_num_1 = [float(num_1_list[i]) for i in index_list]   # unit net values
        qy_num_2 = [float(num_2_list[i]) for i in index_list]   # cumulative net values

        # These two settings let matplotlib render the Chinese fund names
        plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False

        # Chart 1: line chart of both net values
        plt.figure()
        # Arguments: x values, y values, marker, line style, colour, opacity, line width, label
        plt.plot(title_list, qy_num_2, marker='o', linestyle='-', color='#4169E1',
                 alpha=0.8, linewidth=1, label='Cumulative net value')
        plt.plot(title_list, qy_num_1, marker='o', linestyle='-', color='#69e141',
                 alpha=0.8, linewidth=1, label='Unit net value')
        # Without this call the labels passed to plot() are not displayed
        plt.legend(loc='upper right')
        plt.xticks(rotation=270)
        plt.xlabel('Fund')
        plt.ylabel('Net value')
        plt.savefig('net_value_line.png', bbox_inches='tight')
        plt.show()

        # Chart 2: pie chart of the unit net values
        plt.figure()
        plt.pie(qy_num_1, labels=title_list, autopct='%1.1f%%')
        plt.title('Unit net value comparison')
        plt.savefig('unit_net_value_pie.png')
        plt.show()

        # Chart 3: scatter plot of the cumulative net values
        plt.figure()
        colors = np.random.rand(len(qy_num_2))               # one colour value per point
        plt.scatter(qy_num_2, title_list, s=200, c=colors)   # marker size 200
        plt.xlabel('Cumulative net value')
        plt.xticks(rotation=270)
        plt.ylabel('Fund')
        plt.savefig('cumulative_net_value_scatter.png', bbox_inches='tight')
        plt.show()

        # Chart 4: grouped bar chart of both net values
        plt.figure()
        width = 0.5                                           # bar width
        index = np.arange(len(title_list))
        plt.bar(index, qy_num_1, width, color='steelblue',
                tick_label=title_list, label='Unit net value')
        plt.bar(index + width, qy_num_2, width, color='red',
                hatch='\\', label='Cumulative net value')
        for a, b in zip(index, qy_num_1):
            # Print the value above each unit net value bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('Net value bar chart')
        plt.ylabel('Net value')
        plt.legend()
        plt.savefig('net_value_bar.png', bbox_inches='tight')
        plt.show()
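The bar chart places its two series side by side by shifting the second one by the bar width (index + width). The position arithmetic behind that can be sketched in plain Python as below, generalized to any number of series and centred on the tick. The function grouped_bar_positions and its group_width parameter are hypothetical names introduced for illustration; the resulting lists could be passed as the x positions of plt.bar.

```python
def grouped_bar_positions(n_groups, n_series, group_width=0.8):
    """Return (bar_width, positions) for side-by-side bars.

    positions[s][g] is the x centre of series s in group g; each group
    is centred on the integer tick g and spans group_width in total.
    """
    bar_width = group_width / n_series
    positions = []
    for s in range(n_series):
        # Offset of this series from the centre of its group
        offset = (s - (n_series - 1) / 2) * bar_width
        positions.append([g + offset for g in range(n_groups)])
    return bar_width, positions


bar_width, positions = grouped_bar_positions(n_groups=3, n_series=2)
print(bar_width)
print(positions)
```

Keeping group_width below 1 leaves a gap between neighbouring funds, so the bars of one fund do not touch those of the next.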
5. Complete program: all of the code above combined
import os
import random

import numpy as np
import xlrd
import xlwt
from lxml import etree
from matplotlib import pyplot as plt
from requests_html import HTMLSession
from xlutils.copy import copy

# Pool of User-Agent strings; one is picked at random for each request
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]

session = HTMLSession()


class DFSpider(object):

    def __init__(self):
        # Start URL
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """Request the start page and parse the response."""
        # Pick a random User-Agent for this request
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Send the request and read the response bytes
        content = session.get(self.start_url, headers=headers).content
        # Turn the bytes into an element tree that supports XPath queries
        first_page = etree.HTML(content)
        # Go on to fetch the second data page
        self.parse_next_url_response(first_page)

    def parse_next_url_response(self, first_page):
        """Request the second data page and parse the response."""
        headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
        # Request the second page (self.next_url, not the start URL again)
        content = session.get(self.next_url, headers=headers).content
        second_page = etree.HTML(content)
        # Extract the data from both pages
        self.parse_response_data(first_page, second_page)

    def parse_response_data(self, first_page, second_page):
        """Extract names and net values from both pages and merge them."""
        # Fund names
        name_list_1 = first_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list_2 = second_page.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list = name_list_1 + name_list_2
        # Previous-day unit net values
        num_1_list_data_1 = first_page.xpath('//tbody/tr/td[6]/text()')
        num_1_list_data_2 = second_page.xpath('//tr/td[6]/span/text()')
        num_1_list = num_1_list_data_1 + num_1_list_data_2
        # Previous-day cumulative net values
        num_2_list_data_1 = first_page.xpath('//tbody/tr/td[7]/text()')
        num_2_list_data_2 = second_page.xpath('//tr/td[7]/text()')
        num_2_list = num_2_list_data_1 + num_2_list_data_2
        # Save the three lists
        self.for_parse_three_list(name_list, num_1_list, num_2_list)

    def for_parse_three_list(self, name_list, num_1_list, num_2_list):
        """
        Save every record, then start the analysis.
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        """
        for a, b, c in zip(name_list, num_1_list, num_2_list):
            # The dict key must match the worksheet name used by parse_save_excel
            dict_data = {'Fund data': [a, b, c]}
            self.parse_save_excel(dict_data)
            print(f'Fund {a}: saved')
        # All records are saved; generate the charts
        self.parse_random_data(name_list, num_1_list, num_2_list)

    def parse_random_data(self, name_list, num_1_list, num_2_list):
        """Randomly pick up to 15 records for analysis."""
        # Only indexes that are valid for all three lists may be drawn
        total = min(len(name_list), len(num_1_list), len(num_2_list))
        # Draw distinct indexes so that no fund appears twice in a chart
        index_list = random.sample(range(total), min(15, total))
        self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)

    def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list):
        """
        Draw four charts from the sampled records.
        :param index_list: indexes of the randomly chosen records
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        """
        title_list = [name_list[i] for i in index_list]         # fund names
        # Convert the scraped strings to numbers so the axes are numeric
        qy_num_1 = [float(num_1_list[i]) for i in index_list]   # unit net values
        qy_num_2 = [float(num_2_list[i]) for i in index_list]   # cumulative net values

        # These two settings let matplotlib render the Chinese fund names
        plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False

        # Chart 1: line chart of both net values
        plt.figure()
        # Arguments: x values, y values, marker, line style, colour, opacity, line width, label
        plt.plot(title_list, qy_num_2, marker='o', linestyle='-', color='#4169E1',
                 alpha=0.8, linewidth=1, label='Cumulative net value')
        plt.plot(title_list, qy_num_1, marker='o', linestyle='-', color='#69e141',
                 alpha=0.8, linewidth=1, label='Unit net value')
        # Without this call the labels passed to plot() are not displayed
        plt.legend(loc='upper right')
        plt.xticks(rotation=270)
        plt.xlabel('Fund')
        plt.ylabel('Net value')
        plt.savefig('net_value_line.png', bbox_inches='tight')
        plt.show()

        # Chart 2: pie chart of the unit net values
        plt.figure()
        plt.pie(qy_num_1, labels=title_list, autopct='%1.1f%%')
        plt.title('Unit net value comparison')
        plt.savefig('unit_net_value_pie.png')
        plt.show()

        # Chart 3: scatter plot of the cumulative net values
        plt.figure()
        colors = np.random.rand(len(qy_num_2))               # one colour value per point
        plt.scatter(qy_num_2, title_list, s=200, c=colors)   # marker size 200
        plt.xlabel('Cumulative net value')
        plt.xticks(rotation=270)
        plt.ylabel('Fund')
        plt.savefig('cumulative_net_value_scatter.png', bbox_inches='tight')
        plt.show()

        # Chart 4: grouped bar chart of both net values
        plt.figure()
        width = 0.5                                           # bar width
        index = np.arange(len(title_list))
        plt.bar(index, qy_num_1, width, color='steelblue',
                tick_label=title_list, label='Unit net value')
        plt.bar(index + width, qy_num_2, width, color='red',
                hatch='\\', label='Cumulative net value')
        for a, b in zip(index, qy_num_1):
            # Print the value above each unit net value bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('Net value bar chart')
        plt.ylabel('Net value')
        plt.legend()
        plt.savefig('net_value_bar.png', bbox_inches='tight')
        plt.show()

    def parse_save_excel(self, data_dict):
        """Append one record to the Excel file, creating the file if needed."""
        # Create the output folder if it does not exist yet
        folder = os.path.join(os.getcwd(), 'data')
        os.makedirs(folder, exist_ok=True)
        os_path = os.path.join(folder, 'fund_data.xls')
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file) with a header row
            workbook = xlwt.Workbook(encoding='utf-8')
            worksheet1 = workbook.add_sheet('Fund data', cell_overwrite_ok=True)
            excel_data_1 = ('Fund name', 'Previous-day unit net value',
                            'Previous-day cumulative net value')
            for i, title in enumerate(excel_data_1):
                worksheet1.col(i).width = 2560 * 3
                # Arguments: row, column, content
                worksheet1.write(0, i, title)
            workbook.save(os_path)
        # Open the workbook and append the record after the last used row
        workbook = xlrd.open_workbook(os_path)
        sheets = workbook.sheet_names()
        for i, sheet_name in enumerate(sheets):
            for name in data_dict:
                # Write only to the sheet whose name matches the dict key
                if sheet_name == name:
                    # Number of rows already present in the sheet
                    rows_old = workbook.sheet_by_name(sheet_name).nrows
                    # Convert the read-only xlrd workbook into a writable xlwt copy
                    new_workbook = copy(workbook)
                    new_worksheet = new_workbook.get_sheet(i)
                    for num, value in enumerate(data_dict[name]):
                        new_worksheet.write(rows_old, num, value)
                    new_workbook.save(os_path)

    def run(self):
        """Entry point."""
        self.parse_start_url()


if __name__ == '__main__':
    d = DFSpider()
    d.run()
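In the program above, parse_save_excel reopens, copies and rewrites the whole workbook once for every record. A lighter alternative is to collect the records first and write them in one pass. The sketch below illustrates that idea with the standard library's csv module instead of xlwt and xlrd, so it is a swapped-in technique rather than the program's own method; save_rows_csv, the header text and the output path are hypothetical.

```python
import csv
import os
import tempfile


def save_rows_csv(path, rows,
                  header=('Fund name', 'Unit net value', 'Cumulative net value')):
    """Write the header and all rows in a single pass; return the row count."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    # utf-8-sig adds a byte-order mark so that Excel detects the encoding
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Illustrative records written to a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'data', 'fund_data.csv')
written = save_rows_csv(out_path, [
    ('Fund A', '1.2340', '3.4560'),
    ('Fund B', '0.9870', '1.1230'),
])
print(written, out_path)
```

Writing once keeps the cost proportional to the number of records, whereas rewriting the file per record grows roughly with the square of that number.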
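V. Summary
This course project gave me a deeper understanding of Python and made me more proficient with its crawling techniques. I ran into many problems along the way, but I was able to complete the project with help from classmates and teachers and with material I found online.
The project also showed me that I still have many shortcomings and many blind spots in my knowledge of Python, which has made me take learning Python all the more seriously.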