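Python: Crawling and Analyzing Tiantian Fund (天天基金網)


I. Background

Why this topic, and what is the data analysis expected to achieve?

As the internet enters the big-data era, people have more and more ways to obtain information. Financial information is closely tied to everyday life, so it is especially important. I chose this topic to follow the direction of the fund market faster and more accurately, and mainly to make it easier to keep track of fund developments.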

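II. Design of the topic-focused web crawler

1. Crawler name: Tiantian Fund crawler and analysis

2. Content crawled and data characteristics: the crawler visits the Tiantian Fund website, scrapes each fund's name, previous-day unit net value and previous-day cumulative net value, saves them, and then visualizes them.

3. Design overview (approach and technical difficulties):

First, request the pages with an HTMLSession from requests_html.

Second, parse the page content with lxml's etree and filter the data with etree XPath queries.

Finally, save the data with file operations (an Excel workbook).

Technical difficulties: crawling the site and filtering the data.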
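The parse-and-filter step can be tried without touching the network. The following is only a minimal sketch: it uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath, as a stand-in for lxml, and the `SAMPLE` markup is a hypothetical, well-formed fragment rather than the real page. The positional paths mirror the ones used in the crawler listing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the fund table (not the real page)
SAMPLE = """
<table><tbody>
  <tr><td>1</td><td>2</td><td>3</td><td>000001</td>
      <td><nobr><a>Fund A</a><a>archive</a></nobr></td>
      <td>1.2340</td><td>3.4560</td></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>000002</td>
      <td><nobr><a>Fund B</a><a>archive</a></nobr></td>
      <td>0.9870</td><td>0.9870</td></tr>
</tbody></table>
"""

root = ET.fromstring(SAMPLE.strip())

# Column 5 holds the name link, columns 6 and 7 hold the two net values
names = [a.text for a in root.findall('.//tbody/tr/td[5]/nobr/a[1]')]
unit_values = [td.text for td in root.findall('.//tbody/tr/td[6]')]
total_values = [td.text for td in root.findall('.//tbody/tr/td[7]')]

# Pair the three columns row by row
rows = list(zip(names, unit_values, total_values))
```

With lxml, `etree.HTML(...)` also tolerates real-world HTML that is not well formed, and its `xpath()` method accepts `text()` steps, which ElementTree does not.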

III. Structure of the target pages

1. Page structure and features

Data source: http://fund.eastmoney.com/fund.html

2. HTML page parsing

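IV. Crawler program design

The crawler program should include the parts below, with source code and fairly detailed comments, and a screenshot of the output after each part.

1. Data crawling and collection

# Pool of User-Agent strings; one is picked at random for each request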
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
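How the pool is used can be shown in isolation. This is a small sketch, not the author's code: `UA_POOL` is a shortened, hypothetical stand-in for `USER_AGENT_LIST`, and `build_headers` is a helper name introduced here for illustration.

```python
import random

# Short hypothetical pool standing in for USER_AGENT_LIST
UA_POOL = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
]


def build_headers(pool, rng=random):
    """Return request headers with a User-Agent drawn at random from pool."""
    return {'User-Agent': rng.choice(pool)}


# A seeded generator makes the choice repeatable while experimenting
headers = build_headers(UA_POOL, random.Random(42))
```

Building the headers in one helper would also avoid repeating the same dictionary in every request method.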

 

 

 2. Data cleaning and processing

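# Excerpt from the DFSpider class; the imports and the shared `session`
# object are defined in the complete listing in part 5.
class DFSpider(object):

    def __init__(self):
        # Start URL: the first data page
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """
        Request the first page and parse the response.
        :return:
        """
        headers = {
            # Pick a random User-Agent from the pool for this request
            'User-Agent': random.choice(USER_AGENT_LIST)
        }
        # Send the request and read the response body as bytes
        response = session.get(self.start_url, headers=headers).content
        # Parse the bytes into an element tree that supports XPath queries
        response = etree.HTML(response)
        # Hand the first page's tree to the method that fetches the second page
        self.parse_next_url_response(response)

    def parse_next_url_response(self, response_1):
        """
        Request the second data page and parse the response.
        :param response_1: parsed tree of the first page
        :return:
        """
        headers = {
            # Pick a random User-Agent from the pool for this request
            'User-Agent': random.choice(USER_AGENT_LIST)
        }
        # Request the second page and read the response body as bytes
        response = session.get(self.next_url, headers=headers).content
        # Parse the bytes into an element tree that supports XPath queries
        response = etree.HTML(response)
        # Pass both trees on: the second page first, then the first page
        self.parse_response_data(response, response_1)

    def parse_response_data(self, response_1, response):
        """
        Extract fund names and net values from both pages.
        :param response_1: parsed tree of the second page
        :param response: parsed tree of the first page
        :return:
        """
        # Fund names
        name_list_1 = response.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list_2 = response_1.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        # Merge the two pages
        name_list = name_list_1 + name_list_2
        # Previous-day unit net values
        num_1_list_data_1 = response.xpath('//tbody/tr/td[6]/text()')
        num_1_list_data_2 = response_1.xpath('//tr/td[6]/span/text()')
        # Merge the two pages
        num_1_list = num_1_list_data_1 + num_1_list_data_2
        # Previous-day cumulative net values
        num_2_list_data_1 = response.xpath('//tbody/tr/td[7]/text()')
        num_2_list_data_2 = response_1.xpath('//tr/td[7]/text()')
        # Merge the two pages
        num_2_list = num_2_list_data_1 + num_2_list_data_2
        # Save the rows, then draw the charts
        self.for_parse_three_list(name_list, num_1_list, num_2_list)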

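    def for_parse_three_list(self, name_list, num_1_list, num_2_list):
        """
        Walk the three lists in parallel and save each row.
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        :return:
        """
        # zip stops at the shortest list, so unmatched trailing items are dropped
        for a, b, c in zip(name_list, num_1_list, num_2_list):
            # One row to save; the key must match the sheet name in parse_save_excel
            dict_data = {
                '股票數據': [a, b, c]
            }
            # Append the row to the Excel workbook
            self.parse_save_excel(dict_data)
            # Progress message: "fund <name> collected"
            print(f'基金:{a}----采集完成!')
        # All rows are saved; move on to the charts
        self.parse_random_data(name_list, num_1_list, num_2_list)

    def parse_random_data(self, name_list, num_1_list, num_2_list):
        """
        Randomly pick up to 15 rows for analysis.
        :return:
        """
        # Only indexes that are valid in all three lists can be sampled
        row_count = min(len(name_list), len(num_1_list), len(num_2_list))
        # random.sample draws without replacement, so no row is picked twice
        index_list = random.sample(range(row_count), min(15, row_count))
        # With the indexes chosen, draw the four charts
        self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)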
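The values scraped by XPath are strings, so a cleaning step before sampling helps the later charts. The sketch below is one possible approach, not the author's method: `to_float_rows` and `sample_rows` are hypothetical helper names, and the `'---'` placeholder is an assumed example of a cell that is not a number.

```python
import random


def to_float_rows(names, unit_values, total_values):
    """Pair up the three lists and keep only rows whose values parse as floats."""
    rows = []
    for name, unit, total in zip(names, unit_values, total_values):
        try:
            rows.append((name.strip(), float(unit), float(total)))
        except ValueError:
            # Cells that are not numbers are skipped instead of crashing
            continue
    return rows


def sample_rows(rows, k=15, seed=None):
    """Draw up to k distinct rows; never asks for more rows than exist."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


names = ['Fund A', 'Fund B', 'Fund C']
units = ['1.2340', '---', '0.9870']
totals = ['3.4560', '2.0000', '0.9870']

rows = to_float_rows(names, units, totals)
picked = sample_rows(rows, k=15, seed=0)
```

Sampling the cleaned rows, rather than raw index numbers, keeps every picked row inside the data and guarantees that each value can be plotted on a numeric axis.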

4. Data analysis and visualization (for example: bar chart, histogram, scatter plot, box plot, distribution plot)

    def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list):
        """
        Draw four charts from the sampled rows.
        :param index_list: indexes of the randomly sampled rows
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        :return:
        """
        title_list = []  # fund names
        qy_num_1 = []    # unit net values
        qy_num_2 = []    # cumulative net values
        for index_num in index_list:
            try:
                name = name_list[index_num]
                # The scraped values are strings; convert them so the axes are numeric
                unit_value = float(num_1_list[index_num])
                total_value = float(num_2_list[index_num])
            except (IndexError, ValueError):
                # Skip indexes past the end of the data and cells that are not numbers
                continue
            title_list.append(name)
            qy_num_1.append(unit_value)
            qy_num_2.append(total_value)

        # Chart 1: line chart of both net values
        # These two settings let matplotlib render Chinese text and the minus sign
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        # Arguments: x values, y values, marker and line style, colour, opacity, line width, legend label
        plt.plot(title_list, qy_num_2, 'o-', color='#4169E1', alpha=0.8, linewidth=1, label='累計凈值')
        plt.plot(title_list, qy_num_1, 'o-', color='#69e141', alpha=0.8, linewidth=1, label='單位凈值')
        # Without legend() the label= arguments above are never displayed
        plt.legend(loc="upper right")
        plt.xticks(rotation=270)
        plt.xlabel('基金名稱')  # fund name
        plt.ylabel('凈值')  # net value
        plt.savefig('根據凈值生成折線圖.png')
        plt.show()
        plt.close()  # start the next chart on a fresh figure

        # Chart 2: pie chart of unit net values
        plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.pie(qy_num_1, labels=title_list, autopct='%1.1f%%')
        plt.title('單位凈值對比')  # unit net value comparison
        plt.savefig('單位凈值對比-餅圖')
        plt.show()
        plt.close()

        # Chart 3: scatter plot of cumulative net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        colors = np.random.rand(len(qy_num_2))  # one random colour value per point
        plt.scatter(qy_num_2, title_list, s=200, c=colors)  # marker size 200
        plt.xlabel('累計凈值')  # x axis: cumulative net value
        plt.xticks(rotation=270)
        plt.ylabel('名稱')  # y axis: fund name
        plt.savefig('凈值散點圖.png')
        plt.show()
        plt.close()

        # Chart 4: bar chart of both net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        width = 0.5  # width of each bar
        index = np.arange(len(title_list))
        plt.bar(index, qy_num_1, width, color='steelblue', tick_label=title_list, label='單位凈值')
        plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='累計凈值')
        for a, b in zip(index, qy_num_1):  # print the value above each unit-net-value bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('凈值柱狀圖')  # net value bar chart
        plt.ylabel('凈值')  # net value
        plt.legend()
        plt.savefig('凈值-柱狀圖', bbox_inches='tight')
        plt.show()
        plt.close()
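In the bar chart, the second series is shifted by one bar width, so each tick label sits under the first bar of its pair. One way to centre the pairs is to compute the bar positions explicitly. The helper below is a hypothetical sketch in plain Python; `grouped_bar_offsets` and its `group_width` parameter are names introduced here, not part of matplotlib.

```python
def grouped_bar_offsets(n_groups, n_series, group_width=0.8):
    """Return (bar_width, positions), where positions[s][g] is the x centre
    of series s in group g and every group is centred on the integer g."""
    bar_width = group_width / n_series
    positions = []
    for s in range(n_series):
        # Shift each series so the whole group stays centred on g
        shift = (s - (n_series - 1) / 2) * bar_width
        positions.append([g + shift for g in range(n_groups)])
    return bar_width, positions


# Three funds, two series (unit and cumulative net value)
bar_width, positions = grouped_bar_offsets(n_groups=3, n_series=2)
```

Each `positions[s]` list could then be passed to `plt.bar` together with `width=bar_width`, with the tick labels placed at `range(n_groups)`.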

 

 

 

 

 

 

 

 

 

 

 

 5. Complete program: all of the parts above combined

# Pool of User-Agent strings; one is picked at random for each request
USER_AGENT_LIST = [
                  'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
                  'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
                  'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
                  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
                  'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
                  'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
                  'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
                  'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
                  'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
                  'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
                  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
                  ]

from requests_html import HTMLSession
import os, xlwt, xlrd, random
from xlutils.copy import copy
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.font_manager import FontProperties  # font helper
from lxml import etree
session = HTMLSession()


class DFSpider(object):

    def __init__(self):
        # Start URL: the first data page
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """
        Request the first page and parse the response.
        :return:
        """
        headers = {
            # Pick a random User-Agent from the pool for this request
            'User-Agent': random.choice(USER_AGENT_LIST)
        }
        # Send the request and read the response body as bytes
        response = session.get(self.start_url, headers=headers).content
        # Parse the bytes into an element tree that supports XPath queries
        response = etree.HTML(response)
        # Hand the first page's tree to the method that fetches the second page
        self.parse_next_url_response(response)

    def parse_next_url_response(self, response_1):
        """
        Request the second data page and parse the response.
        :param response_1: parsed tree of the first page
        :return:
        """
        headers = {
            # Pick a random User-Agent from the pool for this request
            'User-Agent': random.choice(USER_AGENT_LIST)
        }
        # Request the second page and read the response body as bytes
        response = session.get(self.next_url, headers=headers).content
        # Parse the bytes into an element tree that supports XPath queries
        response = etree.HTML(response)
        # Pass both trees on: the second page first, then the first page
        self.parse_response_data(response, response_1)

    def parse_response_data(self, response_1, response):
        """
        Extract fund names and net values from both pages.
        :param response_1: parsed tree of the second page
        :param response: parsed tree of the first page
        :return:
        """
        # Fund names
        name_list_1 = response.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_list_2 = response_1.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        # Merge the two pages
        name_list = name_list_1 + name_list_2
        # Previous-day unit net values
        num_1_list_data_1 = response.xpath('//tbody/tr/td[6]/text()')
        num_1_list_data_2 = response_1.xpath('//tr/td[6]/span/text()')
        # Merge the two pages
        num_1_list = num_1_list_data_1 + num_1_list_data_2
        # Previous-day cumulative net values
        num_2_list_data_1 = response.xpath('//tbody/tr/td[7]/text()')
        num_2_list_data_2 = response_1.xpath('//tr/td[7]/text()')
        # Merge the two pages
        num_2_list = num_2_list_data_1 + num_2_list_data_2
        # Save the rows, then draw the charts
        self.for_parse_three_list(name_list, num_1_list, num_2_list)

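    def for_parse_three_list(self, name_list, num_1_list, num_2_list):
        """
        Walk the three lists in parallel and save each row.
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        :return:
        """
        # zip stops at the shortest list, so unmatched trailing items are dropped
        for a, b, c in zip(name_list, num_1_list, num_2_list):
            # One row to save; the key must match the sheet name in parse_save_excel
            dict_data = {
                '股票數據': [a, b, c]
            }
            # Append the row to the Excel workbook
            self.parse_save_excel(dict_data)
            # Progress message: "fund <name> collected"
            print(f'基金:{a}----采集完成!')
        # All rows are saved; move on to the charts
        self.parse_random_data(name_list, num_1_list, num_2_list)

    def parse_random_data(self, name_list, num_1_list, num_2_list):
        """
        Randomly pick up to 15 rows for analysis.
        :return:
        """
        # Only indexes that are valid in all three lists can be sampled
        row_count = min(len(name_list), len(num_1_list), len(num_2_list))
        # random.sample draws without replacement, so no row is picked twice
        index_list = random.sample(range(row_count), min(15, row_count))
        # With the indexes chosen, draw the four charts
        self.parse_img_four_func(index_list, name_list, num_1_list, num_2_list)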

    def parse_img_four_func(self, index_list, name_list, num_1_list, num_2_list):
        """
        Draw four charts from the sampled rows.
        :param index_list: indexes of the randomly sampled rows
        :param name_list: fund names
        :param num_1_list: previous-day unit net values
        :param num_2_list: previous-day cumulative net values
        :return:
        """
        title_list = []  # fund names
        qy_num_1 = []    # unit net values
        qy_num_2 = []    # cumulative net values
        for index_num in index_list:
            try:
                name = name_list[index_num]
                # The scraped values are strings; convert them so the axes are numeric
                unit_value = float(num_1_list[index_num])
                total_value = float(num_2_list[index_num])
            except (IndexError, ValueError):
                # Skip indexes past the end of the data and cells that are not numbers
                continue
            title_list.append(name)
            qy_num_1.append(unit_value)
            qy_num_2.append(total_value)

        # Chart 1: line chart of both net values
        # These two settings let matplotlib render Chinese text and the minus sign
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        # Arguments: x values, y values, marker and line style, colour, opacity, line width, legend label
        plt.plot(title_list, qy_num_2, 'o-', color='#4169E1', alpha=0.8, linewidth=1, label='累計凈值')
        plt.plot(title_list, qy_num_1, 'o-', color='#69e141', alpha=0.8, linewidth=1, label='單位凈值')
        # Without legend() the label= arguments above are never displayed
        plt.legend(loc="upper right")
        plt.xticks(rotation=270)
        plt.xlabel('基金名稱')  # fund name
        plt.ylabel('凈值')  # net value
        plt.savefig('根據凈值生成折線圖.png')
        plt.show()
        plt.close()  # start the next chart on a fresh figure

        # Chart 2: pie chart of unit net values
        plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.pie(qy_num_1, labels=title_list, autopct='%1.1f%%')
        plt.title('單位凈值對比')  # unit net value comparison
        plt.savefig('單位凈值對比-餅圖')
        plt.show()
        plt.close()

        # Chart 3: scatter plot of cumulative net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        colors = np.random.rand(len(qy_num_2))  # one random colour value per point
        plt.scatter(qy_num_2, title_list, s=200, c=colors)  # marker size 200
        plt.xlabel('累計凈值')  # x axis: cumulative net value
        plt.xticks(rotation=270)
        plt.ylabel('名稱')  # y axis: fund name
        plt.savefig('凈值散點圖.png')
        plt.show()
        plt.close()

        # Chart 4: bar chart of both net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        width = 0.5  # width of each bar
        index = np.arange(len(title_list))
        plt.bar(index, qy_num_1, width, color='steelblue', tick_label=title_list, label='單位凈值')
        plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='累計凈值')
        for a, b in zip(index, qy_num_1):  # print the value above each unit-net-value bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('凈值柱狀圖')  # net value bar chart
        plt.ylabel('凈值')  # net value
        plt.legend()
        plt.savefig('凈值-柱狀圖', bbox_inches='tight')
        plt.show()
        plt.close()

    def parse_save_excel(self, data_dict):
        """
        Append one row to the Excel workbook, creating the workbook on first use.
        :param data_dict: maps a sheet name to the list of cell values for one row
        :return:
        """
        # Create the output folder if it does not exist yet
        os_path_1 = os.getcwd() + '/數據/'
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        os_path = os_path_1 + '股票數據.xls'
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Add the sheet that will hold the data
            worksheet1 = workbook.add_sheet("股票數據", cell_overwrite_ok=True)
            # Header row: name, previous-day unit net value, previous-day cumulative net value
            excel_data_1 = ('股票名稱', '昨日單位凈值', '昨日累計凈值')
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                # Arguments: row, column, content
                worksheet1.write(0, i, excel_data_1[i])
            workbook.save(os_path)
        # The workbook exists by now; open it and append the row
        if os.path.exists(os_path):
            # Open the workbook for reading
            workbook = xlrd.open_workbook(os_path)
            # Names of all sheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data_dict.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Only write to the sheet whose name matches the key
                    if worksheet.name == name:
                        # Number of rows already in the sheet
                        rows_old = worksheet.nrows
                        # Copy the read-only xlrd workbook into a writable xlwt workbook
                        new_workbook = copy(workbook)
                        # Sheet i of the writable copy
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data_dict[name])):
                            new_worksheet.write(rows_old, num, data_dict[name][num])
                        new_workbook.save(os_path)

    def run(self):
        """
        Entry point.
        :return:
        """
        self.parse_start_url()


if __name__ == '__main__':
    d = DFSpider()
    d.run()
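The listing reopens, copies and saves the workbook once for every row. A possible alternative is to collect all rows first and write them in a single pass. The sketch below illustrates that idea with the standard library's `csv` module as a stand-in for the Excel libraries; `write_rows` and the English column names are assumptions made for this example only.

```python
import csv
import io


def write_rows(handle, header, rows):
    """Write the header and every row to an open text handle in one pass."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# An in-memory buffer keeps the example free of side effects
buffer = io.StringIO()
write_rows(
    buffer,
    ('name', 'unit_net_value', 'cumulative_net_value'),
    [('Fund A', 1.234, 3.456), ('Fund B', 0.987, 0.987)],
)
saved_lines = buffer.getvalue().splitlines()
```

To write a real file, the same function can be given a handle from `open(path, 'w', newline='', encoding='utf-8')`.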

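V. Summary

Through this course project I gained a deeper understanding of Python and became more practiced with its crawling techniques. I ran into many problems along the way, but with help from classmates and teachers, plus material I found online, I was able to complete the project.

I also found that I still have many shortcomings and blind spots in my Python knowledge, which has made me take learning Python more seriously.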

 

