Python Web Scraping: Visualizing Data from Kongfz (孔夫子舊書網), a Used-Book Marketplace


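I. Background

Books can now be bought through many channels: JD.com, Taobao, Tmall, Dangdang, Xianyu, and others. For this project I chose to scrape and visualize data on second-hand periodicals.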

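II. Crawler Design

Crawler name: Kongfz (孔夫子舊書網) periodical data crawler

Content: crawl the prices of second-hand periodicals, then visualize the data.

Approach:

1. Send HTTP requests with the requests library.

2. Parse the returned pages and extract the data, using lxml's etree and its xpath() method.

3. Save the data to a CSV file (Kongfuzi.csv) using Python's built-in file writing.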
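For the saving step, joining fields with "," by hand breaks as soon as a book title itself contains a comma. A minimal sketch of one alternative, using the standard csv module: the function name write_rows, the two-column layout, and the sample title are assumptions made for illustration, not part of the original crawler.

```python
import csv
import io


def write_rows(handle, rows):
    """Write a header and data rows; csv.writer quotes fields that contain commas."""
    writer = csv.writer(handle)
    writer.writerow(["bookname", "price"])
    writer.writerows(rows)


# An in-memory buffer stands in for the real Kongfuzi.csv file.
buffer = io.StringIO()
write_rows(buffer, [("Reader, 1998 issue 3", "12.50")])

# Reading the data back shows the comma inside the title did not split the field.
buffer.seek(0)
parsed = list(csv.reader(buffer))
```

With a real file, the same function could be called on a handle opened with newline="" so that the csv module controls the line endings.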

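III. Page Structure Analysis

Structure type: content-navigation page

Structure analysis and how each field is located:

# Fields: title (bookname), publisher (publishing_house), delivery rate (delivery), price, listing time (bookTime_on_shelf), shop (bookShop)
# count is the 1-based position of a listing inside the element with id="listBox"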
bookname = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[1]/a/text()'.format(count))
publishing_house = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[2]/div[1]/div/span[2]/text()'.format(count))
delivery = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[2]/span[2]/i/text()'.format(count))
price = html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[1]/div[2]/span[2]/text()'.format(count))
bookTime_on_shelf = html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[4]/span[1]/text()'.format(count))
bookShop = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[1]/div[3]/a/text()'.format(count))

Traversal: each results page is scanned for up to 50 listings, with count as the listing's 1-based index. Because xpath() returns a list, each result is reduced to a single string.

def last_text(nodes):
    # xpath() returns a list; keep the last text node, or "" when nothing matched
    return nodes[-1] if nodes else ""

for count in range(1, 51):
    bookname = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[1]/a/text()'.format(count)))
    publishing_house = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[2]/div[1]/div/span[2]/text()'.format(count)))
    # The delivery rate is shown as a percentage; drop the trailing "%"
    delivery = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[2]/span[2]/i/text()'.format(count))).strip("%")
    price = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[1]/div[2]/span[2]/text()'.format(count)))
    bookTime_on_shelf = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[4]/span[1]/text()'.format(count)))
    bookShop = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[1]/div[3]/a/text()'.format(count)))
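
The absolute paths above depend on the exact nesting of the page, so a small layout change can silently return empty lists. As a hedged sketch of an alternative, each listing node can be selected once and then queried with relative paths. The HTML fragment, the class names, and the function parse_items below are made up for illustration and do not reflect the real Kongfz markup.

```python
from lxml import etree

# Made-up markup for illustration only; the real page structure differs.
SAMPLE = """
<div id="listBox">
  <div class="item"><a class="title">Journal A</a><span class="price">12.50</span></div>
  <div class="item"><a class="title">Journal B</a></div>
</div>
"""


def last_text(nodes):
    """xpath() returns a list; keep the last text node, or "" if nothing matched."""
    return nodes[-1].strip() if nodes else ""


def parse_items(page_source):
    """Return (title, price) tuples, one per listing found under #listBox."""
    html = etree.HTML(page_source)
    rows = []
    # Select every listing once, then use paths relative to that node.
    for item in html.xpath('//*[@id="listBox"]/div'):
        title = last_text(item.xpath('./a[@class="title"]/text()'))
        price = last_text(item.xpath('./span[@class="price"]/text()'))
        rows.append((title, price))
    return rows


rows = parse_items(SAMPLE)
```

The second listing has no price element, so its price comes back as an empty string instead of raising an error, which makes missing fields easy to filter out later.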

IV. Crawler Implementation

Data scraping and collection

Code walkthrough:

import csv
import os
import random
import time

import requests
from lxml import etree


USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:22.0) Gecko/20130328 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Microsoft Windows NT 6.2.9200.0); rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:21.0.0) Gecko/20121011 Firefox/21.0.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20130514 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; rv:21.0) Gecko/20130326 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130401 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130331 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130330 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20130401 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20130328 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20130401 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20130331 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.0; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64;) Gecko/20100101 Firefox/20.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/19.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/18.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0)  Gecko/20100101 Firefox/18.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    'Connection': 'keep-alive',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
}

CSV_PATH = "Kongfuzi.csv"
CSV_HEADER = ["bookname", "publishing_house", "price", "bookTime_on_shelf", "bookShop"]


def last_text(nodes):
    # xpath() returns a list; keep the last text node, or "" when nothing matched
    return nodes[-1] if nodes else ""


def Kongfuzi(pages):
    # Write the header only when the file is new, so reruns keep appending rows
    # without repeating the header line.
    new_file = not os.path.exists(CSV_PATH) or os.path.getsize(CSV_PATH) == 0
    # GBK matches the encoding used when the file is read back with pandas;
    # characters GBK cannot encode are replaced with "?".
    with open(CSV_PATH, "a", newline="", encoding="gbk", errors="replace") as f:
        writer = csv.writer(f)  # quotes fields that contain commas
        if new_file:
            writer.writerow(CSV_HEADER)
        for page in range(0, pages):
            url = "https://book.kongfz.com/Cqikan/cat_10002w{}".format(page)
            try:
                req = requests.get(url=url, headers=headers, timeout=10)
                req.raise_for_status()
            except requests.RequestException as err:
                # Only network and HTTP failures are reported as network errors
                print("Network error:", err)
                return
            html = etree.HTML(req.text)

            # Fields: title (bookname), publisher (publishing_house), delivery rate (delivery),
            # price, listing time (bookTime_on_shelf), shop (bookShop)
            for count in range(1, 51):
                bookname = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[1]/a/text()'.format(count)))
                publishing_house = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[2]/div[1]/div/span[2]/text()'.format(count)))
                delivery = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[2]/span[2]/i/text()'.format(count))).strip("%")
                price = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[1]/div[2]/span[2]/text()'.format(count)))
                bookTime_on_shelf = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[4]/span[1]/text()'.format(count)))
                bookShop = last_text(html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[1]/div[3]/a/text()'.format(count)))
                if not bookname:
                    continue  # no listing at this index

                # Save the row
                writer.writerow([bookname, publishing_house, price, bookTime_on_shelf, bookShop])

                # Show what was saved
                print(bookname,
                      "Publisher:", publishing_house, '\n',
                      "Delivery rate:", delivery, '%\n',
                      "Price:", price, 'yuan\n',
                      "Listed on:", bookTime_on_shelf, '\n',
                      "Shop:", bookShop)
                print('\n')
            # Pause briefly between pages to avoid hammering the site
            time.sleep(random.uniform(1, 2))


if __name__ == '__main__':
    pages = input("Number of pages to crawl: ")
    Kongfuzi(int(pages))
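
In the listing above, random.choice runs only once, when the headers dictionary is built, so every request in a run sends the same User-Agent. If the intent of the long USER_AGENTS list is to vary the header between requests, one possible sketch is to build the headers per request. The helper name build_headers and the shortened two-entry list are assumptions for illustration.

```python
import random

# Shortened list for the example; the crawler's full USER_AGENTS list would be used instead.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]


def build_headers(user_agents, rng=random):
    """Return a fresh headers dict with a User-Agent picked at call time."""
    return {
        'User-Agent': rng.choice(user_agents),
        'Connection': 'keep-alive',
    }


# Each call may pick a different User-Agent from the list.
first = build_headers(USER_AGENTS)
second = build_headers(USER_AGENTS)
```

Inside the page loop, the call would then look like requests.get(url, headers=build_headers(USER_AGENTS), timeout=10).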

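Data cleaning and processing

import pandas as pd

# xs holds the scraped listings read from Kongfuzi.csv
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
xs = pd.read_csv(r'D:\Py_project\Kongfuzi.csv', on_bad_lines='skip', encoding='gbk')
# Drop duplicate titles
xs = xs.drop_duplicates('bookname')
# Convert price to a number so sorting is numeric; unparseable values become NaN
xs['price'] = pd.to_numeric(xs['price'], errors='coerce')
# Drop rows with missing values
xs = xs.dropna(axis=0)
# Sort by price, descending
xs.sort_values(by=["price"], inplace=True, ascending=[False])
xs.head(20)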
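
If the price column is read as text, for example because a stray header row or a non-numeric value ended up in the file, sorting compares strings instead of numbers. The following sketch shows the type conversion on a small in-memory CSV. The sample rows are invented for illustration and are not scraped data.

```python
import io

import pandas as pd

# Invented sample: a repeated header row and a duplicate listing.
RAW = """bookname,publishing_house,price,bookTime_on_shelf,bookShop
Journal A,Press X,12.50,2021-06-01,Shop 1
bookname,publishing_house,price,bookTime_on_shelf,bookShop
Journal B,Press Y,88.00,2021-06-02,Shop 2
Journal A,Press X,12.50,2021-06-01,Shop 1
"""

df = pd.read_csv(io.StringIO(RAW))
# The stray header row forces the column to text; convert it, turning bad values into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# Drop rows whose price could not be parsed, then drop duplicate titles.
df = df.dropna(subset=["price"]).drop_duplicates("bookname")
# Numeric sort, most expensive first.
df = df.sort_values(by="price", ascending=False)
top = df[["bookname", "price"]].values.tolist()
```

After the conversion the repeated header row disappears with the NaN rows, and the remaining listings sort by their numeric price.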

  

 

 

 

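# Price ranking chart
import matplotlib.pyplot as plt
x = xs['bookname'].head(20)
y = xs['price'].head(20)
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese book titles
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly with this font
plt.bar(x, y, alpha=0.2, width=0.4, color='b', lw=3, label="price (bar)")
plt.plot(x, y, '-', color='r', label="price (line)")
plt.xticks(rotation=90)  # rotate after plotting so the rotation applies to the title labels
plt.legend(loc="best")  # legend
plt.title("Price trend")
plt.xlabel("Title")  # x-axis label
plt.ylabel("Price")  # y-axis label
plt.show()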

plt.barh(x, y, alpha=0.2, height=0.4, color='g', label="price", lw=3)
plt.title("Price by title (horizontal bars)")
plt.legend(loc="best")  # legend
plt.xlabel("Price")  # x-axis label
plt.ylabel("Title")  # y-axis label
plt.show()

# Scatter plot
plt.scatter(x, y, color='gray', marker='o', s=40, alpha=0.5)
plt.xticks(rotation=90)
plt.title("Price scatter plot")
plt.xlabel("Title")  # x-axis label (the x values are book titles)
plt.ylabel("Price")  # y-axis label
plt.show()

plt.boxplot(y)
plt.title("Price box plot")
plt.show()
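
To read the box plot it helps to know what it draws: the box spans the first and third quartiles, the line inside is the median, and with matplotlib's default whis=1.5 any point beyond 1.5 times the interquartile range from the box is shown as an outlier. A minimal sketch of those numbers using only the standard library follows. The function box_stats and the price list are made up for illustration, and the "inclusive" method is chosen on the assumption that it matches the linear interpolation matplotlib relies on.

```python
import statistics


def box_stats(values):
    """Return the quartiles and the 1.5 * IQR fences used to flag outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


# Invented prices: six ordinary listings and one very expensive one.
prices = [5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 120.0]
stats = box_stats(prices)
# Points outside the fences are the ones a box plot would draw as outliers.
outliers = [p for p in prices
            if p < stats["lower_fence"] or p > stats["upper_fence"]]
```

A single very expensive listing, like the 120.0 here, barely moves the median but stretches a bar chart's axis, which is why the box plot is a useful complement to the ranking charts.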

 

Word cloud:

import numpy as np
import wordcloud as wc
from PIL import Image
import matplotlib.pyplot as plt

# The mask image defines the shape of the cloud
mask = np.array(Image.open(r"C:\Users\X0iaoyan\Downloads\111.jpg"))
word_cloud = wc.WordCloud(
    width=1000,  # canvas width (ignored when a mask is given)
    height=1000,  # canvas height (ignored when a mask is given)
    mask=mask,
    background_color='black',  # background color of the image
    font_path='msyhbd.ttc',  # a locally installed font that supports Chinese
    max_font_size=400,  # upper bound on the font size
    random_state=50,  # fixed seed so the layout is reproducible
)
# Join all titles into one string for the word cloud
text = " ".join(xs["bookname"].astype(str))
word_cloud.generate(text)
plt.imshow(word_cloud)
plt.axis("off")  # hide the axes around the image
plt.show()
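
A word cloud sizes each token by how often it occurs in the joined text. As a rough stand-in for that idea, the sketch below counts whitespace-separated tokens with the standard library. It is not what the wordcloud library does internally, since the library applies its own tokenization and filtering rules. The function token_counts and the sample titles are made up for illustration.

```python
from collections import Counter


def token_counts(titles):
    """Join the titles with spaces and count whitespace-separated tokens."""
    tokens = " ".join(titles).split()
    return Counter(tokens)


# Invented titles; each is a periodical name followed by a year.
titles = ["讀者 1998", "讀者 2001", "故事會 1998"]
counts = token_counts(titles)
# The most frequent tokens are the ones a word cloud would tend to draw largest.
most_common = counts.most_common(2)
```

A title that contains no spaces stays a single token under this scheme, so long titles rarely repeat. Segmenting the titles into words before joining them would probably give a more informative cloud.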

 

Complete visualization code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

# xs holds the scraped listings read from Kongfuzi.csv
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
xs = pd.read_csv(r'D:\Py_project\Kongfuzi.csv', on_bad_lines='skip', encoding='gbk')

# Drop duplicate titles
xs = xs.drop_duplicates('bookname')
# Convert price to a number so sorting is numeric; unparseable values become NaN
xs['price'] = pd.to_numeric(xs['price'], errors='coerce')
# Drop rows with missing values
xs = xs.dropna(axis=0)

# Sort by price, descending
xs.sort_values(by=["price"], inplace=True, ascending=[False])
xs.head(20)

# Price ranking chart
x = xs['bookname'].head(20)
y = xs['price'].head(20)
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese book titles
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly with this font
plt.bar(x, y, alpha=0.2, width=0.4, color='b', lw=3, label="price (bar)")
plt.plot(x, y, '-', color='r', label="price (line)")
plt.xticks(rotation=90)  # rotate after plotting so the rotation applies to the title labels
plt.legend(loc="best")  # legend
plt.title("Price trend")
plt.xlabel("Title")  # x-axis label
plt.ylabel("Price")  # y-axis label
plt.show()

plt.barh(x, y, alpha=0.2, height=0.4, color='g', label="price", lw=3)
plt.title("Price by title (horizontal bars)")
plt.legend(loc="best")  # legend
plt.xlabel("Price")  # x-axis label
plt.ylabel("Title")  # y-axis label
plt.show()

# Scatter plot
plt.scatter(x, y, color='gray', marker='o', s=40, alpha=0.5)
plt.xticks(rotation=90)
plt.title("Price scatter plot")
plt.xlabel("Title")  # x-axis label (the x values are book titles)
plt.ylabel("Price")  # y-axis label
plt.show()

plt.boxplot(y)
plt.title("Price box plot")
plt.show()

# Word cloud; the mask image defines the shape of the cloud
mask = np.array(Image.open(r"C:\Users\X0iaoyan\Downloads\111.jpg"))
word_cloud = wc.WordCloud(
    width=1000,  # canvas width (ignored when a mask is given)
    height=1000,  # canvas height (ignored when a mask is given)
    mask=mask,
    background_color='black',  # background color of the image
    font_path='msyhbd.ttc',  # a locally installed font that supports Chinese
    max_font_size=400,  # upper bound on the font size
    random_state=50,  # fixed seed so the layout is reproducible
)
# Join all titles into one string for the word cloud
text = " ".join(xs["bookname"].astype(str))
word_cloud.generate(text)
plt.imshow(word_cloud)
plt.axis("off")  # hide the axes around the image
plt.show()

 

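V. Summary

1. What conclusions can be drawn from the analysis and visualization of the data? Were the expected goals met?
The analysis met expectations: the charts show the price trend across the listings.
2. What did I gain from this project, and what should be improved?
I learned a great deal about filtering data during processing, which in plain terms means performing type conversions so that the data ends up in the form I need. It was a very rewarding exercise. The main thing to improve is that I am still slow at writing programs and lack programming experience.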

 

