Scraping Douban Books with a Python Crawler

Reposted article; originally published 2018-12-28 16:15.

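I. Preparation

Tools: Windows 10 + Python 3.6

Target: the content inside the red box in the figure.

Principle: anything that is visible in the page source can be scraped.

Output: CSV, then converted to Excel.

II. Detailed steps

Here is the full code first: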

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd


def gethtml(url):
    """Fetch a page and return its text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


def getcontent(url):
    html = gethtml(url)
    if html is None:
        print("Failed to get html!")
        return
    soup = BeautifulSoup(html, "html.parser")
    # print(soup.prettify())
    div = soup.find("div", class_="indent")
    tables = div.find_all("table")

    price = []
    date = []
    nationality = []  # raw regex matches, one sublist per book
    nation = []       # flattened nationality
    bookname = []
    link = []
    score = []
    comment = []
    people = []       # raw regex matches, one sublist per book
    peo = []          # flattened number of ratings
    author = []
    for table in tables:
        bookname.append(table.find_all("a")[1]['title'])   # bookname
        link.append(table.find_all("a")[1]['href'])    # link
        score.append(table.find("span", class_="rating_nums").string)   # score
        comment.append(table.find_all("span")[-1].string)   # one-line comment

        people_info = table.find_all("span")[-2].text
        people.append(re.findall(r'\d+', people_info))  # number of ratings; note that re.findall returns a list, so people becomes a list of lists

        navistr = table.find("p").string   # nationality, author, translator, press, date, price
        infos = str(navistr.split("/"))    # string form of the list produced by the split
        infostr = str(navistr)             # the unsplit string
        s = infostr.split("/")
        if re.findall(r'\[', s[0]):  # if the first field has a "[nationality]" marker, keep only the author part
            w = re.findall(r'\s\D+', s[0])
            author.append(w[0])
        else:
            author.append(s[0])

        # Pull the remaining fields (price, date, nationality) out of the info line
        price_info = re.findall(r'\d+\.\d+', infos)
        price.append(price_info[0])   # price
        date.append(s[-2])  # publication date
        nationality_info = re.findall(r'[[](\D)[]]', infos)
        nationality.append(nationality_info)   # nationality; note that each entry is itself a list
    for i in nationality:
        if len(i) == 1:
            nation.append(i[0])
        else:
            nation.append("中")

    for i in people:
        if len(i) == 1:
            peo.append(i[0])

    print(bookname)
    print(author)
    print(nation)
    print(score)
    print(peo)
    print(date)
    print(price)
    print(link)

    # The dict keys become the column names in the CSV:
    # title, author, nationality, rating, number of ratings, publication date, price, link
    dataframe = pd.DataFrame({'書名': bookname, '作者': author, '國籍': nation, '評分': score,
                              '評分人數': peo, '出版時間': date, '價格': price, '鏈接': link})

    # Save the DataFrame as CSV; index controls whether row labels are written (default True)
    dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig', sep=',')


if __name__ == '__main__':
    url = "https://book.douban.com/top250?start=0"   # to scrape further pages, the code has to be changed
    getcontent(url)

1. Fetching the page.

We use the following building blocks:

import requests
import re
from bs4 import BeautifulSoup


def gethtml(url):
    """Fetch a page and return its text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


def getcontent(url):
    html = gethtml(url)
    if html is None:
        print("Failed to get html!")
        return
    bsObj = BeautifulSoup(html, "html.parser")


if __name__ == '__main__':
    url = "https://book.douban.com/top250?icn=index-book250-all"
    getcontent(url)

The information we want can now be extracted from bsObj (the same object is named soup in the full code).
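The fetch step is the first place a scraper can fail, so it helps if the caller can tell a failed request apart from real HTML. Below is a minimal sketch of such a helper. The name `fetch_html` and the `HEADERS` value are assumptions that do not appear in the original code; the header is included only because some sites reject the default requests User-Agent, and whether Douban needs it is not confirmed here.

```python
import requests

# Assumed header value: any ordinary browser-like User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_html(url, timeout=30):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=timeout)
        r.raise_for_status()                  # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding      # guess the encoding from the body
        return r.text
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return None
```

A caller can then test `if html is None` and stop before building the BeautifulSoup object.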

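2. Extracting the details.

All of the information sits inside a single div, which contains 25 tables. Each table is an independent unit of information, so we only need to build the extraction logic for one table (provided that logic is robust enough to work for every table). Since the parent div has 25 child table nodes, they are extracted like this: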

    div = soup.find("div",class_="indent")
    tables = div.find_all("table")

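The book title can be read directly from the title attribute of the a node (the raw markup really is this ugly, but that does not matter):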

<a href="https://book.douban.com/subject/1770782/" onclick="&quot;moreurl(this,{i:'0'})&quot;" title="追風箏的人">
                追風箏的人

                
              </a>

It is extracted with the following code:

bookname.append(table.find_all("a")[1]['title'])   #bookname

The link, score and one-line comment are extracted in a similar way, so I won't go through them.
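For reference, the similar fields can be tried out on a small fragment without touching the network. The sketch below uses a hand-written, simplified table modeled on the markup quoted above. The exact nesting, the `nbg` and `inq` class names, the score and the quote text are illustrative assumptions, not data copied from the live page.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written fragment (assumption); values are illustrative only.
SAMPLE = """
<table>
  <tr>
    <td><a class="nbg" href="https://book.douban.com/subject/1770782/"><img src="cover.jpg"></a></td>
    <td>
      <a href="https://book.douban.com/subject/1770782/" title="追風箏的人">追風箏的人</a>
      <p class="pl">[美] 卡勒德·胡賽尼 / 李繼宏 / 上海人民出版社 / 2006-5 / 29.00元</p>
      <span class="rating_nums">8.9</span>
      <span class="pl">(13456人評價)</span>
      <span class="inq">sample one-line quote</span>
    </td>
  </tr>
</table>
"""

table = BeautifulSoup(SAMPLE, "html.parser").find("table")
anchors = table.find_all("a")    # [cover link, title link]
spans = table.find_all("span")   # [score, rating count, quote]

title = anchors[1]["title"]                                # title attribute of the second link
link = anchors[1]["href"]                                  # href of the same link
score = table.find("span", class_="rating_nums").string   # rating text
people_info = spans[-2].text                               # second-to-last span: rating count
quote = spans[-1].string                                   # last span: one-line comment

print(title, link, score, people_info, quote)
```

Indexing spans from the end, as the original code does, only works while every table ends with the same two spans; a table without a quote would shift the positions.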

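The number of ratings is extracted with a regular expression:

people.append(re.findall(r'\d+', people_info))  # number of ratings; note that re.findall returns a list, so people becomes a list of lists

Here people_info is a string such as "13456人評價" ("13456 ratings").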
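Because re.findall always returns a list, every entry appended this way is itself a list. A minimal sketch of an alternative that takes a single match with re.search and so yields one value per book; the helper name `rating_count` is hypothetical.

```python
import re


def rating_count(people_info):
    """Return the first run of digits as an int, or None if there is none."""
    match = re.search(r"\d+", people_info)
    return int(match.group()) if match else None


print(rating_count("(13456人評價)"))    # 13456
print(rating_count("no digits here"))   # None
```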
Now for the remaining information:

<p class="pl">[美] 卡勒德·胡賽尼 / 李繼宏 / 上海人民出版社 / 2006-5 / 29.00元</p>

The nationality is wrapped in square brackets "[ ]". How do we strip them? The first line below gives the answer.

        nationality_info = re.findall(r'[[](\D)[]]', infos)
        nationality.append(nationality_info)   # nationality; note that each entry is itself a list
    for i in nationality:
        if len(i) == 1:
            nation.append(i[0])
        else:
            nation.append("中")

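Books that carry a nationality marker are covered by the regex. The ones without a marker all turned out to be Chinese, so a blank nationality is filled in as "中" (China):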

    for i in nationality:
        if len(i) == 1:
            nation.append(i[0])
        else:
            nation.append("中")
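The whole info line can also be parsed in one place. The sketch below is one possible way to do it, not the original method. The helper name `parse_info` and the returned keys are hypothetical, and the second sample line is made up. Compared with the pattern above, `.+?` also accepts markers longer than one character, and the price pattern accepts prices without a decimal part.

```python
import re


def parse_info(info):
    """Split 'author / ... / press / date / price' into a dict (hypothetical helper)."""
    parts = [p.strip() for p in info.split("/")]
    first = parts[0]

    # Nationality: text inside a leading [...] marker, otherwise default to "中".
    m = re.match(r"\[(.+?)\]\s*(.*)", first)
    if m:
        nation, author = m.group(1), m.group(2)
    else:
        nation, author = "中", first

    # Price: digits with an optional decimal part, taken from the last field.
    price_match = re.search(r"\d+(?:\.\d+)?", parts[-1])
    price = price_match.group() if price_match else ""

    # Date: the second-to-last field, as in the original code.
    date = parts[-2] if len(parts) >= 2 else ""
    return {"author": author, "nation": nation, "date": date, "price": price}


print(parse_info("[美] 卡勒德·胡賽尼 / 李繼宏 / 上海人民出版社 / 2006-5 / 29.00元"))
print(parse_info("某作者 / 某出版社 / 2000-1 / 20元"))   # made-up line without a marker
```

Returning an empty string instead of raising keeps one value per book even when a field is missing.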

The list-inside-a-list problem is also easy to solve:

    for i in people:
        if len(i) == 1:
            peo.append(i[0])

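A length of 1 means the sublist is not empty, so we index into it to take the actual value, which turns people into a flat list with no sublists. Note that a sublist of any other length is skipped, so peo can end up shorter than the other lists.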
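If any sublist is empty, skipping it leaves the flattened list shorter than the other columns, and pandas refuses to build a DataFrame from lists of unequal length. A small sketch of a flattening step that keeps one entry per book; the helper name `flatten` and its `placeholder` parameter are hypothetical.

```python
def flatten(matches, placeholder=""):
    """Take the first item of each sublist, or a placeholder if it is empty."""
    return [m[0] if m else placeholder for m in matches]


people = [["13456"], [], ["982"]]   # made-up sample with one empty match
print(flatten(people))              # ['13456', '', '982']
```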

The printed result is shown in the figure below:

That is essentially what we want.

Then write it to CSV:

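    # Column names: title, author, nationality, rating, number of ratings, publication date, price, link
    dataframe = pd.DataFrame({'書名': bookname, '作者': author,'國籍': nation,'評分': score,'評分人數': peo,'出版時間': date,'價格': price,'鏈接': link,})

    # Save the DataFrame as CSV; index controls whether row labels are written (default True)
    dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig',sep=',')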

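Note: without encoding='utf-8-sig' the Chinese text comes out garbled when the CSV is opened in Excel, so it has to be added here. Other approaches work as well.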
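The difference between the two encodings is small: utf-8-sig writes a three-byte byte order mark (BOM) at the start of the output, which spreadsheet programs can use to recognize the file as UTF-8. A short illustration using only the standard library; the sample header text is arbitrary.

```python
import codecs

text = "書名,作者\n"

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# utf-8-sig prepends the 3-byte BOM; the remaining bytes are identical.
print(with_bom[:3] == codecs.BOM_UTF8)   # True
print(with_bom[3:] == plain)             # True
```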

Finally, pagination. I did not make the extraction robust enough, so it keeps failing on the later pages, but here is the method anyway:

    for i in range(10):
        url = "https://book.douban.com/top250?start=" + str(i*25)
        getcontent(url)

Note that str() is required, because an int cannot be concatenated to a string directly.
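One thing to watch with this loop: getcontent writes to the same CSV path on every call, so each page overwrites the previous one. A sketch of a loop that collects the rows from all pages first, so they can be written once at the end. The names `page_urls`, `scrape_all` and `parse_page` are hypothetical, and `parse_page(url)` is assumed to return a list of rows for one page.

```python
import time

BASE = "https://book.douban.com/top250?start="


def page_urls(pages=10, per_page=25):
    """Build the URL of every page: start=0, 25, 50, ..."""
    return [f"{BASE}{i * per_page}" for i in range(pages)]


def scrape_all(parse_page, delay=1.0):
    """Collect rows from every page; parse_page(url) is assumed to return a list of rows."""
    rows = []
    for url in page_urls():
        rows.extend(parse_page(url))
        time.sleep(delay)   # pause between requests to go easy on the server
    return rows


print(page_urls()[:2])
```

The f-string formats the integer itself, so no explicit str() call is needed in this version.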

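Result:

The column order in this result actually differs from the order in which I passed the columns when writing the CSV. I will look into the reason later.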
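One possible explanation, offered as an assumption since the pandas version is not stated, is that older pandas releases sorted the dict keys when building a DataFrame. Whatever the cause, the order can be pinned by passing `columns=` explicitly. A minimal sketch with a made-up single-row dict:

```python
import pandas as pd

# The dict is deliberately written in a different order from the one we want.
data = {"國籍": ["美"], "書名": ["追風箏的人"], "作者": ["卡勒德·胡賽尼"]}
order = ["書名", "作者", "國籍"]

# columns= fixes the column order regardless of how the dict is ordered.
df = pd.DataFrame(data, columns=order)
print(list(df.columns))
```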

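III. Summary

Be bold but careful. Care matters most here: details that are not examined thoroughly at the start lead to a lot of rework later.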
