The semester's first little crawler: let's see what's worth reading this term


School's back in session, so let's see what good books Douban has to offer.

First, of course, we pay the page a good honest visit.

The site URL is https://book.douban.com/top250?start=0

What we need is the information next to each cover image, so let's send the crawler over. Go, little crawler!!!

# -*- coding:UTF-8 -*-
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent; sites such as Douban tend to reject the default requests one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
resp = requests.get('https://book.douban.com/top250?start=0', headers=headers)
# print(resp.text)
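Before parsing anything, it's worth a quick sanity check that the request actually succeeded. A minimal sketch, using the resp from above:

print(resp.status_code)   # expect 200; Douban is known to answer 418 when it dislikes the User-Agent
print(resp.encoding)      # typically utf-8 for this page
print(len(resp.text))     # a non-trivial length means we really got HTML back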

Good, we've got the source code. Now let's see where the information we need lives.

There it is. Let's parse it with BeautifulSoup! (The 'lxml' parser requires the lxml package to be installed; the built-in 'html.parser' works as a fallback.)

soup = BeautifulSoup(resp.text, 'lxml')


alldiv = soup.find_all('div', class_='pl2')
# print(alldiv)
'''
for a in alldiv:
    names = a.find('a')['title']
    print('find_all():', names)
The loop above can be simplified to the two lines below
'''
names = [a.find('a')['title'] for a in alldiv]
print(names)

OK, the rest of the information is found the same way.

allp = soup.find_all('p', class_="pl")            # author / publisher / price line
authors = [p.get_text() for p in allp]
print(authors)

starspan = soup.find_all('span', class_="rating_nums")   # rating
scores = [s.get_text() for s in starspan]
print(scores)

sumspan = soup.find_all('span', class_="inq")     # one-line blurb
sums = [i.get_text() for i in sumspan]
print(sums)
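One caveat before stitching these lists together: zip stops at the shortest list, so if any book lacks a blurb the pairings quietly drift out of step (the addendum at the end deals with exactly this). A quick check:

# zip() silently truncates to the shortest input, so confirm the lists line up
print(len(names), len(authors), len(scores), len(sums))   # ideally four equal numbers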

Then we stitch the fields of each book together into one record.

entries = []
for name, author, score, summary in zip(names, authors, scores, sums):
    name = 'Title: ' + str(name) + '\n'
    author = 'Author: ' + str(author) + '\n'
    score = 'Rating: ' + str(score) + '\n'
    summary = 'Blurb: ' + str(summary) + '\n'
    entries.append(name + author + score + summary)

Save it to a file!

filename = 'douban_top250_books.txt'
with open(filename, 'w', encoding='utf-8') as f:
    for data in entries:
        f.write(data + '==========================' + '\n')
    print('OK')

And with that, we've scraped the first page.

Now let's look at the URLs for the other pages.

Only the final number changes, so it looks like we just need to vary that parameter in the request URL.

base_url = 'https://book.douban.com/top250?start='
urllist = []
for page in range(0, 250, 25):    # 10 pages of 25 books each
    allurl = base_url + str(page)
    urllist.append(allurl)
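A quick peek confirms the list covers the whole Top 250:

print(len(urllist))    # 10
print(urllist[0])      # https://book.douban.com/top250?start=0
print(urllist[-1])     # https://book.douban.com/top250?start=225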

Now we just need to wrap each step in a function and loop: fetch, parse, flip the page, fetch again.

Alright, let's get to it.

# -*- coding:UTF-8 -*-
import requests
from bs4 import BeautifulSoup
def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    return resp

def html_parse():
    for url in all_page():
        soup = BeautifulSoup(get_html(url), 'lxml')
        alldiv = soup.find_all('div', class_='pl2')
        # print(alldiv)
        '''
        for a in alldiv:
            names = a.find('a')['title']
            print('find_all():', names)
        The loop above can be simplified to the single line below
        '''
        names = [a.find('a')['title'] for a in alldiv]
        # print(names)
        allp = soup.find_all('p', class_="pl")
        authors = [p.get_text() for p in allp]
        # print(authors)
        starspan = soup.find_all('span', class_="rating_nums")
        scores = [s.get_text() for s in starspan]
        # print(scores)
        sumspan = soup.find_all('span', class_="inq")
        sums = [i.get_text() for i in sumspan]
        # print(sums)
        for name, author, score, summary in zip(names, authors, scores, sums):
            name = 'Title: ' + str(name) + '\n'
            author = 'Author: ' + str(author) + '\n'
            score = 'Rating: ' + str(score) + '\n'
            summary = 'Blurb: ' + str(summary) + '\n'
            data = name + author + score + summary
            f.write(data + '==========================' + '\n')

def all_page():
    base_url = 'https://book.douban.com/top250?start='
    urllist = []
    for page in range(0, 250, 25):
        allurl = base_url + str(page)
        urllist.append(allurl)
    return urllist


filename = 'douban_top250_books.txt'
f = open(filename, 'w', encoding='utf-8')
html_parse()
f.close()
print('OK')

And the result... an error. What?! (BeautifulSoup chokes with something like TypeError: object of type 'Response' has no len().)

Let's tweak the code a bit to narrow it down.

# -*- coding:UTF-8 -*-
import requests
from bs4 import BeautifulSoup


def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    print(resp.status_code)   # check that every page really answers 200
    return resp

def html_parse(resp):
    # for url in all_page():
    #     get_html(url)
    #     print(resp)
    soup = BeautifulSoup(resp, 'lxml')
    alldiv = soup.find_all('div', class_='pl2')
    names = [a.find('a')['title'] for a in alldiv]
    allp = soup.find_all('p', class_="pl")
    authors = [p.get_text() for p in allp]
    starspan = soup.find_all('span', class_="rating_nums")
    scores = [s.get_text() for s in starspan]
    sumspan = soup.find_all('span', class_="inq")
    sums = [i.get_text() for i in sumspan]
    for name, author, score, summary in zip(names, authors, scores, sums):
        name = 'Title: ' + str(name) + '\n'
        author = 'Author: ' + str(author) + '\n'
        score = 'Rating: ' + str(score) + '\n'
        summary = 'Blurb: ' + str(summary) + '\n'
        data = name + author + score + summary
        f.write(data + '==========================' + '\n')

def all_page():
    base_url = 'https://book.douban.com/top250?start='
    urllist = []
    for page in range(0, 250, 25):
        allurl = base_url + str(page)
        urllist.append(allurl)
    return urllist


filename = 'douban_top250_books.txt'
# f = open(filename, 'w', encoding='utf-8')
n = 0
for url in all_page():
    n = n + 1
    print(n)        # count off the ten pages as they are requested
    get_html(url)
# html_parse(resp)
# f.close()
print('OK')

The page requests all go through fine (every status code is 200), so the error must be happening where the source code gets handed to the parsing function?

filename = 'douban_top250_books.txt'
# f = open(filename, 'w', encoding='utf-8')
n = 0
resps = []
for url in all_page():
    n = n + 1
    print(n)
    resps.append(get_html(url))   # collect whatever get_html returns

print(resps)
# for response in resps:
#     html_parse(response)
# f.close()
print('OK')

And the output is... a list of <Response [200]> objects!!!

Oh, I see!!! I forgot to add .text, so BeautifulSoup was being fed Response objects instead of the page source.

It's this one line!!!
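Concretely, the fetch has to end in .text, so that what gets stored is the decoded HTML string rather than the Response object:

# before: stores a requests.Response object, which BeautifulSoup cannot parse
resp = requests.get(url, headers=headers)
# after: stores the decoded page source as a string
resp = requests.get(url, headers=headers).text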

Add the file handling back in, and it finally works!! Note that I hoisted the second loop out to the top level, which should make the flow easier to follow, haha.

# -*- coding:UTF-8 -*-
import requests
from bs4 import BeautifulSoup


def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
    }
    resp = requests.get(url, headers=headers).text   # .text gives the HTML source as a string
    resps.append(resp)
    # return resp

def html_parse(response):
    soup = BeautifulSoup(response, 'lxml')
    alldiv = soup.find_all('div', class_='pl2')
    names = [a.find('a')['title'] for a in alldiv]
    allp = soup.find_all('p', class_="pl")
    authors = [p.get_text() for p in allp]
    starspan = soup.find_all('span', class_="rating_nums")
    scores = [s.get_text() for s in starspan]
    sumspan = soup.find_all('span', class_="inq")
    sums = [i.get_text() for i in sumspan]
    for name, author, score, summary in zip(names, authors, scores, sums):
        name = 'Title: ' + str(name) + '\n'
        author = 'Author: ' + str(author) + '\n'
        score = 'Rating: ' + str(score) + '\n'
        summary = 'Blurb: ' + str(summary) + '\n'
        data = name + author + score + summary
        f.write(data + '==========================' + '\n')

def all_page():
    base_url = 'https://book.douban.com/top250?start='
    urllist = []
    for page in range(0, 250, 25):
        allurl = base_url + str(page)
        urllist.append(allurl)
    return urllist


filename = 'douban_top250_books.txt'
f = open(filename, 'w', encoding='utf-8')
resps = []
for url in all_page():       # fetch all ten pages first...
    get_html(url)
for response in resps:       # ...then parse each page and write its entries
    html_parse(response)
f.close()
print('OK')
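A design note: passing data through the globals resps and f works, but the flow is easier to follow if get_html simply returns the text, as its commented-out return resp already hints. A minimal sketch of that variant, keeping html_parse and all_page exactly as above:

# Sketch: explicit data flow instead of the global resps list
def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 ...'}   # same UA string as above, shortened here
    return requests.get(url, headers=headers).text

with open('douban_top250_books.txt', 'w', encoding='utf-8') as f:
    for url in all_page():
        html_parse(get_html(url))   # html_parse still writes through f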

The result: the script runs clean through all ten pages and prints OK.

Now let's open the output file and take a look.
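Each saved entry follows this shape (angle brackets mark placeholder values):

Title: <book title>
Author: <author / translator / publisher / year / price>
Rating: <score>
Blurb: <one-line recommendation>
==========================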

Nice! Hahaha.

Addendum: I've seen it mentioned in some related scraping tutorials that on Douban the titles and blurbs sometimes come out mismatched.

That happens because some books are missing their one-line blurb, so the inq list comes up short.

To handle it, you can swap in the following code:

# Walk each table cell, so that a missing blurb becomes an empty string
sum_div = soup.select('tr.item > td:nth-of-type(2)')
sums = []
for td in sum_div:
    sumspan = td.find('span', class_='inq')
    summary = sumspan.get_text() if sumspan else ''
    sums.append(summary)
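This keeps sums the same length as names, with an empty string wherever the blurb is missing, so zip pairs every title with the right blurb.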

Alright, that's it for today's crawl. You did well. Come back home, little crawler!