Scraping JoJo's Bizarre Adventure Part 7, Steel Ball Run, from manhuadb.com (漫畫DB)

Reposted article. 2019-10-30 22:32. Tags: web scraping / Beautiful Soup / file storage

SBR is my favorite part of the JoJo series, so today I am scraping the manga to local disk to read at my own pace later.

import os
import re
import time
import requests
from requests import codes
from bs4 import BeautifulSoup
from requests import RequestException

def get_page(url):
    # Fetch a page and return its HTML text, or None on any failure
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                   + '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
        # timeout keeps a stalled connection from hanging the whole run
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def get_pagesNumber(text):
    # The chapter's total page count is stored in the data-total attribute
    soup = BeautifulSoup(text, 'lxml')
    pagesNumber = soup.find(name='div', class_="d-none vg-r-data")
    return pagesNumber.attrs['data-total']
    
def parse_page(text):
    soup = BeautifulSoup(text, 'lxml')
    url = soup.find(name='img', class_="img-fluid show-pic")
    chapter = soup.find(name='h2', class_="h4 text-center")
    page = soup.find(name='span', class_="c_nav_page")
    yield {
        'url': url['src'],
        'chapter': chapter.get_text(),
        'page': page.get_text()
    }
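# return ends the function as soon as it hands back a result,
# whereas yield turns the function into a generator: each time it produces a value
# the function is suspended, and it resumes to produce the next value when asked again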
    
    
def save_image(item):
    img_path = 'SBR' + os.path.sep + item.get('chapter')  # os.path.sep is the path separator (backslash on Windows)
    if not os.path.exists(img_path):
        os.makedirs(img_path)
    try:
        resp = requests.get(item.get('url'))
        if codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=item.get('page'), file_suffix='jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except Exception as e:
        print(e)

if __name__ == '__main__':
    for chapter in range(292, 316):  # inspection shows 24 chapters, 292 to 315; color edition: 13283, 13306
        url = 'https://www.manhuadb.com/manhua/147/4_'+str(chapter)+'.html'
        text = get_page(url)
        if text is None:  # skip the chapter if its first page could not be fetched
            continue
        pagesNumber = get_pagesNumber(text)  # total number of pages in the current chapter
        for page in range(1, int(pagesNumber)+1):
            url = 'https://www.manhuadb.com/manhua/147/4_'+str(chapter)+'_'+str(page)+'.html'
            # color edition: url = 'https://www.manhuadb.com/manhua/147/1330_'+str(chapter)+'_'+str(page)+'.html'
            text = get_page(url)
            if text is None:  # skip pages that failed to download
                continue
            for item in parse_page(text):
                save_image(item)
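
The note on yield in the listing can be illustrated with a small standalone sketch. `iter_page_urls` is a hypothetical helper, not part of the original script. It assumes the same URL pattern as the main loop and shows that a generator produces its values lazily, one per request, rather than all at once.

```python
BASE = 'https://www.manhuadb.com/manhua/147/4_'  # URL prefix taken from the main loop


def iter_page_urls(chapter, total_pages):
    """Hypothetical helper: lazily yield the URL of every page in a chapter."""
    for page in range(1, total_pages + 1):
        # Execution pauses here after each value and resumes on the next request
        yield f'{BASE}{chapter}_{page}.html'


urls = iter_page_urls(292, 3)
first = next(urls)        # runs the body only up to the first yield
remaining = list(urls)    # resumes the generator until it is exhausted
print(first)
print(remaining)
```

Because nothing is computed until a value is requested, a loop such as `for url in iter_page_urls(292, 3)` could fetch each page as its URL is produced.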

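The end result is an SBR directory with one folder per chapter, each holding that chapter's pages as .jpg files.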
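
The path handling in save_image can also be sketched in isolation. `build_file_path` is a hypothetical helper rather than part of the original script. It assumes that chapter titles might contain characters that are invalid in file names, replaces them as a precaution, and joins the pieces with `os.path.join` instead of concatenating `os.path.sep` by hand.

```python
import os
import re


def build_file_path(chapter, page, root='SBR', suffix='jpg'):
    """Hypothetical helper: return root/<chapter>/<page>.<suffix> as a local path."""
    # Replace characters that Windows forbids in file names (an assumed precaution)
    safe_chapter = re.sub(r'[\\/:*?"<>|]', '_', chapter).strip()
    # os.path.join inserts the correct separator for the current platform
    return os.path.join(root, safe_chapter, f'{page}.{suffix}')


path = build_file_path('Chapter 1: Test?', '3')
print(path)
```

The directory part of the result can then be passed to `os.makedirs(..., exist_ok=True)`, which avoids the separate `os.path.exists` check.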
