[python3] Scraping the One Piece manga translated by Shuhui (鼠繪漢化)

Reposted article. Originally posted 2018-06-29 16:08.

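Special note:

Shuhui changed its API a while ago, so the earlier code no longer works.

I happened to be learning scrapy recently, so I rewrote the crawler. The project is on GitHub: https://github.com/TurboWay/ishuhui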

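1. Motivation:

I love the One Piece manga, and Shuhui's Chinese translation is without doubt the best and the fastest to update. Because of copyright pressure, however, the earlier One Piece chapters can no longer be read on the Shuhui website. The key point is that I found the API still works, so I wrote a crawler to download every One Piece chapter that Shuhui translated. I am sharing the source code for fellow fans who need it. Please avoid crawling during peak hours; after all, we all love Shuhui.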

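2. How to use:

If you have a Python 3 environment installed, copy the source code into a .py file and run it. The script imports requests and PIL (Pillow), so both packages need to be installed first.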
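Since the script depends on two third-party packages, a quick check before running it can save a failed start. The following is only a sketch: the helper name missing_packages and the module-to-package mapping are my own illustration, not part of the original script.

```python
import importlib.util

# Module name used in "import ..." mapped to the name used with pip.
# This mapping is an assumption based on the imports in the script.
REQUIRED = {"requests": "requests", "PIL": "Pillow"}


def missing_packages(required=REQUIRED):
    """Return the pip names of required modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All dependencies found.")
```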

3. The code:

# -*- coding: utf-8 -*-
import ast
import json
import logging
import os
import shutil
import sys
import time
from io import BytesIO

import requests
from PIL import Image

logger = logging.getLogger('log')
logger.setLevel(logging.DEBUG)

# log format
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# console log
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
ch.setFormatter(formatter)
logger.addHandler(ch)

# Request an API URL and return the "data" field, or None on failure
def get_url(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
                             ' Chrome/62.0.3202.75 Safari/537.36'}
    try:
        response = requests.get(url=url, headers=headers, timeout=5)
        js = json.loads(response.text)
    except (requests.RequestException, ValueError) as e:
        # network error, timeout, or a response body that is not JSON
        logger.warning("Request failed: {0}".format(e))
        return None
    if js.get("errNo") == 0:
        return js["data"]
    logger.warning("Request failed: {0}".format(js))
    return None

# Replace characters that are not allowed in file names
def clean(text):
    kws = ['/', '\\', ':', '*', '?', '"', '<', '>', '|', '？']
    for kw in kws:
        text = text.replace(kw, '.')
    return text

# Create a folder; if istruncate is True, delete an existing folder first
def makefile(path, istruncate):
    if os.path.exists(path) and istruncate:
        shutil.rmtree(path)
        os.mkdir(path)
    elif not os.path.exists(path):
        os.mkdir(path)

# Download one image and save it as picname; return True on success
def save_pic(img_src, picname):
    try:
        response = requests.get(img_src, timeout=10)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
        image = image.convert('RGB')
        image.save(picname)
        logger.info("{0} downloaded".format(picname))
        flag = True
    except Exception as e:
        logger.warning("{0} download failed: {1}".format(picname, e))
        flag = False
    return flag

# Download an image, retrying up to 6 more times if the first attempt fails
def resave_pic(img_src, picname):
    count, flag = 0, save_pic(img_src, picname)
    while not flag:
        flag = save_pic(img_src, picname)
        count += 1
        if count > 5:
            break

# Download the chapter with the given post id and return the id of the previous chapter
def get_data(path, nextid):
    url = 'http://hhzapi.ishuhui.com/cartoon/post/ver/76906890/id/{0}.json'.format(nextid)
    data = get_url(url)
    if data:
        server = 'http://pic04.ishuhui.com/'
        source, post_id, title, book, number = data['source'], data['id'], data['title'], data['book_text'], data['number']
        # content_img is a string holding a dict literal {file name: relative URL};
        # literal_eval parses it without running arbitrary code the way eval would
        content_img = ast.literal_eval(data['content_img']) if data['content_img'] else {}
        if source == 1:  # translated by Shuhui
            makefile(os.path.join(path, book), False)
            title = clean(title)
            # folder name: "<book> 第 <number> 話 <title>" ("第 N 話" means "chapter N")
            filepath = os.path.join(path, book, '{0} 第 {1} 話 {2}'.format(book, number, title))
            makefile(filepath, True)  # create the chapter folder
            if content_img:  # download the images
                for img, imgurl in content_img.items():
                    imgurl = server + imgurl.replace('/upload/', '')
                    picname = os.path.join(filepath, img)
                    resave_pic(imgurl, picname)
            logger.info("ID:{2} chapter {0} {1} downloaded".format(number, title, post_id))
            prev = data['prev']
            if prev:
                return prev['id']
            elif nextid == 900:  # post 900 has no link to the previous post
                return 899

if __name__ == "__main__":
    path = sys.path[0]  # folder that contains this script
    nextid = get_data(path, 10881)
    while nextid:
        nextid = get_data(path, nextid)
        time.sleep(3)  # pause between chapters to go easy on the server
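The main loop walks the posts like a linked list: each API response carries a prev entry pointing at the previous chapter, and post 900 needs a hard-coded hop to 899. A minimal sketch of that traversal, separated from the network code, might look like the following. The names walk_prev_chain, fetch, fallback and max_posts are hypothetical, and the guard against revisiting a post is an addition that the original loop does not have.

```python
def walk_prev_chain(start_id, fetch, fallback=None, max_posts=1000):
    """Yield post ids, starting at start_id and following each post's 'prev' link.

    fetch(post_id) should return a dict with a 'prev' key, or None on failure.
    fallback maps a post id with no 'prev' link to the id to visit next
    (for example {900: 899}).
    """
    fallback = fallback or {}
    seen = set()
    current = start_id
    while current is not None and current not in seen and len(seen) < max_posts:
        seen.add(current)
        yield current
        post = fetch(current)
        if not post:
            break  # request failed: stop, as the original loop does
        prev = post.get("prev")
        if prev:
            current = prev["id"]
        else:
            current = fallback.get(current)  # None ends the walk


if __name__ == "__main__":
    # Fake API data for illustration only: post 2 has no 'prev' link.
    fake_posts = {3: {"prev": {"id": 2}}, 2: {"prev": None}, 1: {"prev": None}}
    print(list(walk_prev_chain(3, fake_posts.get, fallback={2: 1})))  # [3, 2, 1]
```

Passing the fetch function in as a parameter keeps the walk testable without touching the real API.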

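4. Results:

Chapter 598 (2年后, "Two Years Later") through chapter 908 (世界會議開幕, "The World Conference Begins"): 309 chapters, 3.22 GB in total. Chapters 680 and 681 are missing; I scanned the API and could not find them there either.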
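The figures are consistent: chapters 598 through 908 span 311 numbers, and 311 minus the 2 missing chapters gives 309. A rough way to find such gaps after a crawl is to read the chapter numbers back out of the folder names, which the script writes as "<book> 第 <number> 話 <title>". The sketch below assumes that naming; the function name missing_chapters and the sample folder names are made up for illustration.

```python
import re

# Matches the chapter number in folder names such as "海賊王 第 598 話 ..."
CHAPTER_RE = re.compile(r"第\s*(\d+)\s*話")


def missing_chapters(folder_names, first, last):
    """Return the chapter numbers in [first, last] with no matching folder."""
    found = set()
    for name in folder_names:
        match = CHAPTER_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(first, last + 1) if n not in found]


if __name__ == "__main__":
    # Hypothetical folder names; in practice they could come from os.listdir().
    names = ["海賊王 第 598 話 a", "海賊王 第 599 話 b", "海賊王 第 601 話 c"]
    print(missing_chapters(names, 598, 601))  # [600]
```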
