I've recently been studying a website to brush up on my CG fundamentals. A few days ago the site suddenly became unreachable; a classmate recommended the Wayback Machine, which periodically caches pages from the web, but many of the images there failed to load, which was frustrating. So once the site came back online, I decided to crawl the pages and save them as PDFs.
Two takeaways:
1. When downloading a page, also download its images, CSS, and other assets, and rewrite the paths in the HTML accordingly.
2. beautifulsoup and wkhtmltopdf are powerful and pleasant to work with.
Preparation:
0. Install Python.
1. Install pip.
Download the pip installer get-pip.py from https://pip.pypa.io/en/latest/installing.html#id7
Then run get-pip.py from the directory it was downloaded to.
Once it finishes, pip.exe appears in the Scripts subdirectory of the Python installation directory.
To upgrade later, use python -m pip install -U pip
2. Install wkhtmltopdf: a cross-platform HTML-to-PDF conversion tool.
3. Install requests, beautifulsoup, and pdfkit.
pdfkit is a Python wrapper around wkhtmltopdf.
beautifulsoup is used to manipulate HTML content.
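As a quick illustration of how beautifulsoup is used below (a minimal sketch with a made-up HTML snippet; the stdlib "html.parser" backend is used here so no extra parser is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical page snippet, just to show the API.
html = '<html><body><div class="central-column"><a href="/lessons/intro">Intro</a></div></body></html>'

soup = BeautifulSoup(html, "html.parser")

# Select elements by CSS class, then read attributes off them.
div = soup.find_all(class_="central-column")[0]
links = [a['href'] for a in div.find_all('a')]
print(links)  # ['/lessons/intro']
```

find_all(class_=...) and attribute access like a['href'] are the two operations the crawler below relies on.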
Code implementation:
import os
import ssl
import urllib.request
from functools import wraps

from bs4 import BeautifulSoup


def sslwrap(func):
    """Decorator that forces ssl_version=TLSv1 on the wrapped function."""
    @wraps(func)
    def bar(*args, **kw):
        kw['ssl_version'] = ssl.PROTOCOL_TLSv1
        return func(*args, **kw)
    return bar
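Related to the TLS handling above: crawl() below disables certificate verification globally via ssl._create_default_https_context. The effect can be seen on a context object directly (stdlib only):

```python
import ssl

# An unverified context skips both hostname checks and certificate
# validation, which is what ssl._create_unverified_context gives the
# crawler globally. Only do this for sites you trust.
ctx = ssl._create_unverified_context()
print(ctx.check_hostname, ctx.verify_mode)
```

This is what lets urlopen succeed on hosts whose certificate chain the local machine cannot validate.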
def save(url, cls, outputDir, outputFile):
    print("saving " + url)
    response = urllib.request.urlopen(url, timeout=500)
    soup = BeautifulSoup(response, "html5lib")
    # TODO: set css
    # TODO: save imgs
    # save html
    if os.path.exists(outputDir + outputFile):
        os.remove(outputDir + outputFile)
    if cls != "":
        # keep only the first element with the given class
        body = soup.find_all(class_=cls)[0]
        with open(outputDir + outputFile, 'wb') as f:
            f.write(str(body).encode(encoding='utf_8'))
    else:
        with open(outputDir + outputFile, 'wb') as f:
            f.write(str(soup.find_all("html")).encode(encoding='utf_8'))
    print("finish!")
    return soup
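The "save imgs" step in save() is left as a TODO. A minimal sketch of takeaway 1 (rewriting img paths to local copies) could look like this; rewrite_imgs is a hypothetical helper, and the actual download calls are left out:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def rewrite_imgs(soup, base_url):
    """Point every <img> at a local 'imgs/' copy and return the
    (absolute_url, local_path) pairs that still need downloading."""
    downloads = []
    for i, img in enumerate(soup.find_all('img')):
        src = img.get('src')
        if not src:
            continue
        local = "imgs/%d_%s" % (i, src.split('/')[-1])
        downloads.append((urljoin(base_url, src), local))
        img['src'] = local  # rewrite the path inside the HTML tree
    return downloads


soup = BeautifulSoup('<img src="/pics/a.png">', "html.parser")
todo = rewrite_imgs(soup, "https://example.com/lessons/x")
print(todo)  # [('https://example.com/pics/a.png', 'imgs/0_a.png')]
```

Each returned pair can then be fetched with urllib.request.urlretrieve before the modified soup is written to disk, so the saved HTML renders offline.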
def crawl(base, outDir):
    # skip certificate verification for all HTTPS requests
    ssl._create_default_https_context = ssl._create_unverified_context
    heads = save(base + "/index.php?redirect", "central-column", outDir, "/head.html")
    for link in heads.find_all('a'):
        pos = str(link.get('href'))
        if pos.startswith('/lessons'):
            curDir = outDir + pos
            if not os.path.exists(curDir):
                os.makedirs(curDir)
            else:
                print("already exist " + curDir)
                continue
            counter = 1
            while True:
                body = save(base + pos, "", curDir, "/" + str(counter) + ".html")
                counter += 1
                hasNext = False
                # the "next page" link lives in the right-aligned footer cell
                for div in body.find_all("div", class_="footer-prev-next-cell"):
                    if div.get("style") == "text-align: right;":
                        hrefs = div.find_all("a")
                        if len(hrefs) > 0:
                            hasNext = True
                            pos = hrefs[0]['href']
                            print(">>next is at:" + pos)
                        break
                if not hasNext:
                    break
if __name__ == '__main__':
    crawl("https://www.***.com", "E:/Documents/CG/***")
    print("finish")
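With each lesson saved as 1.html, 2.html, ..., pdfkit can merge the pages into one PDF per lesson. One detail worth handling: the files must be sorted numerically, since "10.html" comes before "2.html" in lexicographic order. A sketch (the directory name is hypothetical; pdfkit.from_file accepts a list of input files and requires wkhtmltopdf on PATH):

```python
import os
import re


def pages_in_order(curDir):
    """Return the saved page files of one lesson, sorted numerically."""
    pages = [f for f in os.listdir(curDir) if re.fullmatch(r'\d+\.html', f)]
    pages.sort(key=lambda f: int(f.split('.')[0]))  # 2.html before 10.html
    return [os.path.join(curDir, f) for f in pages]


# Usage (assumes wkhtmltopdf is installed and on PATH):
# import pdfkit
# pdfkit.from_file(pages_in_order("E:/Documents/CG/lessons/foo"), "foo.pdf")
```

The regular expression also filters out head.html, which is the index page rather than lesson content.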