【Python】Converting HTML to PDF with Python

Reposted article, 2020-11-04 12:52

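I actually did this last year but never wrote it up, so here is a brief record.

1. The main tool: wkhtmltopdf

Download page: https://wkhtmltopdf.org/downloads.html

Pick the installer that matches your system. The download is a bit slow, so leave it running in the background.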
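
Once the installer has finished, it can help to confirm that Python is able to find the binary before going further. The sketch below uses only the standard library. The function name `find_wkhtmltopdf` is made up for this note, and it assumes the executable is called `wkhtmltopdf` and sits on the PATH.

```python
import shutil
from typing import Optional


def find_wkhtmltopdf(name: str = "wkhtmltopdf") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf not found on PATH; install it or pass its path explicitly")
    else:
        print("wkhtmltopdf found at", path)
```

If the binary is installed but not on the PATH, its location can be given to pdfkit explicitly with pdfkit.configuration(wkhtmltopdf=...).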

2. Install the Python library

pip install pdfkit
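# Optional: the PyPI package named wkhtmltopdf is a separate wrapper; pdfkit itself only needs the binary from step 1
pip install wkhtmltopdf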

3. Quick verification code

import pdfkit
pdfkit.from_url('http://baidu.com','out.pdf')
pdfkit.from_file('test.html','out1.pdf')
pdfkit.from_string('Hello World!','out2.pdf')

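If the output shows Done and each call returns True, the environment is working.

(Screenshot: the output PDF files)

(Screenshot: an opened PDF)

The source HTML is a dynamic, large-format page; the PDF is a static rendering and comes out smaller.

The files open normally, so the code works. From here you can put your scraping skills to use however you like.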

Lists are also supported

pdfkit.from_url(['google.com', 'yandex.ru', 'engadget.com'], 'out.pdf')
pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')

File objects are supported

with open('file.html') as f:
    pdfkit.from_file(f, 'out.pdf')

Keeping the PDF in a variable for further processing

# Use False instead of output path to save pdf to a variable
pdf = pdfkit.from_url('http://google.com', False)
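
When the output path is False, the returned value can be handled like any other in-memory data. A minimal sketch, assuming the value is a bytes object (as in Python 3) and using a hypothetical helper named `save_pdf_bytes`. The `%PDF-` check relies on the standard PDF file header and is only a basic sanity test.

```python
from pathlib import Path


def save_pdf_bytes(data: bytes, path: str) -> int:
    """Write in-memory PDF data to disk after a basic sanity check.

    Returns the number of bytes written.
    """
    # A PDF file starts with the marker "%PDF-"; anything else is
    # probably an error page or empty output rather than a real PDF.
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not look like a PDF")
    return Path(path).write_bytes(data)
```

With the variable from the snippet above, this would be called as save_pdf_bytes(pdf, 'out.pdf').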

Specifying the PDF format (options)

See https://wkhtmltopdf.org/usage/wkhtmltopdf.txt

options = {
    'page-size': 'Letter',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'encoding': "UTF-8",
    'custom-header' : [
        ('Accept-Encoding', 'gzip')
    ],
    'cookie': [
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ],
    'no-outline': None
}

pdfkit.from_url('http://google.com', 'out.pdf', options=options)
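
To make the option format above more concrete, here is a rough sketch of how such a dictionary could be flattened into wkhtmltopdf command-line arguments. It is an illustration written for this note, not pdfkit's actual implementation, and `options_to_args` is a hypothetical name.

```python
from typing import Any, Dict, List


def options_to_args(options: Dict[str, Any]) -> List[str]:
    """Flatten an options dict into a list of command-line arguments."""
    args: List[str] = []
    for name, value in options.items():
        flag = name if name.startswith("--") else "--" + name
        # Repeatable options (cookie, custom-header, ...) come as a list.
        values = value if isinstance(value, list) else [value]
        for item in values:
            args.append(flag)
            if isinstance(item, tuple):
                # Multi-value option, e.g. ('cookie-name1', 'cookie-value1').
                args.extend(str(part) for part in item)
            elif item is not None and item is not False and item != "":
                # None, False or '' mean a bare flag; anything else is its value.
                args.append(str(item))
    return args


print(options_to_args({
    "page-size": "Letter",
    "cookie": [("cookie-name1", "cookie-value1")],
    "no-outline": None,
}))
```

Seen this way, 'no-outline': None is simply a flag without a value, and each tuple in a list repeats the flag once.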

By default, pdfkit shows all of the wkhtmltopdf output. If you do not want it, set the quiet option:

options = {'quiet': ''}

pdfkit.from_url('google.com', 'out.pdf', options=options)

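You can pass in any HTML markup, with options set through meta tags [say goodbye to annoying ads and tailor the page to your own taste]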

body = """
    <html>
      <head>
        <meta name="pdfkit-page-size" content="Legal"/>
        <meta name="pdfkit-orientation" content="Landscape"/>
      </head>
      Hello World!
      </html>
    """

pdfkit.from_string(body, 'out.pdf') #with --page-size=Legal and --orientation=Landscape
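
The same idea can be wrapped in a small helper that injects the `pdfkit-` meta tags around an HTML fragment. This is only a sketch: `wrap_with_pdfkit_meta` is a hypothetical name, and it assumes the options are simple name/value pairs.

```python
from html import escape
from typing import Dict


def wrap_with_pdfkit_meta(fragment: str, options: Dict[str, str]) -> str:
    """Wrap an HTML fragment in a page whose <head> carries pdfkit options."""
    metas = "\n".join(
        '    <meta name="pdfkit-{}" content="{}"/>'.format(
            escape(name, quote=True), escape(value, quote=True)
        )
        for name, value in options.items()
    )
    return (
        "<html>\n"
        "  <head>\n"
        f"{metas}\n"
        "  </head>\n"
        f"  <body>{fragment}</body>\n"
        "</html>"
    )


page = wrap_with_pdfkit_meta(
    "Hello World!", {"page-size": "Legal", "orientation": "Landscape"}
)
print(page)
```

The resulting string can then be passed to pdfkit.from_string(page, 'out.pdf') in the same way as body above.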

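【Improvement】

Rename the earlier save_file method to save_to_pdf, and have the get_Body method return str(div) directly instead of div.text, so that the HTML markup is kept for rendering. The code is as follows: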

def save_to_pdf(url):
    '''
    Save the article at the given url to a local PDF file.
    get_title, get_Body and author are assumed to be defined in the earlier code.
    :param url: address of the article
    :return: None
    '''
    title=get_title(url)
    body=get_Body(url)
    filename=author+'-'+title+'.pdf'
    # Some characters are not allowed in Windows file names; search online for the full list and replace them
    if '/' in filename:
        filename=filename.replace('/','+')
    if '\\' in filename:
        filename=filename.replace('\\','+')
    print(filename)
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }

    config=pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
    pdfkit.from_string(body,filename,options=options,configuration=config)
    print('PDF saved successfully!')

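【File naming conventions】

With the rise of self-published media, article titles come in every shape imaginable. The one line of code below replaces the characters that are illegal in file names.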

# Filter out characters that are illegal in Windows file names
import re

title='xxxxxxx'

fileName = re.sub(r'[\\/:*?"<>|\r\n]+','-',title)

# Replace illegal characters with '-'. Inside [] the * needs no escaping: there it is a literal character, not a quantifier.
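
The regular expression covers the characters Windows rejects, but a few other cases can still produce an invalid name: trailing dots or spaces, reserved device names such as CON, and very long titles. The sketch below extends the one-liner. The name `safe_filename`, the 150-character limit and the `untitled` fallback are choices made for this example only.

```python
import re

# Device names that Windows reserves regardless of extension.
_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{n}" for prefix in ("COM", "LPT") for n in range(1, 10)
}


def safe_filename(title: str, max_len: int = 150) -> str:
    """Turn an arbitrary title into a name that is valid on Windows."""
    # Same substitution as the one-liner, extended to all control characters.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]+', "-", title)
    # Truncate, then drop surrounding whitespace and trailing dots or spaces.
    name = name[:max_len].strip().rstrip(". ")
    if not name:
        return "untitled"
    # Prefix reserved names so they no longer collide with devices.
    if name.split(".")[0].upper() in _RESERVED:
        name = "_" + name
    return name
```

A call such as safe_filename(title) + '.pdf' could then replace the manual character replacement when building the output file name.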

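From now on, when you come across a good article you can collect it yourself and save it as a PDF. There is no need to worry about the source site deleting it; it only feels safe once it is stored on your own computer.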

【Reference link】

https://blog.csdn.net/xc_zhou/article/details/80952168
