I've recently been studying a website to brush up on my CG fundamentals. A few days ago the site suddenly became unreachable; a classmate recommended the Wayback Machine, which periodically caches pages from the web, but many of the images there failed to load, which was frustrating. So once the site came back online, I decided to crawl the pages and save them as PDFs.
Two takeaways:
1. When downloading a page, also download its images, CSS, and other assets, and rewrite the paths in the HTML accordingly.
2. beautifulsoup and wkhtmltopdf are powerful and pleasant to work with.
Preparation:
0. Install Python.
1. Install pip.
Download the pip installer get-pip.py from https://pip.pypa.io/en/latest/installing.html#id7
Then run get-pip.py from the directory it was downloaded to.
Once it finishes, pip.exe appears in the Scripts subdirectory of the Python installation directory.
To upgrade later, use python -m pip install -U pip
2. Install wkhtmltopdf: a cross-platform HTML-to-PDF conversion tool.
3. Install requests, beautifulsoup, and pdfkit.
pdfkit is a Python wrapper around wkhtmltopdf.
beautifulsoup is used to manipulate HTML content.
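As a quick illustration of how beautifulsoup is used below (a minimal sketch with a made-up HTML snippet; the stdlib "html.parser" backend is used here so no extra parser is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical page snippet, just to show the API.
html = '<html><body><div class="central-column"><a href="/lessons/intro">Intro</a></div></body></html>'

soup = BeautifulSoup(html, "html.parser")

# Select elements by CSS class, then read attributes off them.
div = soup.find_all(class_="central-column")[0]
links = [a['href'] for a in div.find_all('a')]
print(links)  # ['/lessons/intro']
```

find_all(class_=...) and attribute access like a['href'] are the two operations the crawler below relies on.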
Code implementation:
import os
import ssl
import urllib.request
from functools import wraps

from bs4 import BeautifulSoup


def sslwrap(func):
    """Decorator that forces ssl_version=TLSv1 on the wrapped function."""
    @wraps(func)
    def bar(*args, **kw):
        kw['ssl_version'] = ssl.PROTOCOL_TLSv1
        return func(*args, **kw)
    return bar
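Related to the TLS handling above: crawl() below disables certificate verification globally via ssl._create_default_https_context. The effect can be seen on a context object directly (stdlib only):

```python
import ssl

# An unverified context skips both hostname checks and certificate
# validation, which is what ssl._create_unverified_context gives the
# crawler globally. Only do this for sites you trust.
ctx = ssl._create_unverified_context()
print(ctx.check_hostname, ctx.verify_mode)
```

This is what lets urlopen succeed on hosts whose certificate chain the local machine cannot validate.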
def save(url, cls, outputDir, outputFile):
    print("saving " + url)
    response = urllib.request.urlopen(url, timeout=500)
    soup = BeautifulSoup(response, "html5lib")
    # TODO: set css
    # TODO: save imgs
    # save html
    if os.path.exists(outputDir + outputFile):
        os.remove(outputDir + outputFile)
    if cls != "":
        # keep only the first element with the given class
        body = soup.find_all(class_=cls)[0]
        with open(outputDir + outputFile, 'wb') as f:
            f.write(str(body).encode(encoding='utf_8'))
    else:
        with open(outputDir + outputFile, 'wb') as f:
            f.write(str(soup.find_all("html")).encode(encoding='utf_8'))
    print("finish!")
    return soup
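The "save imgs" step in save() is left as a TODO. A minimal sketch of takeaway 1 (rewriting img paths to local copies) could look like this; rewrite_imgs is a hypothetical helper, and the actual download calls are left out:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def rewrite_imgs(soup, base_url):
    """Point every <img> at a local 'imgs/' copy and return the
    (absolute_url, local_path) pairs that still need downloading."""
    downloads = []
    for i, img in enumerate(soup.find_all('img')):
        src = img.get('src')
        if not src:
            continue
        local = "imgs/%d_%s" % (i, src.split('/')[-1])
        downloads.append((urljoin(base_url, src), local))
        img['src'] = local  # rewrite the path inside the HTML tree
    return downloads


soup = BeautifulSoup('<img src="/pics/a.png">', "html.parser")
todo = rewrite_imgs(soup, "https://example.com/lessons/x")
print(todo)  # [('https://example.com/pics/a.png', 'imgs/0_a.png')]
```

Each returned pair can then be fetched with urllib.request.urlretrieve before the modified soup is written to disk, so the saved HTML renders offline.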
def crawl(base, outDir):
    # skip certificate verification for all HTTPS requests
    ssl._create_default_https_context = ssl._create_unverified_context
    heads = save(base + "/index.php?redirect", "central-column", outDir, "/head.html")
    for link in heads.find_all('a'):
        pos = str(link.get('href'))
        if pos.startswith('/lessons'):
            curDir = outDir + pos
            if not os.path.exists(curDir):
                os.makedirs(curDir)
            else:
                print("already exist " + curDir)
                continue
            counter = 1
            while True:
                body = save(base + pos, "", curDir, "/" + str(counter) + ".html")
                counter += 1
                hasNext = False
                # the "next page" link lives in the right-aligned footer cell
                for div in body.find_all("div", class_="footer-prev-next-cell"):
                    if div.get("style") == "text-align: right;":
                        hrefs = div.find_all("a")
                        if len(hrefs) > 0:
                            hasNext = True
                            pos = hrefs[0]['href']
                            print(">>next is at:" + pos)
                        break
                if not hasNext:
                    break
if __name__ == '__main__':
    crawl("https://www.***.com", "E:/Documents/CG/***")
    print("finish")
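With each lesson saved as 1.html, 2.html, ..., pdfkit can merge the pages into one PDF per lesson. One detail worth handling: the files must be sorted numerically, since "10.html" comes before "2.html" in lexicographic order. A sketch (the directory name is hypothetical; pdfkit.from_file accepts a list of input files and requires wkhtmltopdf on PATH):

```python
import os
import re


def pages_in_order(curDir):
    """Return the saved page files of one lesson, sorted numerically."""
    pages = [f for f in os.listdir(curDir) if re.fullmatch(r'\d+\.html', f)]
    pages.sort(key=lambda f: int(f.split('.')[0]))  # 2.html before 10.html
    return [os.path.join(curDir, f) for f in pages]


# Usage (assumes wkhtmltopdf is installed and on PATH):
# import pdfkit
# pdfkit.from_file(pages_in_order("E:/Documents/CG/lessons/foo"), "foo.pdf")
```

The regular expression also filters out head.html, which is the index page rather than lesson content.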