Ultra-Detailed Python: One-Click Crawling of Image, Audio, and Video Resources

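Python can crawl resource files such as images, audio, and video from any web page. The usual approach is to request the page's HTML and then pull out the resources you want with XPath or regular expressions. I built a crawler tool that collects media files in one click. Note that it only finds files already present in the HTML; resources that need a second request, such as the KuGou Music player page, cannot be crawled, because the tool is meant to be generic and work across different sites! 😀😀😀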
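As a rough illustration of the "fetch the HTML, then match with a regular expression" approach, here is a minimal sketch. The sample markup, the pattern, and the name find_image_urls are assumptions made for this example, not the tool's own code, and the extension list is deliberately short.

```python
import re

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = '''
<html><body>
  <img src="https://example.com/a.png">
  <img src='https://example.com/b.jpg' alt="b">
  <a href="https://example.com/page.html">link</a>
</body></html>
'''

# Quoted attribute values that start with http(s) and end in an image extension.
IMAGE_URL_RE = re.compile(
    r'''["'](https?://[^"']+?\.(?:png|jpe?g|gif))["']''',
    re.IGNORECASE,
)


def find_image_urls(html):
    """Return image URLs found in html, in page order, without duplicates."""
    urls = []
    for match in IMAGE_URL_RE.finditer(html):
        url = match.group(1)
        if url not in urls:
            urls.append(url)
    return urls


image_urls = find_image_urls(SAMPLE_HTML)
print(image_urls)
```

A pattern like this only sees URLs that appear literally in the markup, which is the same limitation described above for pages that load media through a second request.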

The main focus here is image crawling: if you need image material, enter a URL and crawl it in one click!


Also, when crawling video, magnet links are collected as well; you can download them with a third-party download tool! 🤗

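Code

Crawling resource files

One thing to note: some image resources are not URL links but are embedded in data:image format, so they need to be converted before they can be stored!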
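To make the data:image conversion concrete, here is a minimal sketch of splitting such a URI into a file extension and raw bytes. The function name decode_data_uri and the sample payload are assumptions for illustration; it uses the standard base64.b64decode, which may differ from the decoder the tool itself calls.

```python
import base64

# A tiny made-up payload; real pages embed full image bytes here.
SAMPLE_DATA_URI = (
    'data:image/png;base64,'
    + base64.b64encode(b'not-a-real-png').decode('ascii')
)


def decode_data_uri(data_uri):
    """Split a base64 data:image URI into (extension, raw_bytes)."""
    header, payload = data_uri.split('base64,', 1)
    # header looks like 'data:image/png;', so keep the part after the slash
    extension = header.split('/')[-1].rstrip(';')
    # Restore any '=' padding that was stripped from the payload
    payload += '=' * (-len(payload) % 4)
    return extension, base64.b64decode(payload)


extension, raw_bytes = decode_data_uri(SAMPLE_DATA_URI)
print(extension, len(raw_bytes))
```

The returned bytes can then be written to disk with open(file_name, 'wb'), using the extension to build the file name.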

import base64

# requestsDataBase, repairUrl, zfjTools and the *Type_list globals
# belong to the tool and are not shown in this snippet.

def getResourceUrlList(url, isImage, isAudio, isVideo):
    global imgType_list, audioType_list, videoType_list
    imageUrlList = []
    audioUrlList = []
    videoUrlList = []

    url = url.rstrip().rstrip('/')
    htmlStr = str(requestsDataBase(url))

    # Cache the page on disk, then scan it line by line
    with open('reptileHtml.txt', 'w', encoding='utf-8') as wFile:
        wFile.write(htmlStr)

    with open('reptileHtml.txt', 'r', encoding='utf-8') as rFile:
        for line in rFile:
            # Normalize quotes so every attribute value is delimited by "
            segmenterStr = '"'
            line = line.replace("'", segmenterStr)
            for partLine in line.split(segmenterStr):
                if isImage:
                    # Inline images (data:image/...;base64,...)
                    if 'data:image' in partLine and 'base64,' in partLine:
                        base64List = partLine.split('base64,')
                        payload = base64List[-1]
                        # Pad to a multiple of 4 before decoding; imgData holds
                        # the bytes to store (saving is not shown in this snippet)
                        imgData = base64.urlsafe_b64decode(payload + '=' * (-len(payload) % 4))
                        base64ImgType = base64List[0].split('/')[-1].rstrip(';')
                        imageName = zfjTools.getTimestamp() + '.' + base64ImgType
                        imageUrlList.append(imageName + '$==$' + base64ImgType)
                    # Images referenced by URL
                    for imageType in imgType_list:
                        if imageType in partLine:
                            imgUrl = partLine[:partLine.find(imageType) + len(imageType)].split(segmenterStr)[-1]
                            # Repair the URL
                            imgUrl = repairUrl(imgUrl, url)
                            imgUrl = imgUrl.replace('_{size}', '').strip()
                            if imgUrl.startswith(('http://', 'https://')) and imgUrl not in imageUrlList:
                                imageUrlList.append(imgUrl)
                if isAudio:
                    # Audio
                    for audioType in audioType_list:
                        if audioType in partLine or audioType.lower() in partLine:
                            matchedType = audioType.lower() if audioType.lower() in partLine else audioType
                            audioUrl = partLine[:partLine.find(matchedType) + len(matchedType)].split(segmenterStr)[-1]
                            # Repair the URL
                            audioUrl = repairUrl(audioUrl, url)
                            if audioUrl.startswith(('http://', 'https://')) and audioUrl not in audioUrlList:
                                audioUrlList.append(audioUrl)
                if isVideo:
                    # Video (including ed2k, magnet and ftp links)
                    for videoType in videoType_list:
                        if videoType in partLine or videoType.lower() in partLine:
                            matchedType = videoType.lower() if videoType.lower() in partLine else videoType
                            videoUrl = partLine[:partLine.find(matchedType) + len(matchedType)].split(segmenterStr)[-1]
                            # Repair the URL
                            videoUrl = repairUrl(videoUrl, url)
                            if videoUrl.startswith(('http://', 'https://', 'ed2k://', 'magnet:?', 'ftp://')) and videoUrl not in videoUrlList:
                                videoUrlList.append(videoUrl)

    return (imageUrlList, audioUrlList, videoUrlList)
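The function relies on a repairUrl helper that is not shown in the article. As a hedged sketch of what such a step could look like, the example below resolves relative links with the standard library's urllib.parse.urljoin and then keeps only links with an accepted prefix. The names resolve_url, collect_links, and ALLOWED_PREFIXES are hypothetical, and this is not the author's implementation.

```python
from urllib.parse import urljoin

# Prefixes accepted for video links in the function above.
ALLOWED_PREFIXES = ('http://', 'https://', 'ed2k://', 'magnet:?', 'ftp://')


def resolve_url(candidate, page_url):
    """Turn a relative or protocol-relative link into an absolute one."""
    return urljoin(page_url, candidate.strip())


def collect_links(candidates, page_url):
    """Resolve each candidate, keep allowed schemes, and drop duplicates."""
    links = []
    for candidate in candidates:
        absolute = resolve_url(candidate, page_url)
        # A tuple of prefixes keeps the duplicate check tied to every prefix
        if absolute.startswith(ALLOWED_PREFIXES) and absolute not in links:
            links.append(absolute)
    return links


links = collect_links(
    [
        'img/a.png',
        '//cdn.example.com/b.jpg',
        'magnet:?xt=urn:btih:abc',
        'javascript:void(0)',
        'img/a.png',
    ],
    'https://example.com/gallery/index.html',
)
print(links)
```

Passing a tuple to str.startswith also avoids a precedence pitfall: in a chained `a or b and c` test, `and` binds more tightly than `or`, so a trailing "not in list" check would only guard the last prefix.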

Crawling custom nodes

from lxml import etree

# Generic node crawling: for each node matched by fatherNode,
# keep the first result of childNode.
def getNoteInfors(url, fatherNode, childNode):
    url = url.rstrip().rstrip('/')
    htmlStr = requestsDataBase(url)

    # Keep a copy of the page on disk
    with open('reptileHtml.txt', 'w', encoding='utf-8') as wFile:
        wFile.write(htmlStr)

    html_etree = etree.HTML(htmlStr)
    dataArray = []
    if html_etree is not None:
        nodes_list = html_etree.xpath(fatherNode)
        for k_value in nodes_list:
            partValue = k_value.xpath(childNode)
            if len(partValue) > 0:
                dataArray.append(partValue[0])
    return dataArray
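The parent/child lookup can be tried without any network access. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, so that the example has no third-party dependency; the tool itself parses with etree.HTML. The sample markup and the name first_child_per_parent are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real pages usually need a lenient HTML parser.
SAMPLE_XHTML = '''
<html><body>
  <ul>
    <li class="item"><a href="/post/1">First post</a></li>
    <li class="item"><a href="/post/2">Second post</a></li>
    <li class="item"><span>No link here</span></li>
  </ul>
</body></html>
'''


def first_child_per_parent(markup, parent_path, child_path):
    """For each parent match, keep the first child match (if any)."""
    root = ET.fromstring(markup.strip())
    results = []
    for parent in root.findall(parent_path):
        children = parent.findall(child_path)
        if children:
            results.append(children[0])
    return results


anchors = first_child_per_parent(SAMPLE_XHTML, ".//li[@class='item']", 'a')
titles = [anchor.text for anchor in anchors]
hrefs = [anchor.get('href') for anchor in anchors]
print(titles, hrefs)
```

Parents without a matching child are skipped, which mirrors the len(partValue) > 0 check in getNoteInfors.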

Software

Software download: gitee.com/zfj1128/ZFJ…

Tutorial videos

Resource crawling: link: pan.baidu.com/s/1xa9ruF_h… password: 1zpg

Node crawling: link: pan.baidu.com/s/1ebWWYtjo… password: cosa

Screenshots of the tool in use:


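The text and images in this article come from the internet, combined with my own ideas. They are for study and exchange only and have no commercial purpose. Copyright belongs to the original authors; if there is any problem, please contact us promptly so we can handle it.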
