I recently had a scraping-related need, so I found a video on Bilibili (link at the end of this post), followed along, and put together a small program. It is essentially unchanged from the tutorial; the only difference is the final storage step, which writes to an Excel file instead of a txt file.
- Brief requirement: scrape the data of the Maoyan Movies TOP100 list
- Language: Python
- Tool: PyCharm
- Libraries: requests, re, openpyxl (a library for the newer .xlsx Excel format)
Implementation code
```python
# -*- coding: utf-8 -*-
# @Author  : yocichen
# @Email   : yocichen@126.com
# @File    : maoyan100.py
# @Software: PyCharm
# @Time    : 2019
# @UpdateTime : 2020/4/26

import requests
from requests import RequestException
import re
import openpyxl
import traceback

# Get page's html by requests module
def get_one_page(url):
    try:
        headers = {
            'user-agent': 'Mozilla / 5.0(Windows NT 10.0; WOW64) AppleWebKit / 537.36(KHTML, likeGecko) Chrome / 53.0.2785.104Safari / 537.36Core / 1.53.4882.400QQBrowser / 9.7.13059.400'
        }
        # Sometimes, the proxies need to be replaced.
        # You can get them by accessing https://www.kuaidaili.com/free/inha/
        proxies = {
            'http': '60.190.250.120:8080'
        }
        # use headers to avoid 403 Forbidden Error (reject spider)
        response = requests.get(url, headers=headers, proxies=proxies)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        traceback.print_exc()
        return None

# Get useful info from html of a page by re module
def parse_one_page(html):
    try:
        pattern = re.compile(
            '<dd>.*?board-index.*?>(\d+)<.*?<a.*?title="(.*?)"'
            + '.*?data-src="(.*?)".*?</a>.*?star">[\\s]*(.*?)[\\n][\\s]*</p>.*?'
            + 'releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?'
            + 'fraction">(.*?)</i>.*?</dd>', re.S)
        items = re.findall(pattern, html)
        return items
    except Exception:
        traceback.print_exc()
        return []

# Main call function
def main(url):
    page_html = get_one_page(url)
    parse_res = parse_one_page(page_html)
    return parse_res

# Write the useful info in excel (*.xlsx file)
def write_excel_xlsx(items):
    wb = openpyxl.Workbook()
    ws = wb.active
    rows = len(items)
    cols = len(items[0])
    # First, write col's title.
    ws.cell(1, 1).value = '編號'
    ws.cell(1, 2).value = '片名'
    ws.cell(1, 3).value = '宣傳圖片'
    ws.cell(1, 4).value = '主演'
    ws.cell(1, 5).value = '上映時間'
    ws.cell(1, 6).value = '評分'
    # Write film's info
    for i in range(0, rows):
        for j in range(0, cols):
            if j != 5:
                ws.cell(i+2, j+1).value = items[i][j]
            else:
                # The score's integer and fraction parts are merged into one cell
                ws.cell(i+2, j+1).value = items[i][j] + items[i][j+1]
                break
    # Save the work book as *.xlsx
    wb.save('maoyan_top100.xlsx')

if __name__ == '__main__':
    print('spider working...')
    res = []
    url = 'https://maoyan.com/board/4?'
    for i in range(0, 10):
        if i == 0:
            res = main(url)
        else:
            newUrl = url + 'offset=' + str(i * 10)
            res.extend(main(newUrl))
    print('writing into excel...')
    write_excel_xlsx(res)
    print('work done!\nNote: the data is in the current directory.')
```
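After the script finishes, a quick way to confirm the spreadsheet was written correctly is to read it back with openpyxl. This is a minimal sketch, assuming the maoyan_top100.xlsx file produced by the script above is in the current directory:

```python
import openpyxl

# Open the workbook written by write_excel_xlsx and print the first few rows
wb = openpyxl.load_workbook('maoyan_top100.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=1, max_row=6, values_only=True):
    print(row)  # header row first, then the first few films
```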
Updated screenshot of the result:
Postscript
After getting a bit of a feel for this, I realized that when scraping data with regular expressions and the requests library, the key work is analyzing the structure of the HTML page and constructing the regular expression; the rest is little more than swapping in different URLs.
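For instance, in the script above the only thing that changes from page to page is the offset query parameter, so the ten page URLs can be built in one line (a small sketch mirroring the script's pagination, where the first page is the base URL without an offset):

```python
# Build the 10 page URLs used by the script above
base = 'https://maoyan.com/board/4?'
urls = [base] + [base + 'offset=' + str(i * 10) for i in range(1, 10)]
print(urls[1])  # https://maoyan.com/board/4?offset=10
```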
A supplementary example: analyzing the HTML and constructing a regex
Inspecting the elements, we can see that each entry follows the <dd>...</dd> format.
Say I want to extract the movie title and the score; first, let's take a look at the HTML.
Let's try to construct a regex:
'.*?<dd>.*?movie-item-title.*?title="(.*?)">.*?integer">(.*?)<.*?fraction">(.*?)<.*?</dd>' (written off the top of my head, not verified)
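A quick way to check a hand-written pattern like this is to run it against a small, made-up HTML fragment. The fragment below is hypothetical (the real Maoyan markup may differ), but it shows how the three groups would be captured:

```python
import re

# A made-up <dd> block for testing only; not taken from the real page
html = '''
<dd>
  <p class="movie-item-title" title="Some Movie">
    <a href="/films/0000">Some Movie</a>
  </p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

pattern = re.compile(
    '.*?<dd>.*?movie-item-title.*?title="(.*?)">'
    '.*?integer">(.*?)<.*?fraction">(.*?)<.*?</dd>', re.S)

print(re.findall(pattern, html))
# [('Some Movie', '9.', '5')]
```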
References
[Bilibili video: 2018 Python 3.6 web-crawler tutorial] https://www.bilibili.com/video/av19057145/?p=14
[Maoyan robots.txt] https://maoyan.com/robots.txt (it's best to check this before scraping to see what is and isn't allowed to be crawled)