Web scraping: fetching the Douban Movie Top 100 with regular expressions and with BeautifulSoup

This article is a repost. 2017-10-07 00:23

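(1) Regular expressions:

1. Fetch the HTML content:

    html=urllib.request.urlopen(url)

    html=html.read().decode('utf-8')  # mind the encoding

2. Write a regular expression for each piece of information you need, then extract the matches:

    key=re.compile(r'regular expression')

    information=re.findall(key,html)

3. Clean the data to get accurate values:

    a. Surrounding whitespace: string.strip()

    b. Splitting on a delimiter: string.split()

    c. A recurring marker character: string.find()==-1 means the marker is absent

4. Print the extracted information.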
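Steps 2 to 4 can be tried without touching the network. The following is a minimal sketch that runs them on a made-up HTML snippet; `sample_html` and the movie titles in it are assumptions for illustration, not data fetched from Douban.

```python
import re

# Hypothetical sample markup, loosely modelled on a list page; not fetched from any site.
sample_html = """
<span class="title">The Shawshank Redemption</span>
<span class="title">&nbsp;/&nbsp;Alternate Title</span>
<span class="title">  Farewell My Concubine  </span>
"""

# Step 2: compile a non-greedy pattern and collect every match.
title_pattern = re.compile(r'<span class="title">(.*?)</span>')
raw_titles = re.findall(title_pattern, sample_html)

# Step 3: skip entries that contain the '/' marker, then trim surrounding whitespace.
titles = [t.strip() for t in raw_titles if t.find('/') == -1]

# split() breaks a delimited string into parts; strip() tidies each part.
genres = [g.strip() for g in "Crime / Drama".split('/')]

# Step 4: print the cleaned values.
for rank, title in enumerate(titles, start=1):
    print("Top " + str(rank) + ": " + title)
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run to the last `</span>` on the same line if several tags shared one line.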

The complete code is as follows:

#encoding:utf-8
import urllib.request
import re
# Fetch the content behind a single movie link (this version extracts only the score)
def get_score_information(link):
	html=urllib.request.urlopen(link)
	html=html.read().decode("utf-8")
	score_patten=re.compile(r'<strong class="[^"]*" property="[^"]*">(.*?)</strong>')
	score_all=re.findall(score_patten,html)
	for score in score_all:
		print ("The movie score is :"+score)

count=1
url="https://movie.douban.com/top250?start="
urls=[url+str(num*25) for num in range(4)]  # list comprehension: 4 pages of 25 entries cover the Top 100
for one in urls:
	html=urllib.request.urlopen(one)
	html=html.read().decode("utf-8")
	# regular expression that extracts the movie titles
	title_patten=re.compile(r'<span class="title">(.*?)</span>')
	# regular expression that extracts the movie links
	link_patten=re.compile(r'<a href="(.*?)" class="">')
	title_all=re.findall(title_patten,html)
	link_all=re.findall(link_patten,html)
	# some matched titles are alternate names in another format (they contain '/'), so keep only the clean ones in a new list
	title_arr=[]
	for each in title_all:
		if each.find('/')==-1:
			title_arr.append(each)
	for title,link in zip(title_arr,link_all):
		print ("Top "+str(count)+":\t"+title+"\tlink:\t"+link)
		get_score_information(link)
		count+=1
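The script above calls urlopen with a bare URL. Some sites reject requests that carry Python's default User-Agent, so a fetch helper with an explicit header and a timeout can be more robust. The sketch below is one possible shape and is not part of the original script: `build_request`, `fetch_html`, `USER_AGENT` and `TIMEOUT_SECONDS` are hypothetical names and values chosen for illustration.

```python
import urllib.request

# Assumed values for illustration; neither appears in the original script.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"
TIMEOUT_SECONDS = 10

def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, encoding="utf-8"):
    """Download a page and decode it; undecodable bytes are replaced instead of raising."""
    with urllib.request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as response:
        return response.read().decode(encoding, errors="replace")

# Building the request does not touch the network, so it is safe to inspect.
request = build_request("https://movie.douban.com/top250?start=0")
print(request.full_url)
print(request.get_header("User-agent"))
```

With a helper like this, the two-line urlopen/read/decode pattern in the script could be replaced by a single `fetch_html(one)` call.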

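The drawback of the regular-expression approach: to extract several pieces of information at once, you either have to write a single expression of unpredictable length, or write one expression per piece and then collate the separate results before printing them.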
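To make that trade-off concrete, here is a minimal sketch of the "one long expression" route, using named groups to pull link, title and score in a single pass. The markup in `sample_html` is invented for illustration and the real page structure may differ.

```python
import re

# Hypothetical sample markup; the real page structure may differ.
sample_html = """
<div class="item">
  <a href="https://example.com/subject/1/" class="">
    <span class="title">First Movie</span>
  </a>
  <span class="rating_num">9.7</span>
</div>
<div class="item">
  <a href="https://example.com/subject/2/" class="">
    <span class="title">Second Movie</span>
  </a>
  <span class="rating_num">9.6</span>
</div>
"""

# One pattern, three named groups. re.S lets '.' cross line breaks,
# and every '.*?' is non-greedy so each match stays inside one item.
item_pattern = re.compile(
    r'<a href="(?P<link>[^"]+)" class="">'
    r'.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num">(?P<score>[\d.]+)</span>',
    re.S,
)

movies = [m.groupdict() for m in item_pattern.finditer(sample_html)]
for movie in movies:
    print(movie["title"], movie["score"], movie["link"])
```

Even with only three fields the pattern is already hard to read, and it breaks as soon as the order of the tags changes, which is the problem described above.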

(2) BeautifulSoup

#encoding:utf-8
import requests
from bs4 import BeautifulSoup
number=1
url='https://movie.douban.com/top250?start=0'
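# fetch the HTML content
html=requests.get(url)
# .text returns the response body decoded to a string
html=html.text
# create the soup object; 'html.parser' selects Python's built-in HTML parser
soup=BeautifulSoup(html,'html.parser')
# find the tags that hold the movie titles
movie_title=soup.find_all("span",class_="title")
# clean the data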
title_arr=[]
for title in movie_title:
	if title.text.find('/')==-1:
		title_arr.append(title.text)
# note: link.a.attrs['href'] reads an attribute of the <a> tag inside each div, which makes the value easy to extract
#<a href="">
movie_link=soup.find_all('div',class_="hd")
for title_one ,link in zip(title_arr,movie_link):
	print ("Top "+str(number)+"	:	"+title_one+"	movie link:	"+link.a.attrs['href'])
	number+=1

Summary: BeautifulSoup makes it easy to pull several pieces of information out of a page by navigating its tags.
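As an illustration of that point, the sketch below collects title, link and score for each entry by walking one container tag at a time. It runs on invented markup: `sample_html` and the `item` and `rating_num` class names are assumptions for the example, so they should be checked against the real page before reuse.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; class names are assumed for illustration.
sample_html = """
<div class="item">
  <div class="hd">
    <a href="https://example.com/subject/1/" class="">
      <span class="title">First Movie</span>
      <span class="title">&nbsp;/&nbsp;Alternate Title</span>
    </a>
  </div>
  <span class="rating_num">9.7</span>
</div>
<div class="item">
  <div class="hd">
    <a href="https://example.com/subject/2/" class="">
      <span class="title">Second Movie</span>
    </a>
  </div>
  <span class="rating_num">9.6</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
movies = []
for item in soup.find_all("div", class_="item"):
    # .a is the first <a> inside the "hd" div; .span is the first <span> inside that link.
    link_tag = item.find("div", class_="hd").a
    movies.append({
        "title": link_tag.span.text.strip(),
        "link": link_tag["href"],
        "score": item.find("span", class_="rating_num").text,
    })

for movie in movies:
    print(movie["title"], movie["score"], movie["link"])
```

Because each field is looked up inside its own container, titles and links cannot drift out of step the way two separately zipped lists can.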

Refactoring the code with BeautifulSoup:

#encoding:utf-8
import requests
from bs4 import BeautifulSoup

# after scraping a movie link, parse the linked page to get the score and the plot summary
def get_information(link):
	html=requests.get(link)
	soup=BeautifulSoup(html.text,'html.parser')
	movie_score=soup.find_all("strong",class_="ll rating_num")
	for score in movie_score:	
		print ("The movie score is :		"+score.text)
	movie_detail=soup.find_all("div",class_="related-info")
	for detail in movie_detail:
		# clean the scraped plot summary: take the second and third lines of the text and trim them
		lines=detail.span.text.split('\n')
		print ("The movie detail is  :\t\t"+lines[1].strip()+lines[2].strip())

number=1		
for i in range(4):  # the Top 100 needs 4 pages; use range(10) for all 250 entries
	url='https://movie.douban.com/top250?start={}'.format(i*25)
	html=requests.get(url)
	soup=BeautifulSoup(html.text,'html.parser')
	#movie_title=[]
	#movie_link=[]
	movie_all=soup.find_all("div",class_="hd")
	for each in movie_all:
		#print (each.a.span.text)
		print ("-"*100)
		print ("The movie is Top  :		"+str(number))
		print ("The movie name is :		"+each.a.span.text)
		print ("The movie link is :		"+each.a['href'])
		get_information(each.a['href'])
		number+=1
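The page count in the loop above is hard-coded. A minimal sketch of deriving it from the number of entries wanted is shown below; `page_urls`, `BASE_URL` and `PAGE_SIZE` are hypothetical names, and the page size of 25 is inferred from the `i*25` offset in the script.

```python
BASE_URL = "https://movie.douban.com/top250?start={}"
PAGE_SIZE = 25  # entries per list page, as implied by the i*25 offset in the script

def page_urls(top_n):
    """Return the list-page URLs needed to cover the first top_n entries."""
    page_count = -(-top_n // PAGE_SIZE)  # ceiling division
    return [BASE_URL.format(page * PAGE_SIZE) for page in range(page_count)]

print(len(page_urls(100)))
print(page_urls(250)[-1])
```

Iterating over `page_urls(100)` would then replace the `range(4)` loop and the inline `format` call.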
