Update 2016-12-03:
1. Used BS4 to parse the HTML
2. Used mysql-connector to insert the data into a database table
pip install mysql-connector
import urllib.request
from bs4 import BeautifulSoup
import mysql.connector

def getMovieInfo():
    url = "https://movie.douban.com"
    data = urllib.request.urlopen(url).read()
    page_data = data.decode('UTF-8')
    # print(page_data)
    soup = BeautifulSoup(page_data, "html.parser")

    # Connect to MySQL
    conn = mysql.connector.connect(host='localhost', user='root', password='888888', database='test')
    cursor = conn.cursor()
    cursor.execute('delete from doubanmovie where 1=1')

    for link in soup.findAll('li', attrs={"data-actors": True}):
        moviename = link['data-title']
        actors = link['data-actors']
        region = link['data-region']
        release = link['data-release']
        duration = link['data-duration']
        director = link['data-director']
        rate = link['data-rate']
        imgsrc = link.img['src']
        cursor.execute("INSERT INTO doubanmovie VALUES ('', %s, %s, %s, %s, %s, %s, %s, %s, now())",
                       [moviename, actors, region, release, duration, director, rate, imgsrc])
        conn.commit()
        print('mysql', cursor.rowcount)
        print(link['data-title'])
        print('Actors:', link['data-actors'])
        print(link.img['src'])

    cursor.close()
    conn.close()

# Call the function
getMovieInfo()
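The INSERT above uses parameterized placeholders rather than string concatenation, so the driver escapes the values and SQL injection is avoided. A minimal sketch of the same pattern using the standard-library sqlite3 module (an assumption for illustration only: sqlite3 uses `?` placeholders where mysql-connector uses `%s`, and the two-column table and sample row here are made up):

```python
import sqlite3

# In-memory database stands in for the MySQL 'test' database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE doubanmovie (title TEXT, rate TEXT)')

# Parameterized insert: the driver quotes the values for us
cursor.execute('INSERT INTO doubanmovie VALUES (?, ?)', ['La La Land', '8.3'])
conn.commit()
print('rows inserted:', cursor.rowcount)

cursor.execute('SELECT title, rate FROM doubanmovie')
rows = cursor.fetchall()
print(rows)
conn.close()
```

The same `cursor.execute(sql, params)` call shape carries over to mysql-connector unchanged; only the placeholder token differs.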
Update: a crawler tutorial based on Python 3
Differences between the two versions of the code:
1. In Python 3, urllib.urlopen becomes urllib.request.urlopen; the earlier urllib calls all need the request prefix.
2. In Python 3, print requires parentheses, i.e.: print()
3. In Python 3, html = urllib.request.urlopen(url).read() returns bytes, which must be decoded into text, e.g. as UTF-8:
html = html.decode('utf-8')
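The bytes-versus-str point in item 3 can be demonstrated without any network access; the literal below merely simulates what read() returns:

```python
# Simulate the bytes object that urllib.request.urlopen(url).read() returns
raw = '豆瓣電影'.encode('utf-8')
print(type(raw))   # <class 'bytes'>

# Decode the byte string back into text before using it as a string
html = raw.decode('utf-8')
print(type(html))  # <class 'str'>
```

Calling string methods or regexes on the raw bytes in Python 3 raises a TypeError, which is why the decode step is mandatory.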
#coding=utf-8
import urllib.request
import re

# Fetch a page and decode the response bytes into text
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    html = html.decode('utf-8')
    return html

# Filter the image links out of the page and save each one locally
def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1
    return imglist

html = getHtml("http://tieba.baidu.com/p/2460150866")
print(getImg(html))
The following is based on Python 2:
Iterate over the filtered image URLs with a for loop and save each one locally; the code is as follows:
#coding=utf-8
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1
    return imglist

html = getHtml("http://tieba.baidu.com/p/2460150866")
print getImg(html)
The core here is the urllib.urlretrieve() method, which downloads remote data directly into a local file.
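In Python 3 the same method lives at urllib.request.urlretrieve(), and it also accepts file:// URLs, so the download step can be sketched offline. The temp-file names below are placeholders standing in for a remote image:

```python
import os
import tempfile
import urllib.request

# Create a local file to stand in for an image hosted on the web
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'remote.jpg')
with open(src, 'wb') as f:
    f.write(b'fake jpeg bytes')

# urlretrieve downloads the URL straight into the target file
url = 'file://' + urllib.request.pathname2url(src)
dest, headers = urllib.request.urlretrieve(url, os.path.join(workdir, '0.jpg'))
print(dest)
```

urlretrieve returns a (filename, headers) tuple; in the crawler above only the side effect of writing '%s.jpg' files is used.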
We also created a getImg() function to filter out the image links we need from the fetched page. The re module provides regular-expression support:
re.compile() compiles a regular-expression pattern into a regex object.
re.findall() finds all the substrings of html that match imgre (the compiled pattern).
Running the script yields the URL of every matching image in the page.
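The filtering step can be checked against a hand-written HTML snippet; the URLs and the pic_ext attribute below are made up to mimic the old Tieba markup the pattern targets:

```python
import re

# A fragment shaped like the Tieba image tags the pattern matches
html = ('<img src="http://example.com/a.jpg" pic_ext="jpeg">'
        '<img src="http://example.com/b.jpg" pic_ext="jpeg">')

# Same pattern as the crawler: capture the .jpg URL before " pic_ext"
imgre = re.compile(r'src="(.+?\.jpg)" pic_ext')
imglist = re.findall(imgre, html)
print(imglist)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

Because the pattern hard-codes the ` pic_ext` suffix, it only matches pages using that exact attribute layout, which is why this kind of regex scraping is brittle compared with the BS4 approach above.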