A simple Python crawler, to help understand the re module


Update 2016-12-03:

1. Used BS4 to parse the HTML.

2. Used mysql-connector to insert rows into a database table.

pip install mysql-connector
import urllib.request  
from bs4 import BeautifulSoup  
import re  
import mysql.connector

def getMovieInfo():
    url = "https://movie.douban.com"
    data = urllib.request.urlopen(url).read()
    page_data = data.decode('UTF-8')
    # print(page_data)
    soup = BeautifulSoup(page_data, "html.parser")
    # connect to MySQL
    conn = mysql.connector.connect(host='localhost', user='root',
                                   password='888888', database='test')
    cursor = conn.cursor()
    cursor.execute('delete from doubanmovie where 1=1')
    for link in soup.findAll('li', attrs={"data-actors": True}):
        moviename = link['data-title']
        actors = link['data-actors']
        region = link['data-region']
        release = link['data-release']
        duration = link['data-duration']
        director = link['data-director']
        rate = link['data-rate']
        imgsrc = link.img['src']
        cursor.execute("INSERT INTO doubanmovie VALUES ('', %s, %s, %s, %s, %s, %s, %s, %s, now())",
                       [moviename, actors, region, release, duration, director, rate, imgsrc])
        conn.commit()
        print('mysql', cursor.rowcount)
        print(link['data-title'])
        print('Actors:', link['data-actors'])
        print(link.img['src'])
    cursor.close()
    conn.close()

# call the function
getMovieInfo()
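The attrs={"data-actors": True} filter can be exercised offline against a small hand-written fragment. The HTML below and its data-* values are hypothetical stand-ins for the real Douban page markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the <li data-*> items on movie.douban.com
fragment = """
<ul>
  <li data-title="Movie A" data-actors="Actor One / Actor Two"
      data-rate="8.5"><img src="http://img.example.com/a.jpg"></li>
  <li class="other">no data-* attributes, so findAll skips this one</li>
</ul>
"""

soup = BeautifulSoup(fragment, "html.parser")
# attrs={"data-actors": True} keeps only tags that carry that attribute
movies = soup.findAll("li", attrs={"data-actors": True})
for li in movies:
    print(li["data-title"], "|", li["data-actors"], "|", li.img["src"])
```

Only the first li is returned, because passing True as an attribute value matches on the attribute's presence, not on any particular value.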

 

Update: a crawler tutorial based on Python 3

Differences between the two versions of the code:

1. In Python 3, urllib.urlopen becomes urllib.request.urlopen; all the earlier urllib calls need the request submodule added.

2. In Python 3, print needs parentheses, i.e.: print()

3. In Python 3,

html = urllib.request.urlopen(url).read() returns bytes, which must be decoded to UTF-8,
with: html = html.decode('utf-8')
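The bytes-vs-str difference in point 3 can be seen without any network access; a minimal sketch, where the encoded literal stands in for what urlopen(url).read() would return:

```python
# read() in Python 3 returns bytes; decode() turns them into str
data = "豆瓣 movie".encode("utf-8")   # stand-in for urlopen(url).read()
assert isinstance(data, bytes)

html = data.decode("utf-8")          # the conversion step described above
assert isinstance(html, str)
print(html)
```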
#coding=utf-8
import urllib.request
import re

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    html = html.decode('utf-8')
    return html

def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1
    return imglist


html = getHtml("http://tieba.baidu.com/p/2460150866")

print(getImg(html))

 

 

The following is based on Python 2:

The filtered image addresses are iterated with a for loop and saved locally; the code is as follows:

 

#coding=utf-8
import urllib
import re

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl,'%s.jpg' % x)
        x+=1
    return imglist


html = getHtml("http://tieba.baidu.com/p/2460150866")

print getImg(html)

 

  The core here is the urllib.urlretrieve() method, which downloads remote data directly to local disk.

  We also created the getImg() function to filter out the image links we need from the fetched page. The re module provides the regular-expression support:

  re.compile() compiles a regular-expression pattern into a regex object.

  re.findall() returns every match of imgre (the compiled regex) found in html.

    Running the script yields the URLs of the images contained in the page.
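The compile-then-findall pattern from getImg() can be tried against a hand-written snippet; the URLs below are hypothetical, standing in for a live tieba page:

```python
import re

# Hypothetical HTML standing in for the fetched page source
html = ('<img src="http://imgsrc.example.com/1.jpg" pic_ext="jpeg">'
        '<img src="http://imgsrc.example.com/2.jpg" pic_ext="jpeg">')

imgre = re.compile(r'src="(.+?\.jpg)" pic_ext')  # compile once, reuse
imglist = re.findall(imgre, html)                # list of captured groups
print(imglist)
```

Because the pattern contains one capture group, re.findall() returns just the captured URLs, not the full src="..." pic_ext matches. The non-greedy .+? stops at the first .jpg rather than swallowing the rest of the line.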

