Scraping the Douban Movie Top 250 with Python 2.7 and writing it to TXT, Excel, and a MySQL database

1. Task

Scrape the Douban Movie Top 250
Save the results as a txt file
Save the results as an Excel workbook
Load the data into a database

2. Analysis

For how the Chinese movie titles are collected, see: http://www.cnblogs.com/carpenterworm/p/6026274.html
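As a rough sketch of that idea (the program in section 3 uses the same pattern and the same '/' filter), the snippet below pulls every title span out of a page and keeps only the ones without a slash. The sample HTML and the names SAMPLE, TITLE_RE, and chinese_titles are made up for illustration; the real Douban markup may differ.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical fragment modelled on one entry of the list page.
SAMPLE = (
    '<span class="title">肖申克的救赎</span>\n'
    '<span class="title"> / The Shawshank Redemption</span>\n'
)

# Same pattern as in the program: capture the text of each title span.
# '.' does not match a newline, so each match stays on one line.
TITLE_RE = re.compile(r'<span class="title">(.*)</span>')


def chinese_titles(html):
    """Return the title spans that do not contain '/'."""
    return [title for title in TITLE_RE.findall(html) if '/' not in title]


# Only the first span survives; the second contains '/' and is dropped.
titles = chinese_titles(SAMPLE)
```

The filter relies on the observation that alternate titles on the page are prefixed with '/'; a title that legitimately contains a slash would be dropped as well.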

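Collecting the movie links:

Each movie link sits inside a tag of the form <a class="" href=" ">, so it can be matched with re.compile(r'<a class="" href="(.*)">').

If you match with re.compile(r'<a class="" href="https://movie.douban.com/subject/(.*)/">') instead, the result is a number such as 1292052 rather than the full link.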
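To make the difference concrete, here is a small sketch that runs both patterns against a single anchor tag. The tag is a hand-written sample built from the ID mentioned above, and the names SAMPLE, LINK_RE, and ID_RE are illustrative only.

```python
import re

# Hand-written sample tag in the shape described above.
SAMPLE = '<a class="" href="https://movie.douban.com/subject/1292052/">'

# Captures the whole href value.
LINK_RE = re.compile(r'<a class="" href="(.*)">')
# Captures only the part between "subject/" and the trailing "/".
ID_RE = re.compile(r'<a class="" href="https://movie.douban.com/subject/(.*)/">')

links = LINK_RE.findall(SAMPLE)  # ['https://movie.douban.com/subject/1292052/']
ids = ID_RE.findall(SAMPLE)      # ['1292052']
```

Both groups use a greedy (.*). That works here because '.' does not match a newline and the sample has only one tag on the line; if several tags shared a line, a non-greedy (.*?) would likely be the safer choice.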

3. Program

The program is fairly simple overall, so it is given below without further explanation.

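#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys
import time

import MySQLdb
import openpyxl
import requests
from bs4 import BeautifulSoup

# Python 2 only: make UTF-8 the default codec so the Chinese titles
# can be written out without explicit encoding.
reload(sys)
sys.setdefaultencoding('utf-8')
print 'Fetching data from the Douban Movie Top 250......'
# --------------------------Lists that hold the scraped data-----------------------------#
nameList=[]
linkList=[]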

#---------------------------------Scraping module------------------------------------#
def topMovie():
    # Compile the patterns once; distinct names avoid clashing with the loop variables.
    namePattern=re.compile(r'<span class="title">(.*)</span>')
    linkPattern=re.compile(r'<a class="" href="(.*)">')
    for page in range(10): # 10 pages of 25 movies each
        url='https://movie.douban.com/top250?start='+str(page*25)
        print 'Fetching page ---'+str(page+1)+'---......'
        # Assumption: the site accepts the default requests User-Agent;
        # if it does not, pass a browser-like one via the headers= argument.
        html=requests.get(url)
        html.raise_for_status()
        try:
            soup=BeautifulSoup(html.text,'html.parser')
            soup=str(soup) # the regular expressions need the page as a string
            movieNames=re.findall(namePattern,soup)
            movieLinks=re.findall(linkPattern,soup)
            for name in movieNames:
                if name.find('/')==-1: # skip English titles (they contain '/')
                    nameList.append(name)
            for link in movieLinks:
                linkList.append(link)
        except Exception as e:
            print e
    print 'Scraping finished!'
    return nameList,linkList

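# ---------------------------------Save as a text file-----------------------------------#
def save_to_txt():
    print 'Saving txt file......'
    try:
        # "with" closes the file even if a write fails.
        with open('data.txt','w') as f:
            # zip stops at the shorter list, so fewer than 250 rows cannot raise IndexError.
            for name,link in zip(nameList,linkList):
                f.write(name)
                f.write('\t'*3)
                f.write(link)
                f.write('\n')
    except Exception as e:
        print e
    print 'Finished saving txt file!'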

# ---------------------------------Save as an Excel file-----------------------------------#
def save_to_Excel():
    print 'Saving Excel file......'
    try:
        wb=openpyxl.Workbook()
        sheet=wb.active # replaces the deprecated get_active_sheet()
        sheet.title='Movie Top 250'
        # Rows are 1-based: column A holds the title, column B the link.
        for i,(name,link) in enumerate(zip(nameList,linkList),1):
            sheet['A'+str(i)]=name
            sheet['B'+str(i)]=link
        wb.save(ur'豆瓣電影Top250.xlsx') # unicode literal keeps the Chinese file name intact
    except Exception as e:
        print e
    print 'Finished saving Excel file!'

# ---------------------------------Save to the database-----------------------------------#
def save_to_MySQL():
    print 'Saving to the MySQL database......'
    try:
        conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="******", db="test", charset="utf8")
        cursor = conn.cursor()
        print "Database connection established"
        cursor.execute('DROP TABLE IF EXISTS MovieTop250') # drop the table if it already exists
        time.sleep(3)
        cursor.execute('''CREATE TABLE IF NOT EXISTS MovieTop250(
                           movieName VARCHAR (200),
                           link VARCHAR (200))''')
        sql='INSERT INTO MovieTop250(movieName,link) VALUES (%s,%s)'
        for name,link in zip(nameList,linkList):
            cursor.execute(sql,(name,link)) # the driver escapes the parameters
        conn.commit() # commit once, after all rows are inserted
        cursor.close()
        conn.close()
    except Exception as e:
        print e
    print 'Finished saving to the MySQL database!'

# -------------------------------------Main module--------------------------------------#
if __name__=="__main__":
    try:
        topMovie()
        save_to_txt()
        save_to_Excel()
        save_to_MySQL()
    except Exception as e:
        print e
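
The insert loop in save_to_MySQL sends one statement per row. Python DB-API drivers, MySQLdb included, also provide cursor.executemany for inserting a batch in one call. The sketch below shows that shape, using the standard-library sqlite3 module purely as a stand-in so it runs without a MySQL server. That substitution, the helper name save_rows, and the sample rows are assumptions for illustration, not part of the original program. Note that sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3


def save_rows(conn, rows):
    """Recreate the table and insert all (movieName, link) pairs in one batch."""
    cursor = conn.cursor()
    cursor.execute('DROP TABLE IF EXISTS MovieTop250')
    cursor.execute(
        'CREATE TABLE MovieTop250 (movieName VARCHAR(200), link VARCHAR(200))'
    )
    # One call inserts every row; the driver binds the parameters.
    cursor.executemany(
        'INSERT INTO MovieTop250 (movieName, link) VALUES (?, ?)', rows
    )
    conn.commit()
    cursor.close()


# Placeholder rows standing in for zip(nameList, linkList).
rows = [
    ('Example title 1', 'https://movie.douban.com/subject/1292052/'),
    ('Example title 2', 'https://movie.douban.com/subject/0000000/'),
]

conn = sqlite3.connect(':memory:')  # in-memory database, nothing touches disk
save_rows(conn, rows)
count = conn.execute('SELECT COUNT(*) FROM MovieTop250').fetchone()[0]
```

With MySQLdb the same structure should carry over: pass the list of tuples to cursor.executemany together with the %s-style statement from the listing.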

