The code uses the bs4 and requests packages. Here I mainly provide the code itself; for a video walkthrough I recommend https://www.bilibili.com/video/av14109284/?p=1, which I personally found to be an excellent course!
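If you still need to set up the environment, the dependencies can be installed with pip. This is just my assumed setup, not part of the original post; lxml is included because BeautifulSoup is created with the 'lxml' parser, and xlwt handles the Excel output:

pip install beautifulsoup4 requests lxml xlwt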
from bs4 import BeautifulSoup
import requests
import time
import xlwt

# iQiyi list pages; range(1, 2) only generates page 1.
urls = ['http://list.iqiyi.com/www/1/4-------------11-{}-1-iqiyi--.html'.format(str(i)) for i in range(1, 2)]

def get_webData(url):
    data_info = []
    time.sleep(3)                      # pause between requests
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.site-piclist_info > div.mod-listTitle_left > p > a')
    imgs = soup.select('div.site-piclist_pic > a > img')
    actors = soup.select('div.site-piclist_info > div.role_info')
    spans = soup.select('div.site-piclist_info > div.mod-listTitle_left > span')
    for title, img, actors, span in zip(titles, imgs, actors, spans):
        data = {
            'title': title.get_text(),
            'img': img.get('src'),
            'actors': list(actors.stripped_strings),
            'span': span.get_text()
        }
        #print(data)
        dataVal_list.append(data.values())
        data_info.append(data)
    #print(list(data.keys()))
    # Write the header row from the dict keys, then one row per scraped record.
    for m, n in enumerate(list(data.keys())):
        sheet.write(0, m, n)
    for i, p in enumerate(dataVal_list):
        for j, q in enumerate(p):
            #print(q)
            sheet.write(i + 1, j, q)
    return data_info

def main():
    for url in urls:
        data_info = get_webData(url)
        #print(data_info)
        print('---------------------------\n')

if __name__ == '__main__':
    f = xlwt.Workbook(encoding='utf-8')
    sheet = f.add_sheet('Movies')
    dataVal_list = []
    main()
    f.save('Movies.xls')
There is one small problem here: the Excel-writing part of the code can only handle a single page of data, and scraping further pages raises an error. The first two arguments of sheet.write() are the row and column indices in the sheet, and because this writing logic lives inside get_webData(), the row and column start from 0 again on every call, so xlwt complains about overwriting cells that were already written. The scraping code itself is fine; to scrape other pages, just observe how the URLs differ from page to page. A sketch of one possible fix follows, and after that a screenshot of the resulting Excel file.
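A minimal sketch of one way around the row-reset issue, assuming get_webData() is trimmed down to only scrape and return its records (no sheet.write() calls inside it): collect everything in main() and do all the Excel writing in one place, so the row index keeps growing across pages. The helper name write_to_sheet and the row bookkeeping are my own additions, not part of the original post.

def write_to_sheet(sheet, all_records):
    # Write the header row once, using the keys of the first record.
    for col, key in enumerate(all_records[0].keys()):
        sheet.write(0, col, key)
    # Write every record on its own row; the index never resets because
    # all writing happens here, after every page has been scraped.
    for row, record in enumerate(all_records, start=1):
        for col, value in enumerate(record.values()):
            # xlwt cannot store a Python list directly, so join the actor list.
            if isinstance(value, list):
                value = ', '.join(value)
            sheet.write(row, col, value)

def main():
    all_records = []
    for url in urls:
        # assumes get_webData() only scrapes and returns a list of dicts
        all_records.extend(get_webData(url))
    write_to_sheet(sheet, all_records)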