Target URL to scrape: https://music.douban.com/top250
Using the lxml library, scrape the first 10 pages. The information to collect is the song title, performer, genre, release date, rating, and number of reviews, and it should be saved to both a CSV file and an XLS file.
When saving the scraped data to a CSV file, a blank line may appear after every row of data. After looking it up, passing newline='' to open() solves this, but then a new error appears: 'gbk' codec can't encode character '\xb3' in position 1: illegal multibyte sequence. This can be fixed by opening the file with encoding="gb18030".
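The 10 pages follow Douban's pagination scheme: the start query parameter advances by 25 entries per page, so pages 1 through 10 correspond to start=0 through start=225. A quick sketch of generating the URLs:

```python
# Douban's Top 250 list is paginated via a "start" query parameter
# that advances by 25 entries per page, so 10 pages cover 0..225.
urls = [f"https://music.douban.com/top250?start={i}" for i in range(0, 250, 25)]

print(len(urls))   # 10
print(urls[0])     # https://music.douban.com/top250?start=0
print(urls[-1])    # https://music.douban.com/top250?start=225
```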
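A minimal, self-contained sketch of the fix described above (the file name demo.csv is just for illustration): open the CSV with newline='' so no blank line is inserted between rows on Windows, and with encoding="gb18030", a superset of GBK that can encode characters plain GBK cannot.

```python
import csv

# newline='' prevents the blank line between rows on Windows;
# gb18030 is a superset of GBK, so it handles characters that
# raise "'gbk' codec can't encode character" errors.
with open("demo.csv", "w", newline="", encoding="gb18030") as f:
    writer = csv.writer(f)
    writer.writerow(["song", "singer"])
    writer.writerow(["晴天", "周杰倫"])

# Reading it back yields exactly two rows, with no blank line between them.
with open("demo.csv", "r", newline="", encoding="gb18030") as f:
    rows = list(csv.reader(f))

print(rows)  # [['song', 'singer'], ['晴天', '周杰倫']]
```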
Code:
import csv
import time

import requests
import xlwt
from lxml import etree

list_music = []

# Request header so the site treats the script as a normal browser
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}

# Save the scraped content to a CSV file.
# newline='' avoids blank lines between rows; gb18030 avoids the gbk encode error.
f = open(r"D:\Python爬蟲\doubanmusic.csv", "w+", newline="", encoding="gb18030")
writer = csv.writer(f, dialect="excel")
# Write the title row as the first line of the CSV file
writer.writerow(["song", "singer", "time", "liupai", "mark", "comment"])

# Function that scrapes one page of the list
def music_info(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        song = info.xpath('td[2]/div/a/text()')[0].strip()
        # The detail line is a single "/"-separated string: performer / date / ... / genre
        detail = info.xpath('td[2]/div/p/text()')[0]
        singer = detail.split("/")[0].strip()
        times = detail.split("/")[1].strip()
        liupai = detail.split("/")[-1].strip()
        mark = info.xpath('td[2]/div/div/span[2]/text()')[0].strip()
        comment = info.xpath('td[2]/div/div/span[3]/text()')[0].strip().strip("(").strip(")").strip()
        list_info = [song, singer, times, liupai, mark, comment]
        writer.writerow(list_info)
        list_music.append(list_info)
    # Sleep 1 second between pages to avoid sending requests too frequently
    time.sleep(1)

if __name__ == '__main__':
    urls = ['https://music.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
    # Call music_info in a loop to scrape every page
    for url in urls:
        music_info(url)
    # Close the CSV file
    f.close()

    # Save the scraped content to an XLS file
    header = ["song", "singer", "time", "liupai", "mark", "comment"]
    # Create the workbook
    book = xlwt.Workbook(encoding='utf-8')
    # Add the Sheet1 worksheet
    sheet = book.add_sheet('Sheet1')
    for h in range(len(header)):
        sheet.write(0, h, header[h])
    i = 1
    for row in list_music:  # 'row' instead of shadowing the built-in name 'list'
        for j, data in enumerate(row):
            sheet.write(i, j, data)
        i += 1
    # Save the file
    book.save('doubanmusic.xls')
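The detail line under each entry is parsed by splitting on "/" and taking index 0 (performer), index 1 (release date), and index -1 (genre). A sketch with a made-up detail string in the same shape (the string below is illustrative, not real scraped data; real pages may have more segments in between):

```python
# Hypothetical detail string in the "A / B / ... / Z" shape the scraper expects
detail = "陳綺貞 / 2005-09-23 / 專輯 / 民謠"

parts = detail.split("/")
singer = parts[0].strip()   # first segment: performer
times = parts[1].strip()    # second segment: release date
liupai = parts[-1].strip()  # last segment: genre

print(singer, times, liupai)  # 陳綺貞 2005-09-23 民謠
```

Note that split("/") leaves the spaces around each "/" attached to the pieces, which is why stripping each piece matters before writing it to the CSV.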
Partial screenshot of the results: