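Lately I have been tempted into buying SK-II. Being an engineering student with a habit of rigor, I decided to let the data do the talking.
Scraping the data
Parameter breakdown
itemId is the product ID, sellerId is the seller ID, and currentPage is the current page number. The target URL is https://rate.tmall.com/list_detail_rate.htm?itemId=15332134505&spuId=294841&sellerId=917264765&order=3&currentPage=1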
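To make the role of each parameter concrete, here is a minimal sketch (Python 3, standard library only) that assembles the URL for a given page and parses it back. The helper name build_rate_url is hypothetical, and the default values are simply the ones from the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://rate.tmall.com/list_detail_rate.htm"

def build_rate_url(page, item_id=15332134505, spu_id=294841,
                   seller_id=917264765, order=3):
    """Hypothetical helper: assemble the review-list URL for one page."""
    params = [
        ("itemId", item_id),
        ("spuId", spu_id),
        ("sellerId", seller_id),
        ("order", order),
        ("currentPage", page),
    ]
    return BASE + "?" + urlencode(params)

url = build_rate_url(1)
print(url)

# Parse the query string back into a dict to inspect each parameter.
query = parse_qs(urlsplit(url).query)
print(query["currentPage"])
```

Only currentPage changes from one request to the next, so paging through the reviews amounts to calling the helper with 1, 2, 3 and so on.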
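Regex parsing
1. Do not break the cnt string across lines arbitrarily (otherwise you may get: SyntaxError: EOL while scanning string literal).
2. findall(pattern, string) returns every match as a list; when the pattern contains one capture group, each item is the text captured by that group.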
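Both notes can be illustrated with a short sketch (Python 3). The sample string is shortened and partly made up: only the first nickname and date come from the record used in this post. It shows one safe way to spread a long string over several source lines, and what findall returns when the pattern has a single capture group.

```python
import re

# Adjacent string literals inside parentheses are joined into one string,
# so the text can span several source lines without a line break
# inside any single literal.
cnt = (
    '"displayUserNick":"t***凱","rateDate":"2017-09-09 12:15:33",'
    '"displayUserNick":"a***b","rateDate":"2017-09-10 08:00:00"'
)

# With one capture group, findall returns the captured text of every match.
nicknames = re.findall(r'"displayUserNick":"(.*?)"', cnt)
dates = re.findall(r'"rateDate":"(.*?)"', cnt)

print(nicknames)  # ['t***凱', 'a***b']
print(dates)      # ['2017-09-09 12:15:33', '2017-09-10 08:00:00']
```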
    #coding=utf-8
    # Python 2 syntax (print statement)
    import re

    # One review record copied from the response; the string stays on a single line.
    cnt = '"aliMallSeller":False,"anony":True,"appendComment":"","attributes":"","attributesMap":"","aucNumId":"","auctionPicUrl":"","auctionPrice":"","auctionSku":"化妝品凈含量:75ml","auctionTitle":"","buyCount":0,"carServiceLocation":"","cmsSource":"天貓","displayRatePic":"","displayRateSum":0,"displayUserLink":"","displayUserNick":"t***凱","displayUserNumId":"","displayUserRateLink":"","dsr":0.0,"fromMall":True,"fromMemory":0,"gmtCreateTime":1504930533000,"goldUser":False,"id":322848226237,"pics":["//img.alicdn.com/bao/uploaded/i3/2699329812/TB2hr6keQ.HL1JjSZFlXXaiRFXa_!!0-rate.jpg"],"picsSmall":"","position":"920-11-18,20;","rateContent":"送了面膜 和晶瑩水 skii就是A 不錯","rateDate":"2017-09-09 12:15:33","reply":"一次偶然的機會,遇見了親,一次偶然的機會,親選擇了SK-II,生命中有太多的選擇,親的每一次選擇都是一種緣分。讓SK-II與您形影不離,任歲月洗禮而秀美如初~每日清晨拉開窗簾迎來的不僅止破曉曙光,還有嶄新的自己~【SK-II官方旗艦店Lily】","sellerId":917264765,"serviceRateContent":"","structuredRateList":[],"tamllSweetLevel":3,"tmallSweetPic":"tmall-grade-t3-18.png","tradeEndTime":1504847657000,"tradeId":"","useful":True,"userIdEncryption":"","userInfo":"","userVipLevel":0,"userVipPic":""'

    # Reviewer nickname, using a precompiled pattern
    nickname = []
    regex = re.compile('"displayUserNick":"(.*?)"')
    print regex
    nk = re.findall(regex,cnt)
    for i in nk:
        print i
    nickname.extend(nk)
    print nickname

    # Product variant (net content)
    ak = re.findall('"auctionSku":"(.*?)"',cnt)
    for j in ak:
        print j

    # Review text
    rc = re.findall('"rateContent":"(.*?)"',cnt)
    for n in rc:
        print n

    # Review date
    rd = re.findall('"rateDate":"(.*?)"',cnt)
    for m in rd:
        print m
Output: (not preserved in this copy)
Full source code
Reference: http://www.jianshu.com/p/632a3d3b15c2
    #coding=utf-8
    # Python 2 syntax
    import requests
    import re
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    #urls = []
    #for i in list(range(1,500)):
    #    urls.append('https://rate.tmall.com/list_detail_rate.htm?itemId=15332134505&spuId=294841&sellerId=917264765&order=1&currentPage=%s'%i)
    tmpt_url = 'https://rate.tmall.com/list_detail_rate.htm?itemId=15332134505&spuId=294841&sellerId=917264765&order=1&currentPage=%d'
    urllist = [tmpt_url%i for i in range(1,100)]  # pages 1 to 99
    #print urllist

    nickname = []
    auctionSku = []
    ratecontent = []
    ratedate = []
    headers = ''  # not used below

    for url in urllist:
        content = requests.get(url).text
        # findall(pattern, string) returns the matching strings as a list
        nk = re.findall('"displayUserNick":"(.*?)"',content)
        #print nk
        nickname.extend(nk)
        auctionSku.extend(re.findall('"auctionSku":"(.*?)"',content))
        ratecontent.extend(re.findall('"rateContent":"(.*?)"',content))
        ratedate.extend(re.findall('"rateDate":"(.*?)"',content))
        print (nickname,ratedate)

    # Write one comma-separated line per review
    for i in list(range(0,len(nickname))):
        text =','.join((nickname[i],ratedate[i],auctionSku[i],ratecontent[i]))+'\n'
        with open(r"C:\Users\HP\Desktop\codes\DATA\SK-II_TmallContent.csv",'a+') as file:
            file.write(text+' ')
            print("寫入成功")  # "written successfully"
Note: each URL visited in the loop yields more than one regex match, so the results are added with extend rather than append.
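The difference is easy to see on a toy input. In this sketch (Python 3), page_matches is made-up data standing in for the list that findall returns for each page.

```python
# Made-up per-page findall results.
page_matches = [["t***k", "a***b"], ["c***d"]]

with_extend = []
with_append = []
for matches in page_matches:
    with_extend.extend(matches)  # adds each element of the list
    with_append.append(matches)  # adds the list itself as one element

print(with_extend)  # ['t***k', 'a***b', 'c***d']
print(with_append)  # [['t***k', 'a***b'], ['c***d']]
```

With extend the result stays a flat list of nicknames, which is what the index-based loop that writes the file expects.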
Output: (not preserved in this copy)
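One caveat about the script above: it joins the four fields with commas by hand, so a review that itself contains a comma would spill into extra columns. A possible alternative, sketched here in Python 3 under the assumption that the rows are already collected as tuples, is to let the standard csv module do the quoting. The second row is made up, and io.StringIO stands in for a real file.

```python
import csv
import io

# Rows in the order (nickname, date, type, content); the second row is
# made up and its review text contains a comma on purpose.
rows = [
    ("t***凱", "2017-09-09 12:15:33", "化妝品凈含量:75ml", "送了面膜 和晶瑩水 skii就是A 不錯"),
    ("a***b", "2017-09-10 08:00:00", "化妝品凈含量:230ml", "good, will buy again"),
]

# In a real script this would be open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

# Reading the data back yields exactly four fields per row, comma or not.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1])

# For comparison, a hand-made join followed by split gives five fields.
naive_fields = ",".join(rows[1]).split(",")
print(len(naive_fields))
```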
Data analysis
1. To buy or not to buy: review analysis
    # Python 2 syntax
    import pandas as pd
    from pandas import Series,DataFrame
    import jieba
    from collections import Counter

    # encoding='gbk': otherwise the Chinese text comes out garbled.
    # If the CSV has no header row, also pass header=None so the first review is not consumed as the header.
    df = pd.read_csv(r'C:/Users/HP/Desktop/codes/DATA/SK-II_TmallContent.csv',encoding='gbk')
    #print df.columns
    df.columns = ['useName','date','type','content']
    #print df[:10]

    # Same values as Series.as_matrix(df['content']).tolist(), which newer pandas versions removed
    tlist = df['content'].tolist()
    # The float check is required: empty cells are read as NaN (a float) and would break join
    text = [i for i in tlist if type(i)!= float]
    text = ' '.join(text)
    #print text

    wordlist_jieba = jieba.cut(text,cut_all=True)  # full mode
    # Custom Chinese stop-word list; note the entries must be unicode
    stoplist = {}.fromkeys([u'的', u'了', u'是',u'有'])
    print stoplist
    wordlist_jieba = [i for i in wordlist_jieba if i not in stoplist] #and len(i) > 1
    #print u"[全模式]: ", "/ ".join(wordlist_jieba)

    # Count occurrences; Counter stores them as key/value pairs (word -> count)
    count = Counter(wordlist_jieba)
    # key=lambda x: x[1] sorts by the count
    result = sorted(count.items(), key=lambda x: x[1], reverse=True)
    for word in result:
        print word[0], word[1]

    from pyecharts import WordCloud
    data = dict(result[:100])  # the 100 most frequent words
    wordcloud = WordCloud('高頻詞雲',width = 800,height = 600)  # title: "high-frequency word cloud"
    wordcloud.add('ryana',data.keys(),data.values(),word_size_range = [30,300])
    wordcloud  # last expression in a notebook cell, so the chart is displayed
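Output: (word cloud image not preserved in this copy)
好用 ("works well") tops the frequency list. The only puzzle is why some words come out split. The most likely reason is cut_all=True, which selects jieba's full mode: it outputs every word it can find in the text, overlapping pieces included, whereas the default precise mode returns a single segmentation.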
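The counting step can also be written with Counter.most_common, which returns the pairs already sorted by count. The sketch below (Python 3) uses a made-up token list in place of the tokenizer output and reuses the four stop words from the script above.

```python
from collections import Counter

# Made-up token list standing in for the output of the tokenizer.
tokens = ["好用", "的", "好用", "吸收", "了", "好用", "吸收", "正品"]
stopwords = {"的", "了", "是", "有"}  # same four stop words as the script above

count = Counter(t for t in tokens if t not in stopwords)

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
top = count.most_common(2)
print(top)  # [('好用', 3), ('吸收', 2)]
```

Using a set for the stop words keeps each membership test cheap, which matters once the token list covers thousands of reviews.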
2. What to buy: product size analysis
    # Python 2 syntax
    import pandas as pd
    from pandas import Series,DataFrame

    # encoding='gbk': otherwise the Chinese text comes out garbled
    df = pd.read_csv(r'C:/Users/HP/Desktop/codes/DATA/SK-II_TmallContent.csv',encoding='gbk')
    #print df.columns
    df.columns = ['useName','date','type','content']
    print df[:5]

    from pyecharts import Pie
    pie = Pie('凈含量購買分布')  # title: "purchases by net content"
    v = df['type'].tolist()
    print v[:5]

    # Count the reviews for each bottle size
    #n1 = v.count(u'\u5316\u5986\u54c1\u51c0\u542b\u91cf:230ml')
    n1 = v.count(u'化妝品凈含量:75ml')
    n2 = v.count(u'化妝品凈含量:160ml')
    n3 = v.count(u'化妝品凈含量:230ml')
    n4 = v.count(u'化妝品凈含量:330ml')
    #print n1,n2,n3,n4 #800 87 808 124
    N = [n1,n2,n3,n4]
    #print N #[800,87,808,124]

    # Labels: trial size, classic bestseller, loyal-fan favorite, stock-up size
    attr = ['體驗裝','暢銷經典','忠粉摯愛','屯貨之選']
    pie.add('ryana',attr,N,is_label_show = True)
    pie  # last expression in a notebook cell, so the chart is displayed
Output: (pie chart image not preserved in this copy)
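The four separate count calls could also be collapsed into a single pass with Counter. This is only a sketch in Python 3: the types list is a small made-up sample of the type column, not the real data.

```python
from collections import Counter

# Made-up sample of the 'type' column; the real script reads it from the CSV.
types = [
    "化妝品凈含量:75ml",
    "化妝品凈含量:230ml",
    "化妝品凈含量:75ml",
    "化妝品凈含量:330ml",
    "化妝品凈含量:230ml",
    "化妝品凈含量:230ml",
]

sizes = ["75ml", "160ml", "230ml", "330ml"]
counts = Counter(t.split(":")[-1] for t in types)  # key by the size suffix

# Counter returns 0 for sizes that never appear, so the list stays aligned
# with the chart labels.
N = [counts[s] for s in sizes]
print(N)  # [2, 0, 3, 1]
```

The resulting list N is in the same order as the attr labels, so it can be passed to pie.add unchanged.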