Data Analysis Example: The MovieLens 1M Dataset


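The MovieLens 1M dataset contains about 1 million ratings (1,000,209 rows) given by roughly 6,000 users (6,040) to about 4,000 movies (3,883 entries in the movies table). It is split into three tables: ratings, user information, and movie information. All three are stored as .dat files.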


Reading the three data files:

# Each of the three .dat tables can be read into its own pandas DataFrame
# with pandas.read_table.
import pandas as pd
import time

# time.clock() was removed in Python 3.8; time.perf_counter() must be used
# for both the start and the end timestamp.
start = time.perf_counter()

filename1 = r'D:\datasets\users.dat'
filename2 = r'D:\datasets\ratings.dat'
filename3 = r'D:\datasets\movies.dat'

pd.options.display.max_rows = 10

uname = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(filename1, sep='::', header=None, names=uname, engine='python')
print(users.head())
# Age and occupation are given as codes
#    user_id gender  age  occupation    zip
# 0        1      F    1          10  48067
# 1        2      M   56          16  70072
# 2        3      M   25          15  55117
# 3        4      M   45           7  02460
# 4        5      M   25          20  55455
print(users.shape)
# (6040, 5)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(filename2, header=None, sep='::', names=rnames, engine='python')
print(ratings.head())
#    user_id  movie_id  rating  timestamp
# 0        1      1193       5  978300760
# 1        1       661       3  978302109
# 2        1       914       3  978301968
# 3        1      3408       4  978300275
# 4        1      2355       5  978824291
# print(ratings.shape)
# (1000209, 4)

mnames = ['movie_id', 'title', 'genres']  # genres lists the genres a film belongs to
movies = pd.read_table(filename3, header=None, sep='::', names=mnames, engine='python')
# print(movies.head())
#    movie_id                               title                        genres
# 0         1                    Toy Story (1995)   Animation|Children's|Comedy
# 1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
# 2         3             Grumpier Old Men (1995)                Comedy|Romance
# 3         4            Waiting to Exhale (1995)                  Comedy|Drama
# 4         5  Father of the Bride Part II (1995)                        Comedy
# print(movies.shape)
# (3883, 3)
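A minimal sketch of why `engine='python'` is passed when reading these files: pandas treats a separator longer than one character, such as `::`, as a regular expression, which only the Python parsing engine handles. The three sample lines are copied from the users output above. Reading from an in-memory `io.StringIO` buffer is an assumption made so the example runs without the data files, and `dtype={'zip': str}` is an optional addition (not in the original code) that keeps the leading zero of a ZIP code such as 02460 in this small, all-numeric sample.

```python
import io
import pandas as pd

# Three lines in the same '::'-separated layout as users.dat
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# sep='::' is longer than one character, so the Python engine is requested;
# dtype={'zip': str} (an assumption, not in the article) preserves leading zeros
users_sample = pd.read_csv(sample, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})

print(users_sample.shape)
print(users_sample['zip'].tolist())
```

The same keyword arguments should work with `pd.read_table`, which differs from `pd.read_csv` mainly in its default separator.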

Age and occupation are given as codes:

- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
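The code tables above can be turned into readable labels with `Series.map`. The following is a small sketch rather than part of the original analysis: the dictionaries are copied from the lists above, while the names `age_labels`, `occupation_labels`, `users_demo`, `age_group`, and `occupation_name` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Lookup tables copied from the code lists above
age_labels = {1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
              45: '45-49', 50: '50-55', 56: '56+'}
occupation_labels = {
    0: 'other or not specified', 1: 'academic/educator', 2: 'artist',
    3: 'clerical/admin', 4: 'college/grad student', 5: 'customer service',
    6: 'doctor/health care', 7: 'executive/managerial', 8: 'farmer',
    9: 'homemaker', 10: 'K-12 student', 11: 'lawyer', 12: 'programmer',
    13: 'retired', 14: 'sales/marketing', 15: 'scientist',
    16: 'self-employed', 17: 'technician/engineer',
    18: 'tradesman/craftsman', 19: 'unemployed', 20: 'writer',
}

# A tiny stand-in for the users table (first three rows of the output above)
users_demo = pd.DataFrame({'user_id': [1, 2, 3],
                           'age': [1, 56, 25],
                           'occupation': [10, 16, 15]})

# map() replaces each code with its label; unknown codes would become NaN
users_demo['age_group'] = users_demo['age'].map(age_labels)
users_demo['occupation_name'] = users_demo['occupation'].map(occupation_labels)

print(users_demo)
```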

Merging the three tables with the merge function

# Merge the three tables with merge
data = pd.merge(pd.merge(ratings, users), movies)
# print(data.head())
#    user_id  movie_id  rating  timestamp gender  age  occupation    zip  \..
# 0        1      1193       5  978300760      F    1          10  48067
# 1        2      1193       5  978298413      M   56          16  70072
# 2       12      1193       4  978220179      M   25          12  32793
# 3       15      1193       4  978199279      M   25           7  22903
# 4       17      1193       5  978158471      M   50           1  95350
# print(data.iloc[0])
# user_id                                            1
# movie_id                                        1193
# rating                                             5
# timestamp                                  978300760
# gender                                             F
# age                                                1
# occupation                                        10
# zip                                            48067
# title         One Flew Over the Cuckoo's Nest (1975)
# genres                                         Drama
# Name: 0, dtype: object
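When `pd.merge` is called without `on`, it joins on the columns the two frames have in common, so ratings and users are joined on `user_id` and the result is joined with movies on `movie_id`. The sketch below spells the keys out explicitly on toy data; the rows in `ratings_demo`, `users_demo`, and `movies_demo` are illustrative stand-ins, not records taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the three tables (illustrative rows only)
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [1193, 1, 1193],
                             'rating': [5, 4, 5]})
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_demo = pd.DataFrame({'movie_id': [1193, 1],
                            'title': ["One Flew Over the Cuckoo's Nest (1975)",
                                      'Toy Story (1995)']})

# Implicit keys: merge joins on the column names shared by both frames
implicit = pd.merge(pd.merge(ratings_demo, users_demo), movies_demo)

# Explicit keys: same join, but the intent is visible in the code
explicit = (ratings_demo
            .merge(users_demo, on='user_id')
            .merge(movies_demo, on='movie_id'))

print(implicit)
print(implicit.equals(explicit))
```

Naming the keys explicitly is a design choice that guards against an accidental join on an unrelated column that happens to share a name.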

Using a pivot table to compute each movie's mean rating by gender

# index sets the row labels, values is the column being aggregated, and
# columns names one or more columns whose values become the output columns
mean_ratings = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
# print(mean_ratings.head())
# gender                                F         M
# title
# $1,000,000 Duck (1971)         3.375000  2.761905
# 'Night Mother (1986)           3.388889  3.352941
# 'Til There Was You (1997)      2.675676  2.733333
# 'burbs, The (1989)             2.793478  2.962085
# ...And Justice for All (1979)  3.828571  3.689024
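To make the pivot table step concrete, here is a small sketch on made-up data (the titles `A` and `B` and their ratings are hypothetical). It also shows that, for this use, `pivot_table` gives the same numbers as grouping by both columns and unstacking the gender level.

```python
import pandas as pd

# Hypothetical ratings: movie 'B' has no male ratings
demo = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, cells hold the mean rating
pivoted = demo.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Equivalent formulation: group by both keys, then move gender to the columns
grouped = demo.groupby(['title', 'gender'])['rating'].mean().unstack()

print(pivoted)
```

A title/gender combination with no ratings shows up as NaN, which is one reason the analysis restricts itself to movies with enough ratings.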

Analyzing a filtered subset of the data

# Filter out movies with fewer than 250 ratings
ratings_by_title = data.groupby('title').size()
print(ratings_by_title[:3])
# title
# $1,000,000 Duck (1971)       37
# 'Night Mother (1986)         70
# 'Til There Was You (1997)    52
# dtype: int64

active_titles = ratings_by_title.index[ratings_by_title >= 250]  # titles with at least 250 ratings
print(active_titles[:3])
# Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
#        '101 Dalmatians (1961)'],
#       dtype='object', name='title')

# Use active_titles as an index to select the matching rows of mean_ratings
mean_ratings = mean_ratings.loc[active_titles]
print(mean_ratings[:5])
# gender                                    F         M
# title
# 'burbs, The (1989)                 2.793478  2.962085
# 10 Things I Hate About You (1999)  3.646552  3.311966
# 101 Dalmatians (1961)              3.791444  3.500000
# 101 Dalmatians (1996)              3.240000  2.911215
# 12 Angry Men (1957)                4.184397  4.328421

# To see the movies female viewers rated highest, sort by the F column in descending order
top_ratings = mean_ratings.sort_values(by="F", ascending=False)
print(top_ratings[:3])
# gender                                                 F         M
# title
# Close Shave, A (1995)                           4.644444  4.473795
# Wrong Trousers, The (1993)                      4.588235  4.478261
# Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)   4.572650  4.464589

# Find the movies on which male and female viewers disagree most
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sort_by_diff = mean_ratings.sort_values(by='diff')
print(sort_by_diff[:3])
# gender                            F         M      diff
# title
# Dirty Dancing (1987)       3.790378  2.959596 -0.830782
# Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
# Grease (1978)              3.975265  3.367041 -0.608224

# Reverse the row order and take the first 3 rows: movies that male viewers
# rated higher than female viewers did
print(sort_by_diff[::-1][:3])
# gender                                         F         M      diff
# title
# Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
# Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
# Dumb & Dumber (1994)                    2.697987  3.336595  0.638608

# Compute the standard deviation of the ratings to find the movies that
# divide viewers most, regardless of gender
rating_std = data.groupby('title')['rating'].std()
rating_std = rating_std.loc[active_titles]
print(rating_std.sort_values(ascending=False)[:3])
# title
# Dumb & Dumber (1994)               1.321333
# Blair Witch Project, The (1999)    1.316368
# Natural Born Killers (1994)        1.307198
# Name: rating, dtype: float64

# time.clock() was removed in Python 3.8; `start` must also come from
# time.perf_counter() for the difference to be meaningful.
end = time.perf_counter()
spending_time = end - start
print('Elapsed time: %.2fs' % spending_time)
# Elapsed time: 11.13s
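The filtering, gender-gap, and standard-deviation steps can be checked by hand on a toy table. The sketch below is only an illustration: the titles `A`, `B`, `C` and their ratings are made up, and `min_ratings = 3` is a hypothetical threshold standing in for the 250 used in the analysis.

```python
import pandas as pd

# Made-up ratings: 'B' has too few ratings and should be filtered out
demo = pd.DataFrame({
    'title':  ['A'] * 4 + ['B'] * 2 + ['C'] * 4,
    'gender': ['F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'M'],
    'rating': [5, 5, 3, 3, 4, 4, 2, 4, 4, 4],
})

min_ratings = 3  # hypothetical threshold for the toy data (the article uses 250)

# Step 1: keep only titles with at least min_ratings ratings
counts = demo.groupby('title').size()
active = counts.index[counts >= min_ratings]

# Step 2: mean rating per title and gender, restricted to the active titles
means = demo.pivot_table(values='rating', index='title',
                         columns='gender', aggfunc='mean').loc[active]

# Step 3: negative diff = rated higher by women, positive = higher by men
means['diff'] = means['M'] - means['F']
by_diff = means.sort_values(by='diff')

# Step 4: gender-independent disagreement via the sample standard deviation
# (pandas' std() uses ddof=1 by default)
spread = demo.groupby('title')['rating'].std().loc[active]

print(by_diff)
print(spread.sort_values(ascending=False))
```

Because `diff` is signed, sorting ascending surfaces the titles women preferred and reversing the order surfaces the titles men preferred, which mirrors the `sort_by_diff[::-1]` step above.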

 

