Machine Learning Engineer - Udacity Project 0: Predict Your Next World Cuisine


Step 1. Download and import the data

1.1 Dataset: https://www.kaggle.com/c/whats-cooking/data

1.2 Load the data

# Import dependencies
import json
import codecs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the datasets
train_filename = 'train.json'
train_content = pd.read_json(codecs.open(train_filename, mode='r', encoding='utf-8'))
test_filename = 'test.json'
test_content = pd.read_json(codecs.open(test_filename, mode='r', encoding='utf-8'))

# Print the number of loaded samples
print("The cuisine dataset contains {} training samples and {} test samples.\n".format(len(train_content), len(test_content)))
if len(train_content) == 39774 and len(test_content) == 9944:
    print("Data loaded successfully!")
else:
    print("There was a problem loading the data; please check the file paths!")

The cuisine dataset contains 39774 training samples and 9944 test samples.
Data loaded successfully!

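1.3 Preview the data
To see how the dataset is distributed and how many cuisines it covers, we print a few sample records.

pd.set_option('display.max_colwidth',120)

Programming exercise
Use the head() function to preview the training set train_content (print the first 5 rows).

### TODO: Print the first 5 samples of train_content to preview the data
print(train_content.head())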

cuisine id \
0 greek 10259
1 southern_us 25693
2 filipino 20130
3 indian 22213
4 indian 13162

ingredients
0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...

## List all cuisine categories
categories = np.unique(train_content['cuisine'])
print("There are {} cuisines in total:\n{}".format(len(categories), categories))

There are 20 cuisines in total:
['brazilian' 'british' 'cajun_creole' 'chinese' 'filipino' 'french' 'greek'
'indian' 'irish' 'italian' 'jamaican' 'japanese' 'korean' 'mexican'
'moroccan' 'russian' 'southern_us' 'spanish' 'thai' 'vietnamese']
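
The list above shows which cuisines exist but not how the samples are spread across them. A minimal sketch of how that distribution could be inspected with value_counts() follows; toy_content is a hypothetical miniature stand-in for train_content, not the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for train_content; the real data comes from train.json.
toy_content = pd.DataFrame({
    "cuisine": ["italian", "mexican", "italian", "indian", "italian", "mexican"],
})

# value_counts() sorts cuisines from most to least frequent.
cuisine_counts = toy_content["cuisine"].value_counts()
# normalize=True reports each cuisine's share of the rows instead of raw counts.
cuisine_share = toy_content["cuisine"].value_counts(normalize=True)

print(cuisine_counts)
print(cuisine_share.round(2))
```

Running the same two calls on train_content['cuisine'] would show whether some cuisines dominate the training set, which is worth knowing before reading an overall accuracy figure.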

 

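Step 2. Analyze the data
Because the final goal of this project is to build a model that predicts world cuisines, we need to split the dataset into features and a target variable.

Feature: 'ingredients', the names of the ingredients each dish contains.
Target variable: 'cuisine', the cuisine category we want to predict.
They are stored in the variables train_ingredients and train_targets respectively.

Programming exercise: data extraction
Assign the ingredients column of train_content to train_ingredients
Assign the cuisine column of train_content to train_targets

### TODO: Assign the features and the target variable
train_ingredients = train_content['ingredients']
train_targets = train_content['cuisine']

### TODO: Print the results to check the assignment
print(train_ingredients)
print(train_targets)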

0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...
5 [plain flour, sugar, butter, eggs, fresh ginger root, salt, ground cinnamon, milk, vanilla extract, ground ginger, p...
6 [olive oil, salt, medium shrimp, pepper, garlic, chopped cilantro, jalapeno chilies, flat leaf parsley, skirt steak,...
7 [sugar, pistachio nuts, white almond bark, flour, vanilla extract, olive oil, almond extract, eggs, baking powder, d...
8 [olive oil, purple onion, fresh pineapple, pork, poblano peppers, corn tortillas, cheddar cheese, ground black peppe...
9 [chopped tomatoes, fresh basil, garlic, extra-virgin olive oil, kosher salt, flat leaf parsley]
10 [pimentos, sweet pepper, dried oregano, olive oil, garlic, sharp cheddar cheese, pepper, swiss cheese, provolone che...
11 [low sodium soy sauce, fresh ginger, dry mustard, green beans, white pepper, sesame oil, scallions, canola oil, suga...
12 [Italian parsley leaves, walnuts, hot red pepper flakes, extra-virgin olive oil, fresh lemon juice, trout fillet, ga...
13 [ground cinnamon, fresh cilantro, chili powder, ground coriander, kosher salt, ground black pepper, garlic, plum tom...
14 [fresh parmesan cheese, butter, all-purpose flour, fat free less sodium chicken broth, chopped fresh chives, gruyere...
15 [tumeric, vegetable stock, tomatoes, garam masala, naan, red lentils, red chili peppers, onions, spinach, sweet pota...
16 [greek yogurt, lemon curd, confectioners sugar, raspberries]
17 [italian seasoning, broiler-fryer chicken, mayonaise, zesty italian dressing]
18 [sugar, hot chili, asian fish sauce, lime juice]
19 [soy sauce, vegetable oil, red bell pepper, chicken broth, yellow squash, garlic chili sauce, sliced green onions, b...
20 [pork loin, roasted peanuts, chopped cilantro fresh, hoisin sauce, creamy peanut butter, chopped fresh mint, thai ba...
21 [roma tomatoes, kosher salt, purple onion, jalapeno chilies, lime, chopped cilantro]
22 [low-fat mayonnaise, pepper, salt, baking potatoes, eggs, spicy brown mustard]
23 [sesame seeds, red pepper, yellow peppers, water, extra firm tofu, broccoli, soy sauce, orange bell pepper, arrowroo...
24 [marinara sauce, flat leaf parsley, olive oil, linguine, capers, crushed red pepper flakes, olives, lemon zest, garlic]
25 [sugar, lo mein noodles, salt, chicken broth, light soy sauce, flank steak, beansprouts, dried black mushrooms, pepp...
26 [herbs, lemon juice, fresh tomatoes, paprika, mango, stock, chile pepper, onions, red chili peppers, oil]
27 [ground black pepper, butter, sliced mushrooms, sherry, salt, grated parmesan cheese, heavy cream, spaghetti, chicke...
28 [green bell pepper, egg roll wrappers, sweet and sour sauce, corn starch, molasses, vegetable oil, oil, soy sauce, s...
29 [flour tortillas, cheese, breakfast sausages, large eggs]
...
39744 [extra-virgin olive oil, oregano, potatoes, garlic cloves, pepper, salt, yellow mustard, fresh lemon juice]
39745 [quinoa, extra-virgin olive oil, fresh thyme leaves, scallion greens]
39746 [clove, bay leaves, ginger, chopped cilantro, ground turmeric, white onion, cinnamon, cardamom pods, serrano chile, ...
39747 [water, sugar, grated lemon zest, butter, pitted date, blanched almonds]
39748 [sea salt, pizza doughs, all-purpose flour, cornmeal, extra-virgin olive oil, shredded mozzarella cheese, kosher sal...
39749 [kosher salt, minced onion, tortilla chips, sugar, tomato juice, cilantro leaves, avocado, lime juice, roma tomatoes...
39750 [ground black pepper, chicken breasts, salsa, cheddar cheese, pepper jack, heavy cream, red enchilada sauce, unsalte...
39751 [olive oil, cayenne pepper, chopped cilantro fresh, boneless chicken skinless thigh, fine sea salt, low salt chicken...
39752 [self rising flour, milk, white sugar, butter, peaches in light syrup]
39753 [rosemary sprigs, lemon zest, garlic cloves, ground black pepper, vegetable broth, fresh basil leaves, minced garlic...
39754 [jasmine rice, bay leaves, sticky rice, rotisserie chicken, chopped cilantro, large eggs, vegetable oil, yellow onio...
39755 [mint leaves, cilantro leaves, ghee, tomatoes, cinnamon, oil, basmati rice, garlic paste, salt, coconut milk, clove,...
39756 [vegetable oil, cinnamon sticks, water, all-purpose flour, piloncillo, salt, orange zest, baking powder, hot water]
39757 [red bell pepper, garlic cloves, extra-virgin olive oil, feta cheese crumbles]
39758 [milk, salt, ground cayenne pepper, ground lamb, ground cinnamon, ground black pepper, pomegranate, chopped fresh mi...
39759 [red chili peppers, sea salt, onions, water, chilli bean sauce, caster sugar, garlic, white vinegar, chili oil, cucu...
39760 [butter, large eggs, cornmeal, baking powder, boiling water, milk, salt]
39761 [honey, chicken breast halves, cilantro leaves, carrots, soy sauce, Sriracha, wonton wrappers, freshly ground pepper...
39762 [curry powder, salt, chicken, water, vegetable oil, basmati rice, eggs, finely chopped onion, lemon juice, pepper, m...
39763 [fettuccine pasta, low-fat cream cheese, garlic, nonfat evaporated milk, grated parmesan cheese, corn starch, nonfat...
39764 [chili powder, worcestershire sauce, celery, red kidney beans, lean ground beef, stewed tomatoes, dried parsley, pep...
39765 [coconut, unsweetened coconut milk, mint leaves, plain yogurt]
39766 [rutabaga, ham, thick-cut bacon, potatoes, fresh parsley, salt, onions, pepper, carrots, pork sausages]
39767 [low-fat sour cream, grated parmesan cheese, salt, dried oregano, low-fat cottage cheese, butter, onions, olive oil,...
39768 [shredded cheddar cheese, crushed cheese crackers, cheddar cheese soup, cream of chicken soup, hot sauce, diced gree...
39769 [light brown sugar, granulated sugar, butter, warm water, large eggs, all-purpose flour, whole wheat flour, cooking ...
39770 [KRAFT Zesty Italian Dressing, purple onion, broccoli florets, rotini, pitted black olives, Kraft Grated Parmesan Ch...
39771 [eggs, citrus fruit, raisins, sourdough starter, flour, hot tea, sugar, ground nutmeg, salt, ground cinnamon, milk, ...
39772 [boneless chicken skinless thigh, minced garlic, steamed white rice, baking powder, corn starch, dark soy sauce, kos...
39773 [green chile, jalapeno chilies, onions, ground black pepper, salt, chopped cilantro fresh, green bell pepper, garlic...
Name: ingredients, Length: 39774, dtype: object
0 greek
1 southern_us
2 filipino
3 indian
4 indian
5 jamaican
6 spanish
7 italian
8 mexican
9 italian
10 italian
11 chinese
12 italian
13 mexican
14 italian
15 indian
16 british
17 italian
18 thai
19 vietnamese
20 thai
21 mexican
22 southern_us
23 chinese
24 italian
25 chinese
26 cajun_creole
27 italian
28 chinese
29 mexican
...
39744 greek
39745 spanish
39746 indian
39747 moroccan
39748 italian
39749 mexican
39750 mexican
39751 moroccan
39752 southern_us
39753 italian
39754 vietnamese
39755 indian
39756 mexican
39757 greek
39758 greek
39759 korean
39760 southern_us
39761 chinese
39762 indian
39763 italian
39764 mexican
39765 indian
39766 irish
39767 italian
39768 mexican
39769 irish
39770 italian
39771 irish
39772 chinese
39773 mexican
Name: cuisine, Length: 39774, dtype: object

Programming exercise: basic statistics
Which 10 ingredients are used most often?
Which 10 ingredients are most common in Italian dishes?

## TODO: Count ingredient occurrences and store them in the sum_ingredients dictionary
m = []
for i in range(len(train_ingredients)):
    m += train_ingredients[i]
sum_ingredients = pd.Series(m).value_counts().to_dict()

or:

from collections import defaultdict
sum_ingredients = defaultdict(int)
for row in train_ingredients:
    for item in row:
        sum_ingredients[item] += 1
sum_ingredients = dict(sum_ingredients)

# Finally, plot the 10 most used ingredients
plt.style.use(u'ggplot')
fig = pd.DataFrame(sum_ingredients, index=[0]).transpose()[0].sort_values(ascending=False, inplace=False)[:10].plot(kind='barh')
fig.invert_yaxis()
fig = fig.get_figure()
fig.tight_layout()

## TODO: Count ingredient occurrences in Italian dishes and store them in the italian_ingredients dictionary
list_italian = train_content.loc[train_content['cuisine'].isin(['italian'])]['ingredients'].reset_index(drop=True)
n = []
for j in range(len(list_italian)):
    n += list_italian[j]
italian_ingredients = pd.Series(n).value_counts().to_dict()

or:

italian_ingredients = {}  # start from an empty dictionary before counting
cuisine_ingredients = zip(train_targets, train_ingredients)
for cuisine, ingredients in cuisine_ingredients:
    if cuisine == 'italian':
        for item in ingredients:
            if item in italian_ingredients:
                italian_ingredients[item] += 1
            else:
                italian_ingredients[item] = 1
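
Both questions in this exercise ask for a "top 10", which the dictionaries above do not answer until they are sorted. As a hedged alternative, the standard library's collections.Counter can count and rank in one go. The sketch below uses hypothetical toy recipes in place of train_ingredients.

```python
from collections import Counter

# Hypothetical toy recipes standing in for train_ingredients.
toy_ingredients = [
    ["salt", "olive oil", "garlic"],
    ["salt", "garlic", "butter"],
    ["salt", "olive oil"],
    ["salt", "sugar", "garlic"],
]

# Counter.update() adds one count per ingredient in each recipe.
ingredient_counts = Counter()
for row in toy_ingredients:
    ingredient_counts.update(row)

# most_common(n) returns the n highest counts as (ingredient, count) pairs.
top_three = ingredient_counts.most_common(3)
print(top_three)
```

On the real data, most_common(10) would give the ten most used ingredients directly, and dict(ingredient_counts) converts the result back to a plain dictionary if one is required.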

 

Step 3. Build the model

3.1 Word cleaning
Dishes contain many ingredients, and the same ingredient can appear in singular or plural form, in different tenses, and so on. To remove these differences, we normalize the ingredients.

import re
from nltk.stem import WordNetLemmatizer
import numpy as np

def text_clean(ingredients):
    # Strip punctuation, keeping only the letters a..z and A..Z
    ingredients = np.array(ingredients).tolist()
    print("Ingredients:\n{}".format(ingredients[9]))
    ingredients = [[re.sub('[^A-Za-z]', ' ', word) for word in component] for component in ingredients]
    print("After removing punctuation:\n{}".format(ingredients[9]))

    # Reduce each word to its lemma. lemmatize() defaults to the noun part of speech,
    # so plurals are normalized while verb forms such as 'chopped' stay unchanged.
    lemma = WordNetLemmatizer()
    ingredients = [" ".join([" ".join([lemma.lemmatize(w) for w in words.split(" ")]) for words in component]) for component in ingredients]
    print("After removing tenses and plurals:\n{}".format(ingredients[9]))
    return ingredients

print("\nProcessing the training set...")
train_ingredients = text_clean(train_content['ingredients'])
print("\nProcessing the test set...")
test_ingredients = text_clean(test_content['ingredients'])

Processing the training set...
Ingredients:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra-virgin olive oil', 'kosher salt', 'flat leaf parsley']
After removing punctuation:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra virgin olive oil', 'kosher salt', 'flat leaf parsley']
After removing tenses and plurals:
chopped tomato fresh basil garlic extra virgin olive oil kosher salt flat leaf parsley

Processing the test set...
Ingredients:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After removing punctuation:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After removing tenses and plurals:
egg cherry date dark muscovado sugar ground cinnamon mixed spice cake vanilla extract self raising flour sultana rum raisin prune glace cherry butter port
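
The punctuation step is easy to misread: the pattern [^A-Za-z] replaces every character that is not an ASCII letter, so digits and accented letters become spaces too. The sketch below isolates that step with only the standard library. The helper names and sample strings are hypothetical, and collapse_spaces is an optional extra that the original code does not perform.

```python
import re

def strip_non_letters(name):
    """Replace every character outside a-z/A-Z with a space, using the same pattern as text_clean."""
    return re.sub('[^A-Za-z]', ' ', name)

def collapse_spaces(name):
    """Hypothetical extra step (not in the original): squeeze repeated spaces."""
    return " ".join(name.split())

samples = ["extra-virgin olive oil", "2% reduced-fat milk", "jalape\u00f1o chilies"]
for raw in samples:
    stripped = strip_non_letters(raw)
    print(repr(raw), "->", repr(stripped), "->", repr(collapse_spaces(stripped)))
```

Note that an accented name gets split into two fragments by this pattern. Whether that matters for accuracy is not something the original measures; it is simply a side effect to be aware of.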

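3.2 Feature extraction
In this step we convert each dish's ingredients into a numeric feature vector. Since the vast majority of dishes contain salt, water, sugar, butter and the like, vectors extracted with a one-hot encoding would not separate the cuisines well. We therefore weight each ingredient by how often it occurs: the more often an ingredient appears across dishes, the less discriminative it is. The feature we use is TF-IDF; for an introduction, see the article "TF-IDF與余弦相似性的應用(一):自動提取關鍵詞" (Applications of TF-IDF and Cosine Similarity, Part 1: Automatic Keyword Extraction).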
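
To make the weighting idea concrete, here is a small hand computation of the IDF part on a hypothetical toy corpus. It uses the smoothed formula ln((1 + n) / (1 + df)) + 1, which is, as far as I know, what scikit-learn's TfidfVectorizer applies with its default smooth_idf=True. The vectorizer additionally multiplies by term frequency and normalizes each row, which this sketch leaves out.

```python
import math

# Hypothetical toy corpus: each "document" is one recipe's ingredient list.
toy_recipes = [
    ["salt", "feta", "olive"],
    ["salt", "butter", "flour"],
    ["salt", "soy", "ginger"],
    ["salt", "butter", "sugar"],
]

def smoothed_idf(term, recipes):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1, where df counts recipes containing term."""
    n = len(recipes)
    df = sum(1 for recipe in recipes if term in recipe)
    return math.log((1 + n) / (1 + df)) + 1

for term in ["salt", "butter", "feta"]:
    print(term, round(smoothed_idf(term, toy_recipes), 3))
```

An ingredient that appears in every recipe (salt) receives the minimum weight of 1.0, while a rare one (feta) is weighted more heavily, which matches the intuition described above.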

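from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the ingredients into feature vectors

# Process the training set
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), analyzer='word', max_df=.57, binary=False, token_pattern=r"\w+", sublinear_tf=False)
# toarray() returns a plain ndarray; todense() returns an np.matrix,
# which recent scikit-learn versions no longer accept as estimator input.
train_tfidf = vectorizer.fit_transform(train_ingredients).toarray()

## Process the test set (reuse the vocabulary fitted on the training set)
test_tfidf = vectorizer.transform(test_ingredients)

train_targets = np.array(train_content['cuisine']).tolist()
train_targets[:10]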

['greek',
'southern_us',
'filipino',
'indian',
'indian',
'jamaican',
'spanish',
'italian',
'mexican',
'italian']

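Programming exercise
To keep errors accumulated in earlier steps from breaking the steps that follow, we check here that the processed data is correct. Print the first five entries of train_tfidf and train_targets.

# Preview the first five entries of train_tfidf and train_targets by slicing
# (neither is a DataFrame, so head() is not available)
print(train_tfidf[:5])
print(train_targets[:5])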

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
['greek', 'southern_us', 'filipino', 'indian', 'indian']

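3.3 Validation split
To get a rough estimate of the model's accuracy during experiments, we hold out 20% of the original training data (train_tfidf and train_targets) as a validation set (X_valid and y_valid).

Programming exercise: splitting and shuffling the data
Call train_test_split to split the training set into a new training set and a validation set, so that model accuracy can be monitored later.

Import train_test_split from sklearn.model_selection
Pass train_tfidf and train_targets to train_test_split as inputs
Set test_size to 0.2 to hold out 20% as the validation set, leaving 80% as the new training set.
Set the random_state seed so that every run produces the same split. (With a fixed seed, the generated random sequence is deterministic.)

### TODO: Split off the validation set
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train_tfidf, train_targets, test_size=0.2, random_state=0)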
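
The effect of random_state can be checked directly. The sketch below, on hypothetical toy data, calls train_test_split twice with the same seed and compares the results; X_toy and y_toy are illustrative names only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each, plus labels.
X_toy = [[i] for i in range(10)]
y_toy = ["a", "b"] * 5

# Two calls with the same random_state produce the same shuffled split.
first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

X_tr, X_va, y_tr, y_va = first
print(len(X_tr), len(X_va))  # sizes of the 80% and 20% parts
print(first == second)       # identical splits for identical seeds
```

Omitting random_state would generally give a different split on each run, which makes validation scores harder to compare between experiments.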

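3.4 Build the model
Use the logistic regression model (LogisticRegression) from sklearn.

Programming exercise: train the model

Import LogisticRegression from sklearn.linear_model
Import GridSearchCV from sklearn.model_selection. It searches parameters automatically: given a set of candidate values, it evaluates each one with cross-validation and reports the best score and parameters. Because the model is retrained for every candidate, this approach suits small datasets.
Define the parameters variable: a dictionary for the C parameter whose value is the list of integers from 1 to 10;
Define the classifier variable: create a classifier using the imported LogisticRegression;
Define the grid variable: create a grid search object using the imported GridSearchCV, passing 'classifier' and 'parameters' to its constructor;

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

## TODO: Build the logistic regression model
parameters = {'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
classifier = LogisticRegression()
grid = GridSearchCV(classifier, parameters)
grid = grid.fit(X_train, y_train)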
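
The code above fits the grid search but never looks at which C value won. A minimal sketch of how the fitted object could be inspected follows. It uses the small iris dataset bundled with scikit-learn purely as a stand-in so that it runs quickly; the reduced grid, cv=5, and max_iter=1000 are assumptions for the illustration, not settings from the original.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data (an assumption): the article itself trains on TF-IDF features.
X_demo, y_demo = load_iris(return_X_y=True)

demo_parameters = {'C': [1, 5, 10]}
# max_iter is raised (an assumption, not in the original) to avoid convergence warnings.
demo_grid = GridSearchCV(LogisticRegression(max_iter=1000), demo_parameters, cv=5)
demo_grid.fit(X_demo, y_demo)

# After fitting, the search object exposes the winning setting and its mean CV accuracy.
print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))
# cv_results_ holds the mean test score of every candidate, in grid order.
print(demo_grid.cv_results_['mean_test_score'])
```

Applied to the article's grid object, grid.best_params_ and grid.best_score_ would show which value of C was selected and how it scored during cross-validation.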

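Once training has finished, we compute the model's predictions on the validation set X_valid and measure the prediction accuracy by comparing them with y_valid one by one.

from sklearn.metrics import accuracy_score

## Compute the model's accuracy
valid_predict = grid.predict(X_valid)
valid_score = accuracy_score(y_valid, valid_predict)
print("Score on the validation set: {}".format(valid_score))

Score on the validation set: 0.7967316153362665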
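
The element-by-element comparison behind this score is simple enough to write out by hand. The sketch below does so on hypothetical labels and predictions; toy_true and toy_pred stand in for y_valid and valid_predict.

```python
# Hypothetical labels and predictions, standing in for y_valid and valid_predict.
toy_true = ["italian", "mexican", "indian", "italian"]
toy_pred = ["italian", "mexican", "thai", "italian"]

# Accuracy is the fraction of positions where prediction and label agree.
matches = sum(1 for truth, guess in zip(toy_true, toy_pred) if truth == guess)
toy_accuracy = matches / len(toy_true)
print(toy_accuracy)
```

Three of the four toy predictions match, so the accuracy is 0.75. The validation score above has the same meaning: roughly 80% of the held-out dishes were assigned the correct cuisine.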

 

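Step 4. Model prediction (optional)

4.1 Predict on the test set

Programming exercise
Use the model grid to predict on the test set test_tfidf, then inspect the predictions.

### TODO: Predict on the test set
predictions = grid.predict(test_tfidf)
print("Number of test samples predicted: {}".format(len(predictions)))
test_content['cuisine'] = predictions
test_content.head(10)

Number of test samples predicted: 9944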

4.2 Submit the results

## Load the submission format
submit_frame = pd.read_csv("sample_submission.csv")

## Save the results
result = pd.merge(submit_frame, test_content, on="id", how='left')
result = result.rename(index=str, columns={"cuisine_y": "cuisine"})
test_result_name = "tfidf_cuisine_test.csv"
result[['id', 'cuisine']].to_csv(test_result_name, index=False)
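
The rename of cuisine_y may look arbitrary. It follows from how pandas handles a merge when both frames share a non-key column, which presumably is the case here because the sample submission also carries a cuisine column. The sketch below reproduces this on hypothetical miniature frames.

```python
import pandas as pd

# Hypothetical miniature versions of sample_submission.csv and test_content.
toy_submit = pd.DataFrame({"id": [3, 1, 2], "cuisine": ["italian"] * 3})
toy_test = pd.DataFrame({"id": [1, 2, 3], "cuisine": ["thai", "greek", "indian"]})

# Both frames have a 'cuisine' column, so merge() disambiguates them with
# its default suffixes: '_x' for the left frame, '_y' for the right one.
toy_result = pd.merge(toy_submit, toy_test, on="id", how="left")
merged_columns = list(toy_result.columns)
print(merged_columns)

# Keep the right-hand (predicted) column under the plain name.
toy_result = toy_result.rename(columns={"cuisine_y": "cuisine"})
print(toy_result[["id", "cuisine"]])
```

Because how='left' keeps the row order of the left frame, the output follows the id order of the submission template rather than that of the test set.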

 

