Machine Learning Engineer - Udacity Project 0: Predict Your Next World Cuisine

Reposted from the original article, 2018-11-08 21:06. Category: Machine Learning Engineer - Udacity

Step 1. Download and import the data

1.1 Dataset: https://www.kaggle.com/c/whats-cooking/data

1.2 Load the data

# Import dependencies
import json
import codecs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the datasets
train_filename = 'train.json'
train_content = pd.read_json(codecs.open(train_filename, mode='r', encoding='utf-8'))

test_filename = 'test.json'
test_content = pd.read_json(codecs.open(test_filename, mode='r', encoding='utf-8'))

# Print the size of the loaded datasets
print("The cuisine dataset contains {} training samples and {} test samples.\n".format(len(train_content), len(test_content)))
if len(train_content) == 39774 and len(test_content) == 9944:
    print("Data loaded successfully!")
else:
    print("There was a problem loading the data; please check the file paths!")

The cuisine dataset contains 39774 training samples and 9944 test samples.
Data loaded successfully!

1.3 Preview the data
To see how the dataset is distributed and how many cuisine categories it contains, we print a few sample rows.

pd.set_option('display.max_colwidth',120)

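Programming exercise
Use the head() function to preview the training set train_content (print the first 5 rows).

### TODO: print the first 5 samples in train_content to preview the data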
print(train_content.head())

cuisine id \
0 greek 10259
1 southern_us 25693
2 filipino 20130
3 indian 22213
4 indian 13162

ingredients
0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...

## List all cuisine categories
categories = np.unique(train_content['cuisine'])
print("There are {} cuisines in total:\n{}".format(len(categories), categories))

There are 20 cuisines in total:
['brazilian' 'british' 'cajun_creole' 'chinese' 'filipino' 'french' 'greek'
'indian' 'irish' 'italian' 'jamaican' 'japanese' 'korean' 'mexican'
'moroccan' 'russian' 'southern_us' 'spanish' 'thai' 'vietnamese']
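
The list above names the categories but does not show how many recipes fall into each one. A minimal sketch, assuming a small hand-made DataFrame in place of train_content (the rows and the name toy_content are illustrative, not taken from the dataset), shows how value_counts() could be used to get per-cuisine counts:

```python
import pandas as pd

# Hypothetical stand-in for train_content; the real data comes from train.json.
toy_content = pd.DataFrame({
    "cuisine": ["italian", "indian", "italian", "greek", "italian", "indian"],
    "id": [1, 2, 3, 4, 5, 6],
})

# value_counts() returns the number of rows per category, most frequent first.
cuisine_counts = toy_content["cuisine"].value_counts()
print(cuisine_counts)
```

Running the same call on train_content['cuisine'] would show whether some cuisines dominate the training data, which is worth knowing before interpreting an overall accuracy score.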

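Step 2. Analyze the data
Since the ultimate goal of this project is to build a model that predicts world cuisines, we need to split the dataset into features and target variables.

Feature: 'ingredients', which gives the names of the ingredients each dish contains.
Target variable: 'cuisine', the cuisine category we want to predict.
They are stored in the variables train_ingredients and train_targets, respectively.

Programming exercise: data extraction
Assign the ingredients column of train_content to train_ingredients
Assign the cuisine column of train_content to train_targets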

### TODO: assign the features and the target variable separately
train_ingredients = train_content['ingredients']
train_targets = train_content['cuisine']

### TODO: print the results to check that the assignment is correct
print(train_ingredients)
print(train_targets)

0 [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese...
1 [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil...
2 [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so...
3 [water, vegetable oil, wheat, salt]
4 [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch...
5 [plain flour, sugar, butter, eggs, fresh ginger root, salt, ground cinnamon, milk, vanilla extract, ground ginger, p...
6 [olive oil, salt, medium shrimp, pepper, garlic, chopped cilantro, jalapeno chilies, flat leaf parsley, skirt steak,...
7 [sugar, pistachio nuts, white almond bark, flour, vanilla extract, olive oil, almond extract, eggs, baking powder, d...
8 [olive oil, purple onion, fresh pineapple, pork, poblano peppers, corn tortillas, cheddar cheese, ground black peppe...
9 [chopped tomatoes, fresh basil, garlic, extra-virgin olive oil, kosher salt, flat leaf parsley]
10 [pimentos, sweet pepper, dried oregano, olive oil, garlic, sharp cheddar cheese, pepper, swiss cheese, provolone che...
11 [low sodium soy sauce, fresh ginger, dry mustard, green beans, white pepper, sesame oil, scallions, canola oil, suga...
12 [Italian parsley leaves, walnuts, hot red pepper flakes, extra-virgin olive oil, fresh lemon juice, trout fillet, ga...
13 [ground cinnamon, fresh cilantro, chili powder, ground coriander, kosher salt, ground black pepper, garlic, plum tom...
14 [fresh parmesan cheese, butter, all-purpose flour, fat free less sodium chicken broth, chopped fresh chives, gruyere...
15 [tumeric, vegetable stock, tomatoes, garam masala, naan, red lentils, red chili peppers, onions, spinach, sweet pota...
16 [greek yogurt, lemon curd, confectioners sugar, raspberries]
17 [italian seasoning, broiler-fryer chicken, mayonaise, zesty italian dressing]
18 [sugar, hot chili, asian fish sauce, lime juice]
19 [soy sauce, vegetable oil, red bell pepper, chicken broth, yellow squash, garlic chili sauce, sliced green onions, b...
20 [pork loin, roasted peanuts, chopped cilantro fresh, hoisin sauce, creamy peanut butter, chopped fresh mint, thai ba...
21 [roma tomatoes, kosher salt, purple onion, jalapeno chilies, lime, chopped cilantro]
22 [low-fat mayonnaise, pepper, salt, baking potatoes, eggs, spicy brown mustard]
23 [sesame seeds, red pepper, yellow peppers, water, extra firm tofu, broccoli, soy sauce, orange bell pepper, arrowroo...
24 [marinara sauce, flat leaf parsley, olive oil, linguine, capers, crushed red pepper flakes, olives, lemon zest, garlic]
25 [sugar, lo mein noodles, salt, chicken broth, light soy sauce, flank steak, beansprouts, dried black mushrooms, pepp...
26 [herbs, lemon juice, fresh tomatoes, paprika, mango, stock, chile pepper, onions, red chili peppers, oil]
27 [ground black pepper, butter, sliced mushrooms, sherry, salt, grated parmesan cheese, heavy cream, spaghetti, chicke...
28 [green bell pepper, egg roll wrappers, sweet and sour sauce, corn starch, molasses, vegetable oil, oil, soy sauce, s...
29 [flour tortillas, cheese, breakfast sausages, large eggs]
...
39744 [extra-virgin olive oil, oregano, potatoes, garlic cloves, pepper, salt, yellow mustard, fresh lemon juice]
39745 [quinoa, extra-virgin olive oil, fresh thyme leaves, scallion greens]
39746 [clove, bay leaves, ginger, chopped cilantro, ground turmeric, white onion, cinnamon, cardamom pods, serrano chile, ...
39747 [water, sugar, grated lemon zest, butter, pitted date, blanched almonds]
39748 [sea salt, pizza doughs, all-purpose flour, cornmeal, extra-virgin olive oil, shredded mozzarella cheese, kosher sal...
39749 [kosher salt, minced onion, tortilla chips, sugar, tomato juice, cilantro leaves, avocado, lime juice, roma tomatoes...
39750 [ground black pepper, chicken breasts, salsa, cheddar cheese, pepper jack, heavy cream, red enchilada sauce, unsalte...
39751 [olive oil, cayenne pepper, chopped cilantro fresh, boneless chicken skinless thigh, fine sea salt, low salt chicken...
39752 [self rising flour, milk, white sugar, butter, peaches in light syrup]
39753 [rosemary sprigs, lemon zest, garlic cloves, ground black pepper, vegetable broth, fresh basil leaves, minced garlic...
39754 [jasmine rice, bay leaves, sticky rice, rotisserie chicken, chopped cilantro, large eggs, vegetable oil, yellow onio...
39755 [mint leaves, cilantro leaves, ghee, tomatoes, cinnamon, oil, basmati rice, garlic paste, salt, coconut milk, clove,...
39756 [vegetable oil, cinnamon sticks, water, all-purpose flour, piloncillo, salt, orange zest, baking powder, hot water]
39757 [red bell pepper, garlic cloves, extra-virgin olive oil, feta cheese crumbles]
39758 [milk, salt, ground cayenne pepper, ground lamb, ground cinnamon, ground black pepper, pomegranate, chopped fresh mi...
39759 [red chili peppers, sea salt, onions, water, chilli bean sauce, caster sugar, garlic, white vinegar, chili oil, cucu...
39760 [butter, large eggs, cornmeal, baking powder, boiling water, milk, salt]
39761 [honey, chicken breast halves, cilantro leaves, carrots, soy sauce, Sriracha, wonton wrappers, freshly ground pepper...
39762 [curry powder, salt, chicken, water, vegetable oil, basmati rice, eggs, finely chopped onion, lemon juice, pepper, m...
39763 [fettuccine pasta, low-fat cream cheese, garlic, nonfat evaporated milk, grated parmesan cheese, corn starch, nonfat...
39764 [chili powder, worcestershire sauce, celery, red kidney beans, lean ground beef, stewed tomatoes, dried parsley, pep...
39765 [coconut, unsweetened coconut milk, mint leaves, plain yogurt]
39766 [rutabaga, ham, thick-cut bacon, potatoes, fresh parsley, salt, onions, pepper, carrots, pork sausages]
39767 [low-fat sour cream, grated parmesan cheese, salt, dried oregano, low-fat cottage cheese, butter, onions, olive oil,...
39768 [shredded cheddar cheese, crushed cheese crackers, cheddar cheese soup, cream of chicken soup, hot sauce, diced gree...
39769 [light brown sugar, granulated sugar, butter, warm water, large eggs, all-purpose flour, whole wheat flour, cooking ...
39770 [KRAFT Zesty Italian Dressing, purple onion, broccoli florets, rotini, pitted black olives, Kraft Grated Parmesan Ch...
39771 [eggs, citrus fruit, raisins, sourdough starter, flour, hot tea, sugar, ground nutmeg, salt, ground cinnamon, milk, ...
39772 [boneless chicken skinless thigh, minced garlic, steamed white rice, baking powder, corn starch, dark soy sauce, kos...
39773 [green chile, jalapeno chilies, onions, ground black pepper, salt, chopped cilantro fresh, green bell pepper, garlic...
Name: ingredients, Length: 39774, dtype: object
0 greek
1 southern_us
2 filipino
3 indian
4 indian
5 jamaican
6 spanish
7 italian
8 mexican
9 italian
10 italian
11 chinese
12 italian
13 mexican
14 italian
15 indian
16 british
17 italian
18 thai
19 vietnamese
20 thai
21 mexican
22 southern_us
23 chinese
24 italian
25 chinese
26 cajun_creole
27 italian
28 chinese
29 mexican
...
39744 greek
39745 spanish
39746 indian
39747 moroccan
39748 italian
39749 mexican
39750 mexican
39751 moroccan
39752 southern_us
39753 italian
39754 vietnamese
39755 indian
39756 mexican
39757 greek
39758 greek
39759 korean
39760 southern_us
39761 chinese
39762 indian
39763 italian
39764 mexican
39765 indian
39766 irish
39767 italian
39768 mexican
39769 irish
39770 italian
39771 irish
39772 chinese
39773 mexican
Name: cuisine, Length: 39774, dtype: object

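Programming exercise: basic statistics
Which 10 ingredients are used most frequently overall?
Which 10 ingredients are most common in Italian cuisine?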

## TODO: count how often each ingredient appears and store the counts in the sum_ingredients dictionary
m = []
for i in range(len(train_ingredients)):
    m += train_ingredients[i]
sum_ingredients = pd.Series(m).value_counts().to_dict()

Or:

from collections import defaultdict

sum_ingredients = defaultdict(int)
for row in train_ingredients:
    for item in row:
        sum_ingredients[item] += 1
sum_ingredients = dict(sum_ingredients)

# Finally, plot the 10 most used ingredients
plt.style.use(u'ggplot')
fig = pd.DataFrame(sum_ingredients, index=[0]).transpose()[0].sort_values(ascending=False, inplace=False)[:10].plot(kind='barh')
fig.invert_yaxis()
fig = fig.get_figure()
fig.tight_layout()

## TODO: count how often each ingredient appears in Italian cuisine and store the counts in the italian_ingredients dictionary
list_italian = train_content.loc[train_content['cuisine'].isin(['italian'])]['ingredients'].reset_index(drop=True)
n = []
for j in range(len(list_italian)):
    n += list_italian[j]
italian_ingredients = pd.Series(n).value_counts().to_dict()

Or:

italian_ingredients = {}  # start from an empty dictionary so this alternative runs on its own
cuisine_ingredients = zip(train_targets, train_ingredients)
for cuisine, ingredients in cuisine_ingredients:
    if cuisine == 'italian':
        for item in ingredients:
            if item in italian_ingredients:
                italian_ingredients[item] += 1
            else:
                italian_ingredients[item] = 1
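
Both counting exercises can also be written with collections.Counter from the standard library. The sketch below is only an illustration on made-up recipes (toy_ingredients is a hypothetical stand-in for train_ingredients), not the notebook's reference solution:

```python
from collections import Counter

# Hypothetical recipes standing in for train_ingredients.
toy_ingredients = [
    ["salt", "olive oil", "garlic"],
    ["salt", "sugar", "butter"],
    ["garlic", "salt", "basil"],
]

# Counter.update() adds one count per item in each ingredient list.
ingredient_counter = Counter()
for recipe in toy_ingredients:
    ingredient_counter.update(recipe)

# most_common(n) returns the n most frequent (ingredient, count) pairs.
top_two = ingredient_counter.most_common(2)
print(top_two)
```

A Counter is a dict subclass, so dict(ingredient_counter) would give the plain dictionary that the exercise asks for.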

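Step 3. Build the model

3.1 Word cleaning
Dishes contain many ingredients, and the same ingredient may appear in different forms (singular or plural, different tenses, and so on). To remove these differences, we filter the ingredients.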

import re
from nltk.stem import WordNetLemmatizer
import numpy as np

def text_clean(ingredients):
    # Remove punctuation, keeping only the letters a..z and A..Z
    ingredients = np.array(ingredients).tolist()
    print("Ingredients:\n{}".format(ingredients[9]))
    ingredients = [[re.sub('[^A-Za-z]', ' ', word) for word in component] for component in ingredients]
    print("After removing punctuation:\n{}".format(ingredients[9]))

    # Reduce plural and inflected forms to the word's base form (lemma).
    # lemmatize() treats each word as a noun by default, so plurals are reduced
    # but verb forms such as 'chopped' are left unchanged.
    lemma = WordNetLemmatizer()
    ingredients = [" ".join([" ".join([lemma.lemmatize(w) for w in words.split(" ")]) for words in component]) for component in ingredients]
    print("After removing tense and plural forms:\n{}".format(ingredients[9]))
    return ingredients

print("\nProcessing the training set...")
train_ingredients = text_clean(train_content['ingredients'])
print("\nProcessing the test set...")
test_ingredients = text_clean(test_content['ingredients'])

Processing the training set...
Ingredients:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra-virgin olive oil', 'kosher salt', 'flat leaf parsley']
After removing punctuation:
['chopped tomatoes', 'fresh basil', 'garlic', 'extra virgin olive oil', 'kosher salt', 'flat leaf parsley']
After removing tense and plural forms:
chopped tomato fresh basil garlic extra virgin olive oil kosher salt flat leaf parsley

Processing the test set...
Ingredients:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After removing punctuation:
['eggs', 'cherries', 'dates', 'dark muscovado sugar', 'ground cinnamon', 'mixed spice', 'cake', 'vanilla extract', 'self raising flour', 'sultana', 'rum', 'raisins', 'prunes', 'glace cherries', 'butter', 'port']
After removing tense and plural forms:
egg cherry date dark muscovado sugar ground cinnamon mixed spice cake vanilla extract self raising flour sultana rum raisin prune glace cherry butter port
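
To see the punctuation step in isolation, here is a small sketch that applies the same regular expression to individual strings. The helper name strip_non_letters is hypothetical and does not appear in the notebook:

```python
import re

def strip_non_letters(ingredient):
    """Replace every character outside a-z / A-Z with a space."""
    return re.sub("[^A-Za-z]", " ", ingredient)

cleaned = [strip_non_letters(w) for w in ["extra-virgin olive oil", "kosher salt"]]
print(cleaned)

# Accented letters are outside A-Za-z, so they are replaced as well.
accented = strip_non_letters("crème")
print(accented)
```

Because the pattern keeps only unaccented ASCII letters, digits and accented characters are blanked out too. That may or may not be desirable for ingredient names, so it is worth checking a few samples before relying on it.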

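3.2 Feature extraction
In this step we convert each dish's ingredients into a numeric feature vector. Because the vast majority of dishes contain salt, water, sugar, butter and the like, vectors extracted with a one-hot approach would not distinguish cuisines well. We therefore weight each ingredient by how often it appears: the more often an ingredient appears across dishes, the less discriminative it is. The feature we use is TF-IDF; for an introduction, see "TF-IDF與余弦相似性的應用（一）：自動提取關鍵詞" (Applications of TF-IDF and Cosine Similarity, Part 1: Automatic Keyword Extraction).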
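
The weighting idea can be illustrated with a hand-computed inverse document frequency on a toy corpus. This sketch uses the textbook definition log(N / df) and made-up recipes (toy_recipes and inverse_document_frequency are hypothetical names); scikit-learn's TfidfVectorizer uses a smoothed variant by default and normalizes each row, so its numbers will differ:

```python
import math

# Hypothetical cleaned recipes, one string per recipe.
toy_recipes = [
    "salt garlic olive oil basil",
    "salt soy sauce ginger",
    "salt butter sugar",
]

def inverse_document_frequency(term, documents):
    """Textbook IDF: log(N / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / containing)

idf_salt = inverse_document_frequency("salt", toy_recipes)    # appears in every recipe
idf_basil = inverse_document_frequency("basil", toy_recipes)  # appears in one recipe
print(idf_salt, idf_basil)
```

An ingredient that occurs in every recipe gets a weight of zero under this definition, while a rare one gets a larger weight, which is the behaviour the paragraph above describes.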

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the ingredients into feature vectors

# Process the training set
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), analyzer='word', max_df=.57, binary=False, token_pattern=r"\w+", sublinear_tf=False)
# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
train_tfidf = vectorizer.fit_transform(train_ingredients).toarray()

## Process the test set
test_tfidf = vectorizer.transform(test_ingredients)

train_targets = np.array(train_content['cuisine']).tolist()
train_targets[:10]

['greek',
'southern_us',
'filipino',
'indian',
'indian',
'jamaican',
'spanish',
'italian',
'mexican',
'italian']

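Programming exercise
To keep errors accumulated in earlier steps from breaking the steps below, we check here that the processed data is correct. Print the first five entries of train_tfidf and train_targets.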

# Preview the first five entries of train_tfidf and train_targets by slicing (neither has a head() method)
print(train_tfidf[:5])
print(train_targets[:5])

[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
['greek', 'southern_us', 'filipino', 'indian', 'indian']

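3.3 Splitting off a validation set
To get a rough estimate of the model's accuracy during the experiment, we split off 20% of the original train_ingredients data to use as valid_ingredients.

Programming exercise: splitting and shuffling the data
Call train_test_split to divide the training set into a new training set and a validation set, so that model accuracy can be monitored later.

Import train_test_split from sklearn.model_selection
Pass train_tfidf and train_targets to train_test_split as inputs
Set test_size to 0.2 to split off 20% as the validation set, leaving 80% as the new training set.
Set the random_state seed so that every run produces the same split. (With a fixed seed, the generated random sequence is deterministic.)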

### TODO: split off the validation set
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_tfidf, train_targets, test_size=0.2, random_state=0)
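
The effect of fixing random_state can be checked on a toy example. The sketch below uses made-up data (toy_features and toy_labels are hypothetical) and simply calls train_test_split twice with the same seed:

```python
from sklearn.model_selection import train_test_split

toy_features = [[i] for i in range(10)]   # hypothetical feature rows
toy_labels = [i % 2 for i in range(10)]   # hypothetical labels

split_a = train_test_split(toy_features, toy_labels, test_size=0.2, random_state=0)
split_b = train_test_split(toy_features, toy_labels, test_size=0.2, random_state=0)

# With the same random_state, both calls return identical partitions.
same_split = split_a == split_b
print(len(split_a[0]), len(split_a[1]), same_split)
```

With 10 rows and test_size=0.2, 8 rows go to the new training set and 2 to the validation set, and repeating the call with the same seed reproduces the same partition.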

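3.4 Build the model
Use the Logistic Regression model from sklearn.

Programming exercise: train the model

Import LogisticRegression from sklearn.linear_model
Import GridSearchCV from sklearn.model_selection. It searches parameters automatically: given the candidate values, it reports the best result and the best parameters. This approach suits small datasets.
Define the parameters variable: create a dictionary for the C parameter whose value is the array of integers from 1 to 10;
Define the classifier variable: create a classifier using the imported LogisticRegression;
Define the grid variable: create a grid search object using the imported GridSearchCV, passing the variables 'classifier' and 'parameters' to its constructor;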

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

## TODO: build the logistic regression model
parameters = {'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
classifier = LogisticRegression()
grid = GridSearchCV(classifier, parameters)
grid = grid.fit(X_train, y_train)
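
After fitting, the grid search object exposes which setting won. The following sketch runs the same kind of search on synthetic data; X_toy, y_toy, toy_grid, the shortened C grid, cv=3 and max_iter=1000 are all assumptions chosen to keep the example fast, not values from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data in place of X_train / y_train.
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=0)

toy_grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=3)
toy_grid.fit(X_toy, y_toy)

# best_params_ holds the winning setting; best_score_ is its mean cross-validated accuracy.
print(toy_grid.best_params_, toy_grid.best_score_)
```

Note that best_score_ comes from cross-validation inside the data passed to fit, so it is a separate number from any score measured afterwards on a held-out validation set.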

After training, we compute the model's predictions on the validation set X_valid and measure its accuracy by comparing them one by one with y_valid.

from sklearn.metrics import accuracy_score

## Compute the model's accuracy
valid_predict = grid.predict(X_valid)
valid_score = accuracy_score(y_valid, valid_predict)
print("Score on the validation set: {}".format(valid_score))

Score on the validation set: 0.7967316153362665
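
The one-by-one comparison behind this score can be written out by hand. The sketch below uses made-up labels (toy_true and toy_pred are hypothetical) and a helper named accuracy that is not part of the notebook:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)

toy_true = ["greek", "indian", "italian", "mexican"]
toy_pred = ["greek", "indian", "french", "mexican"]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 3 of 4 labels match -> 0.75
```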

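Step 4. Model prediction (optional)

4.1 Predict on the test set

Programming exercise
Use the grid model to predict on the test set test_tfidf, then inspect the predictions.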

### TODO: predict the test results
predictions = grid.predict(test_tfidf)
print("Number of test samples predicted: {}".format(len(predictions)))
test_content['cuisine'] = predictions
test_content.head(10)

Number of test samples predicted: 9944

4.2 Submit the results

## Load the submission format
submit_frame = pd.read_csv("sample_submission.csv")

## Save the results
result = pd.merge(submit_frame, test_content, on="id", how='left')
result = result.rename(index=str, columns={"cuisine_y": "cuisine"})
test_result_name = "tfidf_cuisine_test.csv"
result[['id', 'cuisine']].to_csv(test_result_name, index=False)
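
The rename of "cuisine_y" works because pandas adds the suffixes _x and _y when both frames share a non-key column name. A small sketch on made-up frames (toy_submit and toy_test are hypothetical miniatures, assuming the sample submission file has id and cuisine columns) shows where that column comes from:

```python
import pandas as pd

# Hypothetical miniature versions of sample_submission.csv and test_content.
toy_submit = pd.DataFrame({"id": [3, 1], "cuisine": ["italian", "italian"]})
toy_test = pd.DataFrame({"id": [1, 3], "cuisine": ["thai", "greek"]})

# Both frames have a "cuisine" column, so merge labels them cuisine_x and cuisine_y.
merged = pd.merge(toy_submit, toy_test, on="id", how="left")
merged_columns = list(merged.columns)
print(merged_columns)

# Keep the predicted labels (the right-hand frame) under the name "cuisine".
toy_result = merged.rename(columns={"cuisine_y": "cuisine"})[["id", "cuisine"]]
print(toy_result)
```

The left merge keeps the row order of the submission template, so the predictions end up aligned with the ids in that file.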
