Machine Learning Basics --- Basic Usage of pandas


I. Introduction to pandas

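  pandas (the Python Data Analysis Library) is a tool built on NumPy and created for data analysis tasks. It brings together a number of libraries and standard data models, and it provides the tools needed to work efficiently with large datasets, along with many functions and methods for processing data quickly and conveniently.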

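Data structures in pandas:

  Series: a one-dimensional array, similar to a one-dimensional array in NumPy. Both resemble Python's basic list structure. The difference is that a list can hold elements of different types, while an array or a Series has a single dtype for all of its elements, which uses memory more efficiently and speeds up computation. (Mixed values can still be stored, but only under the generic object dtype, which gives up most of that benefit.)

  Time Series: a Series indexed by time.

  DataFrame: a two-dimensional, table-like data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series.

  Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated and has been removed from recent pandas releases.

  This article covers DataFrame and Series, with the focus on DataFrame.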
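The relationship between these structures can be sketched with a few hand-built objects. The values and column names below are made up for illustration and are not taken from the article's data files.

```python
import pandas as pd

# A Series has a single dtype for all of its values.
ints = pd.Series([1, 2, 3])
# Mixed values are accepted, but pandas falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])

# A DataFrame built from a dict of columns (hypothetical data).
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5]})
# Selecting one column gives back a Series.
score_col = df["score"]

print(ints.dtype)
print(mixed.dtype)
print(type(score_col))
```

Selecting a single column returns a Series, which is why a DataFrame can be read as a container of Series that share one index.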

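  Data files used in this article: pandas的基本使用.zip

  This article only introduces basic pandas usage through examples. For a deeper study, see the official pandas documentation.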

II. DataFrame in pandas

  pandas makes it easy to perform routine operations on two-dimensional tables.

1. Reading a CSV (or Excel, etc.) file with pandas

import pandas
food_info = pandas.read_csv("food_info.csv")          # read the CSV file
# to read an Excel file, use pandas.read_excel() instead
print(type(food_info))           # food_info is a DataFrame object
print(food_info.dtypes)          # data type of each column
<class 'pandas.core.frame.DataFrame'>
NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64
Carbohydrt_(g)     float64
Fiber_TD_(g)       float64
Sugar_Tot_(g)      float64
Calcium_(mg)       float64
Iron_(mg)          float64
Magnesium_(mg)     float64
Phosphorus_(mg)    float64
Potassium_(mg)     float64
Sodium_(mg)        float64
Zinc_(mg)          float64
Copper_(mg)        float64
Manganese_(mg)     float64
Selenium_(mcg)     float64
Vit_C_(mg)         float64
Thiamin_(mg)       float64
Riboflavin_(mg)    float64
Niacin_(mg)        float64
Vit_B6_(mg)        float64
Vit_B12_(mcg)      float64
Vit_A_IU           float64
Vit_A_RAE          float64
Vit_E_(mg)         float64
Vit_D_mcg          float64
Vit_D_IU           float64
Vit_K_(mcg)        float64
FA_Sat_(g)         float64
FA_Mono_(g)        float64
FA_Poly_(g)        float64
Cholestrl_(mg)     float64
dtype: object
Output
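For readers who do not have food_info.csv at hand, the same calls can be tried on an in-memory CSV. This is only a sketch: the three rows reuse values that appear in the outputs later in this article, and wrapping the text in io.StringIO is an assumption made to keep the example self-contained.

```python
import io
import pandas as pd

# A tiny stand-in for food_info.csv (four of its columns, three rows).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1003,BUTTER OIL ANHYDROUS,0.24,876
1006,CHEESE BRIE,48.42,334
1011,CHEESE COLBY,38.20,394
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # dtypes are inferred per column
```

As in the real file, the integer-looking columns are inferred as integers, the decimal column as float, and the text column is kept as strings.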

2. Retrieving data

food_info.head(10)     # first 10 rows; the default is 5
# first_rows = food_info.head()
# first_rows
# food_info.tail(8)     # last 8 rows; the default is 5
# print(food_info.tail())
print(food_info.columns)    # column names of food_info (the table header)
# print(food_info.shape)    # shape of the data, e.g. 8618 rows × 36 columns for this file
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')
Output 1:
 
         
# print(food_info.loc[0])     # the row with index label 0
print(food_info.loc[6000])          # the row with index label 6000
# food_info.loc[10000] # label 10000 is beyond the end of the data file, so this raises KeyError: 'the label [10000] is not in the [index]'
NDB_No                                                         18995
Shrt_Desc          KELLOGG'S EGGO BISCUIT SCRAMBLERS BACON EGG & CHS
Water_(g)                                                       42.9
Energ_Kcal                                                       258
Protein_(g)                                                      8.8
Lipid_Tot_(g)                                                    7.9
Ash_(g)                                                          NaN
Carbohydrt_(g)                                                  38.3
Fiber_TD_(g)                                                     2.1
Sugar_Tot_(g)                                                    4.7
Calcium_(mg)                                                     124
Iron_(mg)                                                        2.7
Magnesium_(mg)                                                    14
Phosphorus_(mg)                                                  215
Potassium_(mg)                                                   225
Sodium_(mg)                                                      610
Zinc_(mg)                                                        0.5
Copper_(mg)                                                      NaN
Manganese_(mg)                                                   NaN
Selenium_(mcg)                                                   NaN
Vit_C_(mg)                                                       NaN
Thiamin_(mg)                                                     0.3
Riboflavin_(mg)                                                 0.26
Niacin_(mg)                                                      2.4
Vit_B6_(mg)                                                     0.02
Vit_B12_(mcg)                                                    0.1
Vit_A_IU                                                         NaN
Vit_A_RAE                                                        NaN
Vit_E_(mg)                                                         0
Vit_D_mcg                                                          0
Vit_D_IU                                                           0
Vit_K_(mcg)                                                      NaN
FA_Sat_(g)                                                       4.1
FA_Mono_(g)                                                      1.5
FA_Poly_(g)                                                      1.1
Cholestrl_(mg)                                                    27
Name: 6000, dtype: object
Output 2
# food_info.loc[3:6]        # rows labeled 3 through 6 (with .loc, both ends are included)
two_five_ten = [2,5,10]     
print(food_info.loc[two_five_ten])   # rows labeled 2, 5 and 10
    NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
2     1003  BUTTER OIL ANHYDROUS       0.24         876         0.28   
5     1006           CHEESE BRIE      48.42         334        20.75   
10    1011          CHEESE COLBY      38.20         394        23.76   

    Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
2           99.48     0.00            0.00           0.0           0.00   
5           27.68     2.70            0.45           0.0           0.45   
10          32.11     3.36            2.57           0.0           0.52   

         ...        Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  \
2        ...          3069.0      840.0        2.80        1.8      73.0   
5        ...           592.0      174.0        0.24        0.5      20.0   
10       ...           994.0      264.0        0.28        0.6      24.0   

    Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  
2           8.6      61.924       28.732        3.694           256.0  
5           2.3      17.410        8.013        0.826           100.0  
10          2.7      20.218        9.280        0.953            95.0  
Output 3
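Because .loc selects by index label rather than by position, its slices behave differently from ordinary Python slices. The following is a minimal sketch on a hypothetical five-row frame, with .iloc shown for comparison.

```python
import pandas as pd

# Hypothetical data with the default-style integer labels 0..4.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=[0, 1, 2, 3, 4])

by_label = df.loc[1:3]        # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]    # positions 1 and 2: the end position is excluded
picked = df.loc[[0, 2, 4]]    # a list of labels selects exactly those rows

# Asking .loc for a label that does not exist raises KeyError.
try:
    df.loc[10]
    missing = False
except KeyError:
    missing = True

print(len(by_label), len(by_position), missing)
```

With the default integer index the labels and positions coincide, which hides the difference; it becomes visible after sorting, filtering, or setting a different index.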
 
         
# food_info['Shrt_Desc']     # the column named 'Shrt_Desc'
ndb_col = food_info['NDB_No']    # the column named 'NDB_No'
# print(ndb_col)
col_name = 'Shrt_Desc'
print(food_info[col_name])
0                                        BUTTER WITH SALT
1                                BUTTER WHIPPED WITH SALT
2                                    BUTTER OIL ANHYDROUS
3                                             CHEESE BLUE
4                                            CHEESE BRICK
5                                             CHEESE BRIE
6                                        CHEESE CAMEMBERT
7                                          CHEESE CARAWAY
8                                          CHEESE CHEDDAR
9                                         CHEESE CHESHIRE
10                                           CHEESE COLBY
11                    CHEESE COTTAGE CRMD LRG OR SML CURD
12                            CHEESE COTTAGE CRMD W/FRUIT
13       CHEESE COTTAGE NONFAT UNCRMD DRY LRG OR SML CURD
14                       CHEESE COTTAGE LOWFAT 2% MILKFAT
15                       CHEESE COTTAGE LOWFAT 1% MILKFAT
16                                           CHEESE CREAM
17                                            CHEESE EDAM
18                                            CHEESE FETA
19                                         CHEESE FONTINA
20                                         CHEESE GJETOST
21                                           CHEESE GOUDA
22                                         CHEESE GRUYERE
23                                       CHEESE LIMBURGER
24                                        CHEESE MONTEREY
25                             CHEESE MOZZARELLA WHL MILK
26                    CHEESE MOZZARELLA WHL MILK LO MOIST
27                       CHEESE MOZZARELLA PART SKIM MILK
28                   CHEESE MOZZARELLA LO MOIST PART-SKIM
29                                        CHEESE MUENSTER
                              ...                        
8588           BABYFOOD CRL RICE W/ PEARS & APPL DRY INST
8589                       BABYFOOD BANANA NO TAPIOCA STR
8590                       BABYFOOD BANANA APPL DSSRT STR
8591         SNACKS TORTILLA CHIPS LT (BAKED W/ LESS OIL)
8592    CEREALS RTE POST HONEY BUNCHES OF OATS HONEY RSTD
8593                           POPCORN MICROWAVE LOFAT&NA
8594                         BABYFOOD FRUIT SUPREME DSSRT
8595                                 CHEESE SWISS LOW FAT
8596               BREAKFAST BAR CORN FLAKE CRUST W/FRUIT
8597                              CHEESE MOZZARELLA LO NA
8598                             MAYONNAISE DRSNG NO CHOL
8599                            OIL CORN PEANUT AND OLIVE
8600                     SWEETENERS TABLETOP FRUCTOSE LIQ
8601                                CHEESE FOOD IMITATION
8602                                  CELERY FLAKES DRIED
8603             PUDDINGS CHOC FLAVOR LO CAL INST DRY MIX
8604                      BABYFOOD GRAPE JUC NO SUGAR CND
8605                     JELLIES RED SUGAR HOME PRESERVED
8606                           PIE FILLINGS BLUEBERRY CND
8607                 COCKTAIL MIX NON-ALCOHOLIC CONCD FRZ
8608              PUDDINGS CHOC FLAVOR LO CAL REG DRY MIX
8609    PUDDINGS ALL FLAVORS XCPT CHOC LO CAL REG DRY MIX
8610    PUDDINGS ALL FLAVORS XCPT CHOC LO CAL INST DRY...
8611                                   VITAL WHEAT GLUTEN
8612                                        FROG LEGS RAW
8613                                      MACKEREL SALTED
8614                           SCALLOP (BAY&SEA) CKD STMD
8615                                           SYRUP CANE
8616                                            SNAIL RAW
8617                                     TURTLE GREEN RAW
Name: Shrt_Desc, Length: 8618, dtype: object
Output 4
 
         
columns = ['Water_(g)', 'Shrt_Desc']   
water_desc = food_info[columns]      # the two columns named 'Water_(g)' and 'Shrt_Desc'
print(water_desc)


# Get the columns whose names end with "(mg)"
col_names = food_info.columns.tolist()
# print(col_names)
milligram_columns = []
for items in col_names:
    if items.endswith("(mg)"):
        milligram_columns.append(items)
milligram_df = food_info[milligram_columns]
print(milligram_df)
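The loop above can also be written more compactly. This sketch shows two equivalent forms on a hypothetical one-row frame (the numeric values are made up): a list comprehension, and a boolean mask built with the vectorized string methods of the column index.

```python
import pandas as pd

# Hypothetical one-row frame using column names in the style of food_info.
df = pd.DataFrame({
    "Shrt_Desc": ["EXAMPLE FOOD"],
    "Water_(g)": [48.0],
    "Iron_(mg)": [0.5],
    "Zinc_(mg)": [2.0],
})

# Form 1: list comprehension over the column names.
mg_cols = [name for name in df.columns if name.endswith("(mg)")]
mg_df = df[mg_cols]

# Form 2: boolean mask over the columns, applied with .loc.
mask = df.columns.str.endswith("(mg)")
mg_df_masked = df.loc[:, mask]

print(mg_cols)
```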

 

3. Simple data processing:

import pandas

food_info = pandas.read_csv('food_info.csv')
# food_info.head(3)
# print(food_info.shape) 

# print(food_info['Iron_(mg)'])
# The Iron_(mg) column is in mg; divide its values by 1000 to convert them to g
div_1000 = food_info['Iron_(mg)'] / 1000
# print(div_1000) 

# Compute a value from two columns of each row
water_energy = food_info['Water_(g)'] * food_info['Energ_Kcal']  
# print(food_info.shape)
# Insert a new column into the DataFrame, named 'water_energy', holding the values in water_energy
food_info['water_energy'] = water_energy
# print(food_info[['Water_(g)', 'Energ_Kcal', 'water_energy']])
# print(food_info.shape)

# Maximum value of a column
max_calories = food_info['Energ_Kcal'].max()
# print(max_calories)

# Sort by a given column. inplace=False returns the sorted result as a new DataFrame, inplace=True sorts the original in place. The default order is ascending.
# food_info.sort_values('Sodium_(mg)', inplace=True)
# print(food_info['Sodium_(mg)'])
a = food_info.sort_values('Sodium_(mg)', inplace=False, ascending=False)  # ascending=False sorts in descending order
# print(food_info['Sodium_(mg)'])
# print(a['Sodium_(mg)'])
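The same three steps (a derived column, a column maximum, and a sort that leaves the original untouched) can be checked on a small frame. The data here is made up for illustration, and the column name 'Iron_(g)' is a hypothetical addition.

```python
import pandas as pd

# Hypothetical data in the style of food_info.
df = pd.DataFrame({
    "food": ["a", "b", "c"],
    "Iron_(mg)": [2.0, 0.5, 8.0],
    "Sodium_(mg)": [610.0, 5.0, 120.0],
})

# Arithmetic on a column is applied element by element.
df["Iron_(g)"] = df["Iron_(mg)"] / 1000

max_iron = df["Iron_(mg)"].max()

# Without inplace=True, sort_values returns a new DataFrame.
by_sodium = df.sort_values("Sodium_(mg)", ascending=False)

print(by_sodium["food"].tolist())   # sorted copy
print(df["food"].tolist())          # original order is unchanged
```

Note that the sorted copy keeps the original index labels, so the rows of by_sodium are labeled 0, 2, 1 rather than 0, 1, 2.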

 

4. Common data operations

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv('titanic_train.csv')
# titanic_survival.head()

age = titanic_survival['Age']
# print(age.loc[0:10])
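age_is_null = pd.isnull(age)    # element-wise test for missing values; the result can be used as a boolean index
# print(age_is_null)
age_null_true = age[age_is_null]   # the entries whose value is missing
# print(age_null_true)
print(len(age_null_true))     # how many values are missing in total


# Compute the mean using only the non-missing values
good_ages = age[~age_is_null]     # the entries whose value is not missing
# print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)   # mean
print(correct_mean_age)
# Alternatively, use pandas' built-in mean, which skips missing values automatically
correct_mean_age = age.mean()   # mean, ignoring missing values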
print(correct_mean_age)
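Why the missing values have to be filtered out first can be seen on a short hypothetical Series of ages: a plain Python sum propagates NaN, while the pandas reductions skip it by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0])

n_missing = age.isnull().sum()            # True counts as 1

naive_mean = sum(age) / len(age)          # NaN: the built-in sum does not skip NaN
manual_mean = age[age.notnull()].sum() / age.notnull().sum()
builtin_mean = age.mean()                 # skipna=True is the default

print(n_missing, naive_mean, manual_mean, builtin_mean)
```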


# pivot_table computes the mean by default, so aggfunc can be omitted when a mean is what you want
# index tells the method which column to group by
# values is the column that we want to apply the calculation to
# aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc='mean')  # mean of Survived for each distinct Pclass value
print(passenger_survival)

# Group and sum several columns
# port_stats = titanic_survival.pivot_table(index="Embarked", values=['Fare', "Survived"], aggfunc='sum')  # sum the fares and the number of survivors for each port
# print(port_stats)
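A pivot_table with a single index column is closely related to a groupby. The sketch below, on made-up passenger data, shows the two side by side; passing the aggregation as a string ('mean', 'sum') is a choice made here because it avoids the warnings that newer pandas versions emit for NumPy callables.

```python
import pandas as pd

# Hypothetical passengers (not the real titanic_train.csv).
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Mean survival per class as a one-column DataFrame.
by_class = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# The same numbers as a Series, via groupby.
same = df.groupby("Pclass")["Survived"].mean()

# Several value columns can be aggregated at once.
totals = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(by_class)
print(totals)
```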


# Drop data with missing values
drop_na_columns = titanic_survival.dropna(axis=1, inplace=False)    # axis=1 drops every column that contains a missing value; inplace=False returns a new DataFrame, otherwise the current object is modified
# print(drop_na_columns)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=['Age', 'Sex'], inplace=False)  # axis=0 drops rows; subset names the columns to check, and a row is dropped if any of them is missing
# print(new_titanic_survival)
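The axis argument of dropna is easy to get backwards, so here is a small check on a hypothetical three-row frame: axis=1 removes columns, axis=0 removes rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Age and Cabin contain missing values, Name does not.
df = pd.DataFrame({
    "Name":  ["A", "B", "C"],
    "Age":   [22.0, np.nan, 30.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# axis=1: drop every column that contains at least one missing value.
no_nan_columns = df.dropna(axis=1)

# axis=0 with subset: drop rows where any of the listed columns is missing.
rows_with_age = df.dropna(axis=0, subset=["Age"])

print(list(no_nan_columns.columns))
print(rows_with_age.index.tolist())
print(df.shape)   # the original is unchanged
```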


# Look up the value at a specific row and column
row_index_83_age = titanic_survival.loc[83, 'Age']
row_index_766_pclass = titanic_survival.loc[766, 'Pclass']
print(row_index_83_age)
print(row_index_766_pclass)


new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)   # sort the rows by age in descending order
print(new_titanic_survival[0:10])
print('------------------------>')
titanic_reindexed = new_titanic_survival.reset_index(drop=True)    # renumber the rows from 0 and discard the old index
print(titanic_reindexed[0:20])


# Custom functions, applied to each row or each column in turn
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count, axis=0)    # count the missing values in each column with a custom function
print(column_null_count)


def which_class(row):
    pclass = row['Pclass'] 
    if pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    elif pclass == 3:
        return 'Third Class'
    else:
        return 'Unknown'
classes = titanic_survival.apply(which_class, axis=1)    # derive a class label for each row from its Pclass value with a custom function; note axis=1
print(classes)
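For these two particular tasks pandas also offers vectorized alternatives to apply. The sketch below uses made-up data; the dictionary name labels is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers.
df = pd.DataFrame({"Pclass": [1, 3, 2], "Age": [22.0, np.nan, 30.0]})

# axis=0: the function receives each column as a Series.
nulls_per_column = df.apply(lambda col: col.isnull().sum(), axis=0)

# Vectorized equivalent, without a custom function.
vectorized = df.isnull().sum()

# A value-to-label lookup can be done with Series.map instead of a row-wise apply.
labels = {1: "First Class", 2: "Second Class", 3: "Third Class"}
classes = df["Pclass"].map(labels)

print(nulls_per_column)
print(classes)
```

Values that are not keys of the dictionary come out of map as NaN, so an 'Unknown' fallback like the one in which_class would need an extra fillna step.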

 

5. Preprocessing loaded data together with numpy

import pandas as pd
import numpy as np

fandango = pd.read_csv('fandango_score_comparison.csv')
# print(type(fandango))
# Returns a new DataFrame that uses the given column as its index. With drop=True the column is removed from the data once it becomes the index; with drop=False it is kept.
fandango_films = fandango.set_index('FILM', drop=False)
# fandango_films
# print(fandango_films.index)

# Select data by index label
fandango_films["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]

fandango_films.loc['Southpaw (2015)']

movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]




# def func(column):
#     return np.std(column)

types = fandango_films.dtypes
# print(types)

float_columns = types[types.values == 'float64'].index  # names of the columns with a given dtype
# print(float_columns)
float_df = fandango_films[float_columns]        # the data in those columns
# print(float_df.dtypes)
# float_df
# print(float_df)
deviations = float_df.apply(lambda x: np.std(x))    # standard deviation of each column
print(deviations)
# print('----------------------->')
# print(float_df.apply(func))
# help(np.std)

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
print(rt_mt_user.apply(np.std, axis=1))   # standard deviation of each row
# rt_mt_user.apply(np.std, axis=0)
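When comparing standard deviations it helps to know which definition is in use: NumPy's np.std on an array divides by n (ddof=0), while the pandas std method divides by n - 1 (ddof=1) unless told otherwise. A minimal sketch on made-up numbers, converting each column with to_numpy so that the NumPy definition is applied explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical scores.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})

# Population standard deviation per column (ddof=0), computed by NumPy.
col_std_pop = df.apply(lambda x: np.std(x.to_numpy()))

# pandas' own method defaults to the sample standard deviation (ddof=1).
col_std_sample = df.std()

# Per-row population standard deviation, using pandas with an explicit ddof.
row_std = df.std(axis=1, ddof=0)

print(col_std_pop)
print(col_std_sample)
print(row_std)
```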

III. Series in a DataFrame

   A Series is the data structure for a single row or a single column of a DataFrame.

1. Getting a Series object

import pandas as pd
from pandas import Series

fandango = pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']   # the FILM column of fandango
# print(type(series_film))
print(series_film[0:5])
series_rt = fandango['RottenTomatoes']  # the RottenTomatoes column of fandango
print(series_rt[0:5])

 

2. Common operations on a Series object

file_names = series_film.values  # all values of series_film, returned as a <class 'numpy.ndarray'>
# print(type(file_names))
# print(file_names)
rt_scores = series_rt.values
# print(rt_scores)
series_custom = Series(rt_scores, index=file_names)   # build a new Series with file_names as the index and rt_scores as the values
# help(Series)
print(series_custom[['Top Five (2014)', 'Night at the Museum: Secret of the Tomb (2014)']])  # select by index label
# print(type(series_custom))
print('--------------------------------->')
print(series_custom[5:10])   # slice by position


# print(series_custom[["'71 (2015)"]])
original_index = series_custom.index.tolist()   # all index labels, converted to a list
# print(original_index)
sorted_index = sorted(original_index)   # sort the list
# print(sorted_index)
sorted_by_index = series_custom.reindex(sorted_index)   # reorder series_custom to follow the sorted list of labels
print(sorted_by_index)



sc2 = series_custom.sort_index()     # sort series_custom by index, ascending
# print(sc2)
sc3 = series_custom.sort_values(ascending=False)   # sort series_custom by value, descending
print(sc3)
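The reindex and sort_index approaches above give the same ordering; the difference is that reindex accepts any list of labels. A sketch with a hypothetical three-film Series:

```python
import pandas as pd

# Hypothetical scores indexed by film title.
s = pd.Series([74, 85, 80], index=["c film", "a film", "b film"])

via_reindex = s.reindex(sorted(s.index))   # reorder to match a sorted list of labels
via_sort_index = s.sort_index()            # sort by label directly
by_value = s.sort_values(ascending=False)  # sort by value, largest first

# reindex does not fail on unknown labels; it fills them with NaN.
with_missing = s.reindex(["a film", "z film"])

print(via_reindex)
print(by_value)
print(with_missing)
```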



import numpy as np
# print(np.add(series_custom, series_custom))   # NumPy functions can operate on series_custom as if it were an array
print(np.sin(series_custom))
print(np.max(series_custom))


# series_custom > 50
series_greater_than_50 = series_custom[series_custom > 50]    # the entries of series_custom with a value greater than 50
# series_greater_than_50



criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]   # the entries of series_custom with a value greater than 50 and less than 75
print(both_criteria)
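Two details of combining conditions are worth a quick check on a hypothetical Series: the element-wise operator is & (not the Python keyword and), and inline comparisons need parentheses because & binds more tightly than > and <.

```python
import pandas as pd

# Hypothetical scores.
s = pd.Series([40, 60, 75, 90], index=["w", "x", "y", "z"])

# Element-wise AND of two boolean masks; the parentheses are required.
between = s[(s > 50) & (s < 75)]

# The keyword `and` asks for a single truth value of a whole Series,
# which pandas refuses with a ValueError.
try:
    (s > 50) and (s < 75)
    keyword_and_failed = False
except ValueError:
    keyword_and_failed = True

print(between)
print(keyword_and_failed)
```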




rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2  # add the values of rt_critics and rt_users and divide by 2
print(rt_mean)
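Arithmetic between two Series matches values by index label, not by position. The following sketch uses made-up film names and scores to show this, including what happens to a label present in only one operand.

```python
import pandas as pd

# Hypothetical scores; the two Series list the films in different orders.
critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([90, 86, 50], index=["Film B", "Film A", "Film C"])

# Values are paired up by label before the addition.
mean_score = (critics + users) / 2

print(mean_score)
```

"Film C" has no counterpart in critics, so its result is NaN. In the article's example both Series are built from the same FILM column, so every label finds a match.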

 

