飲冰三年 - Artificial Intelligence - Pandas - 74 - A First Look at Pandas

This article is a repost. 2022-05-28 09:24

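I. What Is Pandas

Pandas is a third-party Python data analysis library built on NumPy, with plotting support provided through matplotlib. Its name derives from "panel data" and is also a play on "Python data analysis". Together with NumPy and matplotlib it forms the basic toolkit for data analysis in Python, and the three are nicknamed the "three musketeers of data analysis".

Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python. Its broader goal is to become the most powerful and flexible open-source data analysis tool available in any language.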

Its main features:

- High performance
- Easy-to-use data structures
- Easy-to-use data analysis tools

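The Python community has widely adopted naming conventions for some commonly used modules:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

This means that when you see np.arange, it refers to the arange function in NumPy. The convention exists because, in Python software development, importing everything from a large package such as NumPy (from numpy import *) is discouraged: a wildcard import floods the namespace and can shadow existing names.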
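To make the risk of the wildcard import concrete, here is a minimal sketch that lists the names NumPy exports which are also Python built-ins. It assumes that `np.__all__` is the list of names a wildcard import would bring in; the variable name `shadowed` is only illustrative.

```python
import builtins

import numpy as np

# Names that `from numpy import *` would bring into the current namespace
# and that are also Python built-ins (so the built-in would be shadowed).
shadowed = sorted(name for name in np.__all__ if hasattr(builtins, name))
print(shadowed)

# With the `np.` prefix both versions stay available side by side.
print(sum([1, 2, 3]), np.sum([1, 2, 3]))
```

The exact list depends on the installed NumPy version, which is one more reason to keep the explicit prefix.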

II. Reading Data with Pandas

1. Reading text data such as txt and CSV files

1.1 A comma-separated txt file

df = pd.read_csv('C:\\Users\\ywx1106919\\Desktop\\ex1.txt', encoding='utf-8')

1.2 A tab-separated csv file

df = pd.read_table('C:\\Users\\ywx1106919\\Desktop\\ex2.csv', encoding='utf-8')

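1.3 Summary of commonly used functions:

Function	Description
read_csv	Loads delimited data from a file, URL, or file-like object. The default delimiter is a comma.
read_table	Loads delimited data from a file, URL, or file-like object. The default delimiter is a tab ("\t").
read_fwf	Reads data in fixed-width column format (that is, without delimiters).
read_clipboard	Reads data from the clipboard; it can be seen as the clipboard version of read_table. Useful for converting tables on web pages into DataFrames.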
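As a small illustration of the least familiar entry in the table, the sketch below reads fixed-width text with read_fwf. The sample data and the column widths are made up for the example, and an in-memory buffer stands in for a file on disk.

```python
from io import StringIO

import pandas as pd

# Three fixed-width columns: 4, 6 and 5 characters wide (hypothetical sample data).
fixed_width_text = (
    "id  name  score\n"
    "1   Ann   90   \n"
    "2   Bob   85   \n"
)

# widths gives the width of each column; the first line is used as the header.
df_fwf = pd.read_fwf(StringIO(fixed_width_text), widths=[4, 6, 5])
print(df_fwf)
```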

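1.4 A few pointers for now; a later post may cover them in detail

The sep parameter changes the default delimiter.
The header parameter declares which row holds the column names.
The names parameter supplies custom column names.
The index_col parameter sets the column to use as the index.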
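The four parameters above can be combined in a single call. The following is a minimal sketch, assuming semicolon-separated text without a header row; the sample data and the column names are invented for the example, and an in-memory buffer stands in for a file path.

```python
from io import StringIO

import pandas as pd

# Semicolon-separated text without a header row (hypothetical sample data).
raw_text = "1;Math;90\n2;Physics;85\n"

df_opts = pd.read_csv(
    StringIO(raw_text),
    sep=";",                           # override the default comma separator
    header=None,                       # the data has no header row
    names=["id", "subject", "score"],  # supply the column names ourselves
    index_col="id",                    # use the id column as the row index
)
print(df_opts)
```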

2. Reading data from Excel

Note: reading .xlsx files with read_excel requires the openpyxl package

pip install openpyxl

2.1 Reading an Excel file

df = pd.read_excel('C:\\Users\\ywx1106919\\Desktop\\ex3.xlsx')

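3. Reading data from a database

3.1 Reading data from a MySQL database

Note: connecting to MySQL requires a driver package. The example below uses pymysql; mysqlclient is an alternative driver. MYSQL_HOST, MYSQL_PORT, MYSQL_NAME, MYSQL_USER and MYSQL_PASSWORD stand for your own connection settings.

pip install pymysql

pip install mysqlclient

import pymysql

conn = pymysql.connect(host=MYSQL_HOST, port=MYSQL_PORT, db=MYSQL_NAME, user=MYSQL_USER, password=MYSQL_PASSWORD)
mysql_page = pd.read_sql("SELECT id,subject_name FROM tb_course", con=conn)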

3.2 Bonus: fetching data the old way, with a raw DBAPI connection as above, triggers a warning:

 UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

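In newer versions of pandas, the con parameter should be an engine object created with create_engine from the sqlalchemy library. create_engine takes a connection string formatted like a URL.

This requires installing sqlalchemy.

from sqlalchemy import create_engine


def get_data_from_db():
    """
    In newer versions of pandas, pass an engine created with sqlalchemy's
    create_engine as the con argument. The connection string is formatted like a URL.
    """
    engine = create_engine('mysql+pymysql://%s:%s@%s:%s/%s?charset=utf8'
                           % (MYSQL_USER, MYSQL_PASSWORD, MYSQL_HOST, MYSQL_PORT, MYSQL_NAME))
    mysql_page = pd.read_sql("SELECT * FROM tb_score", engine)
    print(mysql_page)

Demo for newer versions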
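If no MySQL server is at hand, the same read_sql call shape can be tried against an in-memory SQLite database, which pandas accepts as a plain sqlite3 connection. This is only a stand-in sketch: the table name tb_course is borrowed from the example above, and the rows are made up.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_course (id INTEGER PRIMARY KEY, subject_name TEXT)")
conn.executemany(
    "INSERT INTO tb_course (id, subject_name) VALUES (?, ?)",
    [(1, "Math"), (2, "Physics")],
)
conn.commit()

# Same call shape as the MySQL example: a SQL string plus a connection.
courses = pd.read_sql("SELECT id, subject_name FROM tb_course", con=conn)
print(courses)
conn.close()
```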

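III. Pandas Data Structures

To use pandas, you first need to get comfortable with its two main data structures: Series and DataFrame. They are not a universal solution to every problem, but they provide a solid, easy-to-use foundation for most applications.

Series

A Series is a one-dimensional array-like object. It consists of a sequence of values (of any NumPy data type) and an associated array of data labels, called its index.

1: The simplest Series is created from just a sequence of values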

import pandas as pd


def series_01():
    pd_s = pd.Series([4, 7, -5, 3])
    print(pd_s)
    print(pd_s[3])


series_01()

Code

0    4
1    7
2   -5
3    3
dtype: int64
3

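Output

Access: values are retrieved by index; Python slicing works as well.

Display: the index is on the left, the values on the right.

Since no index was specified for the data, a default integer index from 0 to N-1 is created automatically. You can get the index object and the array representation of a Series through its index and values attributes.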
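A short sketch of the two attributes and of slicing, reusing the values from the example above:

```python
import pandas as pd

pd_s = pd.Series([4, 7, -5, 3])

print(pd_s.index)    # default integer index covering 0..3
print(pd_s.values)   # the underlying NumPy array
print(pd_s[1:3])     # slicing selects positions 1 and 2
```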

2: A Series built from values plus an index

Often you will want to create a Series with an index that labels each data point.

def series_02():
    # Create an index that labels each data point
    pd_s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
    print(pd_s)
    print("------select by position---------")
    print(pd_s[1:3])
    print("------select by index label---------")
    print(pd_s[["a", "b", "c"]])
    print("-------view the index--------")
    print(pd_s.index)
    print("-------view the values--------")
    print(pd_s.values)

Code

d    4
b    7
a   -5
c    3
dtype: int64
------select by position---------
b    7
a   -5
dtype: int64
------select by index label---------
a   -5
b    7
c    3
dtype: int64
-------view the index--------
Index(['d', 'b', 'a', 'c'], dtype='object')
-------view the values--------
[ 4  7 -5  3]

Output

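You can select a single value or a group of values from a Series by index label. The example above uses pd_s[["a", "b", "c"]] to select a group of values; pd_s["a"] selects a single value.

3: Mathematical operations

Common mathematical operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) preserve the link between index and values.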

import numpy as np


def series_03():
    pd_s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
    print(pd_s)
    print("------filter with a boolean array---------")
    print(pd_s[pd_s > 0])
    print("------scalar multiplication---------")
    print(pd_s * 2)
    print("-------apply a math function--------")
    # np.exp(x) returns e raised to the power x, where e is a constant of about 2.71828.
    # np.exp(1) is e itself and np.exp(2) is e squared.
    print(np.exp(pd_s))

Code

d    4
b    7
a   -5
c    3
dtype: int64
------filter with a boolean array---------
d    4
b    7
c    3
dtype: int64
------scalar multiplication---------
d     8
b    14
a   -10
c     6
dtype: int64
-------apply a math function--------
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Output

4: A Series can be thought of as a fixed-length, ordered dict

A Series is itself a mapping from index values to data values.

def series_04():
    dict_data = {"張三": 20, "李四": 22, "王五": 23, "趙六": 20}
    pd_s = pd.Series(dict_data)  # convert the dict directly
    print(pd_s)
    print("------check membership---------")
    # Membership is tested against the index of the Series
    print("張三" in pd_s, "張三三" in pd_s)
    print("------view the index---------")
    print(pd_s.index)
    print("-------modify the index in place by assignment--------")
    pd_s.index = ["zhangsan", "lisi", "wangwu", "zhaoliu"]
    print(pd_s)

Code

張三    20
李四    22
王五    23
趙六    20
dtype: int64
------check membership---------
True False
------view the index---------
Index(['張三', '李四', '王五', '趙六'], dtype='object')
-------modify the index in place by assignment--------
zhangsan    20
lisi        22
wangwu      23
zhaoliu     20
dtype: int64

Output
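The dict analogy goes a little further than the example shows. The sketch below, using a shortened version of the same data, illustrates a lookup with a default and what happens when an explicit index asks for a label the dict does not contain; the variable names are only illustrative.

```python
import pandas as pd

ages = pd.Series({"張三": 20, "李四": 22, "王五": 23})

# Dict-like membership test and lookup with a default.
print("張三" in ages)          # membership is checked against the index
print(ages.get("趙六", -1))    # falls back to the default when the label is missing

# Passing an explicit index selects and orders the keys; labels that are
# not in the dict get NaN, which also turns the values into floats.
reordered = pd.Series({"張三": 20, "李四": 22, "王五": 23}, index=["王五", "張三", "趙六"])
print(reordered)
print(reordered.isna())
```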

DataFrame

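A DataFrame is a tabular data structure. It contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on), much like the fields of a database table.

A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.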
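The "dict of Series" view can be taken literally. Here is a minimal sketch, with made-up values, that builds a DataFrame from two Series sharing one index and then looks a column up by key:

```python
import pandas as pd

years = pd.Series([2020, 2021], index=["one", "two"])
pops = pd.Series([3, 2], index=["one", "two"])

# Each dict key becomes a column; the Series index becomes the row index.
frame = pd.DataFrame({"year": years, "pop": pops})
print(frame)

# Looking up a column by key gives back a Series, much like a dict lookup.
print(type(frame["year"]))
print(frame.index)
print(frame.columns)
```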

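1: Creating a DataFrame from a dict of sequences

There are many ways to construct a DataFrame; one of the most common is to pass a dict of equal-length lists or NumPy arrays.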

def data_frame_01():
    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data)
    print(frame)


data_frame_01()

Code

  name  year  pop
0   張三  2020    3
1   李四  2021    2
2   王五  2022    1
3   趙六  2023    4

Output

As the output shows, the DataFrame automatically adds an index. Likewise, we can specify the column order and a custom index.

def data_frame_02():
    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data, columns=['pop', 'year', 'name', 'age'], index=['one', 'two', 'three', 'four'])
    print(frame)

Code

       pop  year name  age
one      3  2020   張三  NaN
two      2  2021   李四  NaN
three    1  2022   王五  NaN
four     4  2023   趙六  NaN

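Output

As with Series, if a requested column is not found in the data, it is filled with missing values (NaN).

2: Selecting a Series from a DataFrame

Selecting a single row or a single column returns a Series.

- Column selection: dict-like notation or attribute access, e.g. frame['pop']. Attribute access does not work for this particular column, because frame.pop is a DataFrame method.
- Row selection by position: frame.iloc[1]
- Row selection by index label: frame.loc['one']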

def data_frame_03():
    # Selecting a single row or a single column returns a Series.
    # Column selection: dict-like notation, frame['pop']
    # Row selection by position: frame.iloc[1]
    # Row selection by index label: frame.loc['one']
    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data, columns=['pop', 'year', 'name', 'age'], index=['one', 'two', 'three', 'four'])

    a_col = frame['pop']
    print("-------------select a column----------")
    print(a_col)

    a_row_by_iloc = frame.iloc[1]
    print("-------------select a row by position----------")
    print(a_row_by_iloc)

    a_row_by_loc = frame.loc['one']
    print("-------------select a row by index label----------")
    print(a_row_by_loc)

Code

-------------select a column----------
one      3
two      2
three    1
four     4
Name: pop, dtype: int64
-------------select a row by position----------
pop        2
year    2021
name      李四
age      NaN
Name: two, dtype: object
-------------select a row by index label----------
pop        3
year    2020
name      張三
age      NaN
Name: one, dtype: object

Output
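One caveat about attribute-style column access is worth a quick check: it only works when the column name does not clash with an existing DataFrame attribute. The sketch below uses a cut-down version of the same data to show that 'year' is reachable as an attribute while 'pop' is not.

```python
import pandas as pd

frame = pd.DataFrame(
    {"pop": [3, 2], "year": [2020, 2021]},
    index=["one", "two"],
)

# Attribute access works when the column name does not clash with anything.
print(frame.year)

# 'pop' is also the name of a DataFrame method, so the attribute is the method...
print(callable(frame.pop))

# ...and the column has to be selected with the bracket notation.
print(frame["pop"])
```

For that reason the bracket notation is the safer default in scripts.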

Selecting multiple rows or columns still returns a DataFrame.

- Column selection: pass a list of column names, frame[['pop', 'year']]
- Row selection by index label: frame.loc[['one', 'three']]
- Row selection by position: frame.iloc[0:2]

def data_frame_03_n():
    # Selecting multiple rows or columns still returns a DataFrame.
    # Column selection: pass a list of column names, frame[['pop', 'year']]
    # Row selection by index label: frame.loc[['one', 'three']]
    # Row selection by position: frame.iloc[0:2]

    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data, columns=['pop', 'year', 'name', 'age'], index=['one', 'two', 'three', 'four'])

    m_col = frame[['pop', 'year']]
    print("-------------select multiple columns----------")
    print(m_col)

    m_row_by_iloc = frame.iloc[0:2]
    print("-------------select multiple rows by position----------")
    print(m_row_by_iloc)

    m_row_by_loc = frame.loc[['one', 'three']]
    print("-------------select multiple rows by index label----------")
    print(m_row_by_loc)

Code

-------------select multiple columns----------
       pop  year
one      3  2020
two      2  2021
three    1  2022
four     4  2023
-------------select multiple rows by position----------
     pop  year name  age
one    3  2020   張三  NaN
two    2  2021   李四  NaN
-------------select multiple rows by index label----------
       pop  year name  age
one      3  2020   張三  NaN
three    1  2022   王五  NaN

Output

3: Modifying a DataFrame

In short: select the target, then assign.

import numpy as np


def data_frame_03_m():
    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data, columns=['pop', 'year', 'name', 'age'], index=['one', 'two', 'three', 'four'])

    frame['age'] = 18
    print("-------------select a column, then assign one value to every row----------")
    print(frame)

    frame['age'] = np.arange(4)
    print("-------------select a column, then assign an array of matching length----------")
    print(frame)

    val = pd.Series([17, 18], index=['one', 'three'])
    frame['age'] = val
    print("-------------assign a Series (values are matched by index label)----------")
    print(frame)

Code

-------------select a column, then assign one value to every row----------
       pop  year name  age
one      3  2020   張三   18
two      2  2021   李四   18
three    1  2022   王五   18
four     4  2023   趙六   18
-------------select a column, then assign an array of matching length----------
       pop  year name  age
one      3  2020   張三    0
two      2  2021   李四    1
three    1  2022   王五    2
four     4  2023   趙六    3
-------------assign a Series (values are matched by index label)----------
       pop  year name   age
one      3  2020   張三  17.0
two      2  2021   李四   NaN
three    1  2022   王五  18.0
four     4  2023   趙六   NaN

Output
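The same "select, then assign" pattern also applies to individual cells and to filtered rows, not only to whole columns. A minimal sketch, assuming a small made-up frame and using loc for the selection:

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["張三", "李四"], "age": [18, 18]},
    index=["one", "two"],
)

# Select one cell by row label and column name, then assign.
frame.loc["two", "age"] = 20

# Select several rows with a boolean mask, then assign to one column.
frame.loc[frame["age"] >= 20, "name"] = "李四 (updated)"
print(frame)
```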

4: Adding a new column to a DataFrame

Assigning to a column that does not exist creates a new column.

def data_frame_03_a():
    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data, columns=['pop', 'year', 'name', 'age'], index=['one', 'two', 'three', 'four'])

    val = pd.Series([17, 18], index=['one', 'three'])
    frame['age'] = val
    print("-------------assign a Series (values are matched by index label)----------")
    print(frame)

    # has_age does not exist yet, so this assignment creates it
    frame["has_age"] = frame.age > 0
    print(frame)

Code

-------------assign a Series (values are matched by index label)----------
       pop  year name   age
one      3  2020   張三  17.0
two      2  2021   李四   NaN
three    1  2022   王五  18.0
four     4  2023   趙六   NaN
       pop  year name   age  has_age
one      3  2020   張三  17.0     True
two      2  2021   李四   NaN    False
three    1  2022   王五  18.0     True
four     4  2023   趙六   NaN    False

Output

5: Deleting a column from a DataFrame

The del keyword deletes a column.

def data_frame_03_d():
    data = {
        'name': ['張三', '李四', '王五', '趙六'],
        'year': [2020, 2021, 2022, 2023],
        'pop': [3, 2, 1, 4],
    }
    frame = pd.DataFrame(data, columns=['pop', 'year', 'name', 'age'], index=['one', 'two', 'three', 'four'])

    del frame["pop"]
    print(frame)

Code

       year name  age
one    2020   張三  NaN
two    2021   李四  NaN
three  2022   王五  NaN
four   2023   趙六  NaN

Output
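del modifies the DataFrame in place. When the original should stay untouched, the drop method is a common alternative, since it returns a new DataFrame. A minimal sketch with a cut-down frame (the name `without_pop` is only illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"pop": [3, 2], "year": [2020, 2021]}, index=["one", "two"])

# drop returns a new DataFrame; the original keeps all of its columns.
without_pop = frame.drop(columns=["pop"])
print(without_pop)
print(frame.columns)
```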

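6: loc and iloc

loc is label-based selection: its input is index labels. When slicing, loc differs from Python's (:) semantics: both endpoints are included (closed on the left and on the right).
iloc is integer-position-based selection: its input is integer positions. When slicing, iloc follows Python's (:) semantics: the start is included and the end is excluded.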

def data_frame_03_di():
    # Another common form of data is a nested dict: the outer keys become columns, the inner keys become the row index
    # loc is label-based: when slicing, both endpoints are included
    # iloc is integer-position-based: when slicing, the end is excluded, as in Python's (:)
    data = {
        "name": {1: "張三", 3: "李四", 5: "王五", 7: "趙六"},
        "year": {1: 2020, 3: 2021, 5: 2022, 7: 2023},
        "pop": {1: 3, 3: 2, 5: 1, 40: 4}
    }
    frame = pd.DataFrame(data)
    print(frame)

    m1_row_by_iloc = frame.iloc[0:3, 1]
    print("-------------select multiple rows by position----------")
    print(m1_row_by_iloc)

    m2_row_by_loc = frame.loc[0:3, 'year']
    print("-------------select multiple rows by index label----------")
    print(m2_row_by_loc)

Code

   name    year  pop
1    張三  2020.0  3.0
3    李四  2021.0  2.0
5    王五  2022.0  1.0
7    趙六  2023.0  NaN
40  NaN     NaN  4.0
-------------select multiple rows by position----------
1    2020.0
3    2021.0
5    2022.0
Name: year, dtype: float64
-------------select multiple rows by index label----------
1    2020.0
3    2021.0
Name: year, dtype: float64

Output
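With integer labels the two slicing rules are easy to confuse, so here is a smaller sketch with string labels, where the difference between the inclusive label slice and the exclusive position slice is easier to see. The data is made up for the example.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["b":"c"]     # label slice: both "b" and "c" are included
by_position = s.iloc[1:2]     # position slice: position 2 is excluded
print(by_label)
print(by_position)
```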

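IV. Important Pandas Parameters

axis

Many pandas functions take an important parameter, axis. axis=0 refers to the row axis (the index) and axis=1 refers to the column axis.

1: axis = 0 or "index"

- For an operation on a single item (such as drop), it refers to a particular row
- For an aggregation (such as mean), it computes across rows, giving one result per column

2: axis = 1 or "columns"

- For an operation on a single item (such as drop), it refers to a particular column
- For an aggregation (such as mean), it computes across columns, giving one result per row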

import pandas as pd
import numpy as np


def init_data():
    df_data = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"])
    return df_data


def test_one():
    df = init_data()
    print(df)
    df_1 = df.drop(1, axis=0)
    print(df_1)
    df_2 = df.drop("A", axis="columns")
    print(df_2)


def test_many():
    df = init_data()
    print(df)
    df_1 = df.mean(axis=0)
    print(df_1)
    df_2 = df.mean(axis=1)
    print(df_2)


def get_sum_x(x):
    return x["A"] + x["B"] + x["C"] + x["D"]


def test_sum_x():
    df = init_data()
    df["sum_x"] = df.apply(get_sum_x, axis=1)
    print(df)

Code

   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
   A  B   C   D
0  0  1   2   3
2  8  9  10  11
   B   C   D
0  1   2   3
1  5   6   7
2  9  10  11

Output of test_one

   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
A    4.0
B    5.0
C    6.0
D    7.0
dtype: float64
0    1.5
1    5.5
2    9.5
dtype: float64

Output of test_many

   A  B   C   D  sum_x
0  0  1   2   3      6
1  4  5   6   7     22
2  8  9  10  11     38

Output of test_sum_x
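For a plain row sum, the apply call in test_sum_x can likely be replaced by the built-in aggregation with the same axis argument. The sketch below rebuilds the same 3x4 frame and compares the two approaches; it is an alternative for illustration, not a change to the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"])

# Row-wise sum with apply, as in test_sum_x above.
via_apply = df.apply(lambda row: row["A"] + row["B"] + row["C"] + row["D"], axis=1)

# The same row-wise sum with the built-in aggregation.
via_sum = df.sum(axis=1)
print(via_sum)
```

Both calls use axis=1 because the values are combined across the columns of each row.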

Next: 飲冰三年 - Artificial Intelligence - Pandas - 77 - Pandas Data Queries
