Getting Started with pandas (Part 1)

Reposted article. 2021-06-20 20:07 · Python

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

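pandas is a flexible and powerful toolkit for data processing and data analysis. It builds on NumPy (a high-performance N-dimensional array library), wraps Matplotlib (plotting) and file I/O, and is widely used for data cleaning, data analysis and data mining.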

Website: https://pandas.pydata.org/

Documentation: https://pandas.pydata.org/docs/

If you are completely new to NumPy, the previous article is a good place to start:

https://www.cnblogs.com/bytesfly/p/numpy.html

In [1]:

 
# Import the modules used throughout this article
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pd.show_versions() prints fuller version details
pd.__version__
 
             

Out[1]:

'1.1.3'

If you are new to Python, note that help() shows the documentation for any object at any time:

In [2]:

 
# help(np.random)
# help(pd)
# help(plt)
# help(pd.DataFrame)

# help() also accepts a method of an instance
# df = pd.DataFrame(np.random.randint(50, 100, (6, 5)))
# help(df.to_csv)
 
             

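pandas data structures

pandas has two core data structures: Series (one-dimensional) and DataFrame (two-dimensional, tabular). MultiIndex is not a separate three-dimensional container; it is a hierarchical index that lets a Series or DataFrame represent higher-dimensional data.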

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

A Series is similar to a one-dimensional array and consists of two parts: the data and the index associated with it.

Creating a Series:

Without an explicit index, the default index 0 to N-1 is used

In [3]:

 
              s1 = pd.Series([12, 4, 7, 9])
s1

Out[3]:

0    12
1     4
2     7
3     9
dtype: int64

Access values by index:

In [4]:

s1[0]

Out[4]:

12

In [5]:

s1[3]

Out[5]:

9

With an explicit index

In [6]:

 
              s2 = pd.Series([12, 4, 7, 9], index=["a", "b", "c", "d"])
s2

Out[6]:

a    12
b     4
c     7
d     9
dtype: int64

Access values by index:

In [7]:

s2['a']

Out[7]:

12

In [8]:

s2['d']

Out[8]:

9

Series can be instantiated from dicts

In [9]:

 
              s3 = pd.Series({"d": 12, "c": 4, "b": 7, "a": 9})
s3

Out[9]:

d    12
c     4
b     7
a     9
dtype: int64

When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order, if you’re using Python version >= 3.6 and pandas version >= 0.23.

Access values by index:

In [10]:

s3['d']

Out[10]:

12

In [11]:

s3['a']

Out[11]:

9

A Series also exposes two attributes, index and values:

In [12]:

 
              s1.index

Out[12]:

RangeIndex(start=0, stop=4, step=1)

In [13]:

 
              s2.index

Out[13]:

Index(['a', 'b', 'c', 'd'], dtype='object')

In [14]:

 
              s3.index

Out[14]:

Index(['d', 'c', 'b', 'a'], dtype='object')

In [15]:

 
              s3.values

Out[15]:

array([12,  4,  7,  9])

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [16]:

 
              pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

Out[16]:

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

NaN (not a number) is the standard missing data marker used in pandas.

Note: NaN here marks a missing value.
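A minimal sketch of how missing values can be detected and handled once they appear. The variable names are illustrative only; isna(), fillna() and dropna() are standard Series methods. Note that integer data is upcast to float64 as soon as a NaN is present.

```python
import numpy as np
import pandas as pd

# Label "d" is absent from the dict, so its value becomes NaN.
s = pd.Series({"a": 0, "b": 1, "c": 2}, index=["b", "c", "d", "a"])

missing_mask = s.isna()              # True where the value is missing
n_missing = int(missing_mask.sum())  # number of missing entries
filled = s.fillna(-1)                # replace NaN with a sentinel value
dropped = s.dropna()                 # or drop the missing entries entirely
```

Whether to fill or drop depends on the analysis; the sentinel -1 above is an arbitrary choice for illustration.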

If data is a scalar, an index must be provided:

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [17]:

 
              pd.Series(5.0, index=["c", "a", "b"])

Out[17]:

c    5.0
a    5.0
b    5.0
dtype: float64

The construction above is equivalent to:

In [18]:

 
              pd.Series([5.0, 5.0, 5.0], index=["c", "a", "b"])

Out[18]:

c    5.0
a    5.0
b    5.0
dtype: float64

Many ndarray operations also work on a Series:

Series is ndarray-like. Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [19]:

 
# First three elements of s1
s1[:3]

Out[19]:

0    12
1     4
2     7
dtype: int64

In [20]:

 
# Which elements are greater than 10
s1 > 10

Out[20]:

0     True
1    False
2    False
3    False
dtype: bool

For more operations, see the ndarray section of the previous NumPy article: https://www.cnblogs.com/bytesfly/p/numpy.html
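As a rough illustration of the "ndarray-like" behaviour, the sketch below passes a Series to a NumPy function and then shows the one place where a Series differs from a plain array: arithmetic between two Series aligns on index labels first. The names roots, tail and aligned are made up for this example.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([12, 4, 7, 9])

roots = np.sqrt(s1)   # a NumPy ufunc returns a Series with the same index
doubled = s1 * 2      # element-wise arithmetic, as with an ndarray
tail = s1[1:]         # slicing also slices the index: labels 1, 2, 3

# Unlike ndarrays, two Series are aligned on their labels before adding.
# Labels 0 and 3 exist on only one side, so they become NaN.
aligned = s1[1:] + s1[:-1]
```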

In [21]:

 
              s1.dtype

Out[21]:

dtype('int64')

If you need the actual array backing a Series, use Series.array

In [22]:

 
# The array backing the Series
s1.array

Out[22]:

<PandasArray>
[12, 4, 7, 9]
Length: 4, dtype: int64

While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy()

In [23]:

 
# Convert the Series to an ndarray
s1.to_numpy()

Out[23]:

array([12,  4,  7,  9])

Some dict operations also work on a Series:

Series is dict-like. A Series is like a fixed-size dict in that you can get and set values by index label

In [24]:

 
              s3['a'] = 100
s3

Out[24]:

d     12
c      4
b      7
a    100
dtype: int64

In [25]:

 
              "a" in s3

Out[25]:

True

In [26]:

 
              "y" in s3

Out[26]:

False

In [27]:

 
# Fall back to NaN when the label is missing
s3.get("y", np.nan)

Out[27]:

nan

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

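A DataFrame is a two-dimensional, table-like object with both a row index and a column index.

Row index (row labels): identifies each row; exposed as index; axis 0 (axis=0)
Column index (column labels): identifies each column; exposed as columns; axis 1 (axis=1)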
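The axis numbers matter most when aggregating. A small sketch, using a made-up frame, of what axis=0 and axis=1 mean for sum():

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]},
                  index=["r1", "r2", "r3"])

col_totals = df.sum(axis=0)  # collapse the rows: one result per column
row_totals = df.sum(axis=1)  # collapse the columns: one result per row
```

One way to remember it: the axis argument names the axis that disappears from the result.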

Creating a DataFrame:

Without row or column labels, the default indexes 0 to N-1 are used

In [28]:

 
# Randomly generate scores for 6 students in 5 subjects
score = np.random.randint(50, 100, (6, 5))

# Create the DataFrame
pd.DataFrame(score)

Out[28]:

	0	1	2	3	4
0	69	90	56	97	79
1	57	98	70	57	82
2	63	66	98	78	63
3	74	58	75	57	68
4	94	78	83	72	73
5	60	73	62	72	79

With row and column labels

In [29]:

 
# Column labels
subjects = ["語文", "數學", "英語", "物理", "化學"]

# Row labels
stus = ['學生' + str(i+1) for i in range(score.shape[0])]

# Create the DataFrame
score_df = pd.DataFrame(score, columns=subjects, index=stus)

score_df

Out[29]:

	語文	數學	英語	物理	化學
學生1	69	90	56	97	79
學生2	57	98	70	57	82
學生3	63	66	98	78	63
學生4	74	58	75	57	68
學生5	94	78	83	72	73
學生6	60	73	62	72	79

With row and column labels the data is much easier to read.

Some basic DataFrame attributes:

In [30]:

 
              score_df.shape

Out[30]:

(6, 5)

In [31]:

 
              score_df.columns

Out[31]:

Index(['語文', '數學', '英語', '物理', '化學'], dtype='object')

In [32]:

 
              score_df.index

Out[32]:

Index(['學生1', '學生2', '學生3', '學生4', '學生5', '學生6'], dtype='object')

In [33]:

 
              score_df.values

Out[33]:

array([[69, 90, 56, 97, 79],
       [57, 98, 70, 57, 82],
       [63, 66, 98, 78, 63],
       [74, 58, 75, 57, 68],
       [94, 78, 83, 72, 73],
       [60, 73, 62, 72, 79]])

In [34]:

 
# Transpose
score_df.T

Out[34]:

	學生1	學生2	學生3	學生4	學生5	學生6
語文	69	57	63	74	94	60
數學	90	98	66	58	78	73
英語	56	70	98	75	83	62
物理	97	57	78	57	72	72
化學	79	82	63	68	73	79

In [35]:

 
# Show the first 3 rows
score_df.head(3)

Out[35]:

	語文	數學	英語	物理	化學
學生1	69	90	56	97	79
學生2	57	98	70	57	82
學生3	63	66	98	78	63

In [36]:

 
# Show the last 3 rows
score_df.tail(3)

Out[36]:

	語文	數學	英語	物理	化學
學生4	74	58	75	57	68
學生5	94	78	83	72	73
學生6	60	73	62	72	79

Changing the row labels:

In [37]:

 
               stus = ['stu' + str(i+1) for i in range(score.shape[0])]

score_df.index = stus

score_df

Out[37]:

	語文	數學	英語	物理	化學
stu1	69	90	56	97	79
stu2	57	98	70	57	82
stu3	63	66	98	78	63
stu4	74	58	75	57	68
stu5	94	78	83	72	73
stu6	60	73	62	72	79

Resetting the row labels:

In [38]:

 
# drop defaults to False: the old index is kept as a column
score_df.reset_index()

Out[38]:

	index	語文	數學	英語	物理	化學
0	stu1	69	90	56	97	79
1	stu2	57	98	70	57	82
2	stu3	63	66	98	78	63
3	stu4	74	58	75	57	68
4	stu5	94	78	83	72	73
5	stu6	60	73	62	72	79

In [39]:

 
# drop=True discards the old index
score_df.reset_index(drop=True)

Out[39]:

	語文	數學	英語	物理	化學
0	69	90	56	97	79
1	57	98	70	57	82
2	63	66	98	78	63
3	74	58	75	57	68
4	94	78	83	72	73
5	60	73	62	72	79

In [40]:

score_df

Out[40]:

	語文	數學	英語	物理	化學
stu1	69	90	56	97	79
stu2	57	98	70	57	82
stu3	63	66	98	78	63
stu4	74	58	75	57	68
stu5	94	78	83	72	73
stu6	60	73	62	72	79

Using a column as the new index:

set_index(keys, drop=True)

keys : a column label or a list of column labels
drop : boolean, default True. Remove the column(s) once they are used as the new index
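A short sketch of the drop parameter described above, using a small made-up frame. set_index returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

hero = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})

moved = hero.set_index("id")             # drop=True: "id" leaves the columns
kept = hero.set_index("id", drop=False)  # "id" is both the index and a column
```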

In [41]:

 
               hero_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                        'name': ['李尋歡', '令狐沖', '張無忌', '郭靖', '花無缺'],
                        'book': ['多情劍客無情劍', '笑傲江湖', '倚天屠龍記', '射雕英雄傳', '絕代雙驕'],
                        'skill': ['小李飛刀', '獨孤九劍', '九陽神功', '降龍十八掌', '移花接玉']})

hero_df.set_index('id')

Out[41]:

	name	book	skill
id
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌
5	花無缺	絕代雙驕	移花接玉

Using several columns, id and name, as the index:

In [42]:

 
               df = hero_df.set_index(['id', 'name'])
df

Out[42]:

		book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌
5	花無缺	絕代雙驕	移花接玉

In [43]:

 
              df.index

Out[43]:

MultiIndex([(1, '李尋歡'),
            (2, '令狐沖'),
            (3, '張無忌'),
            (4,  '郭靖'),
            (5, '花無缺')],
           names=['id', 'name'])

df is now a DataFrame with a MultiIndex.

MultiIndex

In [44]:

 
              df.index.names

Out[44]:

FrozenList(['id', 'name'])

In [45]:

 
              df.index.levels

Out[45]:

FrozenList([[1, 2, 3, 4, 5], ['令狐沖', '張無忌', '李尋歡', '花無缺', '郭靖']])
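Beyond names and levels, a few other MultiIndex operations come up often. The sketch below is self-contained and uses placeholder data rather than the hero table; get_level_values, xs and reset_index are existing pandas methods.

```python
import pandas as pd

df = pd.DataFrame(
    {"skill": ["x", "y", "z"]},
    index=pd.MultiIndex.from_tuples(
        [(1, "A"), (2, "B"), (3, "C")], names=["id", "name"]
    ),
)

ids = df.index.get_level_values("id")  # values of one level, row by row
by_name = df.xs("B", level="name")     # cross-section on the inner level
flat = df.reset_index()                # levels become ordinary columns again
```

xs drops the level it selects on, so by_name is indexed by id alone.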

Data operations and computation

In [46]:

df

Out[46]:

		book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌
5	花無缺	絕代雙驕	移花接玉

Indexing

Direct indexing with [] (column first, then row):

In [47]:

 
              df['skill']

Out[47]:

id  name
1   李尋歡      小李飛刀
2   令狐沖      獨孤九劍
3   張無忌      九陽神功
4   郭靖      降龍十八掌
5   花無缺      移花接玉
Name: skill, dtype: object

In [48]:

 
              df['skill'][1]

Out[48]:

name
李尋歡    小李飛刀
Name: skill, dtype: object

In [49]:

 
              df['skill'][1]['李尋歡']

Out[49]:

'小李飛刀'

Using loc (select by row and column labels)

In [50]:

 
               df.loc[1:3]

Out[50]:

		book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功

In [51]:

 
               df.loc[(2, '令狐沖'):(4, '郭靖')]

Out[51]:

		book	skill
id	name
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌

In [52]:

 
              df.loc[1:3, 'book']

Out[52]:

id  name
1   李尋歡     多情劍客無情劍
2   令狐沖        笑傲江湖
3   張無忌       倚天屠龍記
Name: book, dtype: object

In [53]:

 
               df.loc[df.index[1:3], ['book', 'skill']]

Out[53]:

		book	skill
id	name
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功

Using iloc (select by integer position)

In [54]:

 
# First 2 rows
df.iloc[:2]

Out[54]:

		book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍

In [55]:

 
               df.iloc[0:2, df.columns.get_indexer(['skill'])]

Out[55]:

		skill
id	name
1	李尋歡	小李飛刀
2	令狐沖	獨孤九劍
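One difference between loc and iloc that is easy to miss: a label slice with loc includes both endpoints, while a positional slice with iloc excludes the end, as in plain Python. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]  # label slice: both endpoints included
by_pos = s.iloc[0:2]       # positional slice: end excluded
```

This is why df.loc[1:3] earlier returned three rows.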

Assignment

In [56]:

 
# Add new columns
df['score'] = 100
df['gender'] = 'male'

df

Out[56]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	100	male
2	令狐沖	笑傲江湖	獨孤九劍	100	male
3	張無忌	倚天屠龍記	九陽神功	100	male
4	郭靖	射雕英雄傳	降龍十八掌	100	male
5	花無缺	絕代雙驕	移花接玉	100	male

In [57]:

 
# Overwrite a column
df['score'] = 99

df

Out[57]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	99	male
2	令狐沖	笑傲江湖	獨孤九劍	99	male
3	張無忌	倚天屠龍記	九陽神功	99	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	99	male

In [58]:

 
# Attribute access also overwrites an existing column
df.score = 100

df

Out[58]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	100	male
2	令狐沖	笑傲江湖	獨孤九劍	100	male
3	張無忌	倚天屠龍記	九陽神功	100	male
4	郭靖	射雕英雄傳	降龍十八掌	100	male
5	花無缺	絕代雙驕	移花接玉	100	male

Sorting

sort_index

In [59]:

 
# Sort by index, descending
df.sort_index(ascending=False)

Out[59]:

		book	skill	score	gender
id	name
5	花無缺	絕代雙驕	移花接玉	100	male
4	郭靖	射雕英雄傳	降龍十八掌	100	male
3	張無忌	倚天屠龍記	九陽神功	100	male
2	令狐沖	笑傲江湖	獨孤九劍	100	male
1	李尋歡	多情劍客無情劍	小李飛刀	100	male

sort_values

First give score distinct values:

In [60]:

 
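# Assign through a single .loc call. Chained indexing such as
# df['score'][1]['李尋歡'] = 80 may write to a temporary copy.
df.loc[(1, '李尋歡'), 'score'] = 80
df.loc[(2, '令狐沖'), 'score'] = 96
df.loc[(3, '張無忌'), 'score'] = 86
df.loc[(4, '郭靖'), 'score'] = 99
df.loc[(5, '花無缺'), 'score'] = 95

df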
 
              

Out[60]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
3	張無忌	倚天屠龍記	九陽神功	86	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	95	male

In [61]:

 
# Sort by score, descending
df.sort_values(by='score', ascending=False)

Out[61]:

		book	skill	score	gender
id	name
4	郭靖	射雕英雄傳	降龍十八掌	99	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
5	花無缺	絕代雙驕	移花接玉	95	male
3	張無忌	倚天屠龍記	九陽神功	86	male
1	李尋歡	多情劍客無情劍	小李飛刀	80	male

In [62]:

 
# Sort by the length of the book title, ascending
df.sort_values(by='book', key=lambda col: col.str.len())

Out[62]:

		book	skill	score	gender
id	name
2	令狐沖	笑傲江湖	獨孤九劍	96	male
5	花無缺	絕代雙驕	移花接玉	95	male
3	張無忌	倚天屠龍記	九陽神功	86	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
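sort_values also accepts several columns, each with its own direction. The following sketch uses an invented team/score table, not the hero data above:

```python
import pandas as pd

df = pd.DataFrame({"team": ["b", "a", "b", "a"],
                   "score": [90, 85, 95, 70]})

# Team ascending, then score descending within each team.
ranked = df.sort_values(by=["team", "score"], ascending=[True, False])
```

The original row labels travel with the rows, so ranked.index shows where each row came from.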

Arithmetic

In [63]:

 
              # score+1
df['score'].add(1)

Out[63]:

id  name
1   李尋歡      81
2   令狐沖      97
3   張無忌      87
4   郭靖      100
5   花無缺      96
Name: score, dtype: int64

In [64]:

 
              # score-1
df['score'].sub(1)

Out[64]:

id  name
1   李尋歡     79
2   令狐沖     95
3   張無忌     85
4   郭靖      98
5   花無缺     94
Name: score, dtype: int64

In [65]:

 
# Or use the operators + - * / // % directly
(df['score'] + 1) % 10

Out[65]:

id  name
1   李尋歡     1
2   令狐沖     7
3   張無忌     7
4   郭靖      0
5   花無缺     6
Name: score, dtype: int64

Logical operations

The data once more:

In [66]:

df

Out[66]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
3	張無忌	倚天屠龍記	九陽神功	86	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	95	male

Result of a comparison:

In [67]:

 
              df['score'] > 90

Out[67]:

id  name
1   李尋歡     False
2   令狐沖      True
3   張無忌     False
4   郭靖       True
5   花無缺      True
Name: score, dtype: bool

In [68]:

 
# Rows with a score above 90
df[df['score'] > 90]

Out[68]:

		book	skill	score	gender
id	name
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	95	male

In [69]:

 
# Rows with a score strictly between 85 and 90
df[(df['score'] > 85) & (df['score'] < 90)]

Out[69]:

		book	skill	score	gender
id	name
3	張無忌	倚天屠龍記	九陽神功	86	male

In [70]:

 
# Rows with a score below 85 or above 95
df[(df['score'] < 85) | (df['score'] > 95)]

Out[70]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male

The same filter can be written with query():

In [71]:

 
               df.query("score<85 | score>95")

Out[71]:

		book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
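Thresholds do not have to be hard-coded in the query string: prefixing a name with @ refers to a local Python variable. A small sketch with made-up variable names low and high, checked against the boolean-mask version:

```python
import pandas as pd

df = pd.DataFrame({"score": [80, 96, 86, 99, 95]})
low, high = 85, 95

# @low and @high are looked up in the surrounding Python scope.
outside = df.query("score < @low or score > @high")

# Equivalent boolean-mask form.
mask_version = df[(df["score"] < low) | (df["score"] > high)]
```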

In addition, isin(values) filters on a set of values, much like IN in SQL:

In [72]:

 
               df[df["score"].isin([99, 96])]

Out[72]:

		book	skill	score	gender
id	name
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male

Statistics

describe() returns several summary statistics at once: count, mean, std, min, max and the quartiles:

In [73]:

 
               df.describe()

Out[73]:

	score
count	5.000000
mean	91.200000
std	7.918333
min	80.000000
25%	86.000000
50%	95.000000
75%	96.000000
max	99.000000

In [74]:

 
# Aggregations: axis=0 gives one result per column, axis=1 one result per row
df.max(axis=0, numeric_only=True)

Out[74]:

score    99
dtype: int64

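The other common aggregation functions work the same way and are not shown one by one.

Cumulative functions deserve a closer look.

Function	Description
cumsum	running sum of the first 1, 2, 3, …, n values
cummax	running maximum of the first 1, 2, 3, …, n values
cummin	running minimum of the first 1, 2, 3, …, n values
cumprod	running product of the first 1, 2, 3, …, n values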
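The four functions in the table can be compared side by side on a tiny Series. The input values are arbitrary and chosen only so the results are easy to check by hand:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2])

running_sum = s.cumsum()    # 3, 4, 8, 10
running_max = s.cummax()    # 3, 3, 4, 4
running_min = s.cummin()    # 3, 1, 1, 1
running_prod = s.cumprod()  # 3, 3, 12, 24
```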

Revenue of each department of a company over the past six months:

In [75]:

 
               income = pd.DataFrame(data=np.random.randint(60, 100, (6, 5)),
                      columns=['group' + str(x) for x in range(1, 6)],
                      index=['Month' + str(x) for x in range(1, 7)])

income

Out[75]:

	group1	group2	group3	group4	group5
Month1	97	89	62	82	71
Month2	68	69	82	66	79
Month3	77	87	66	94	82
Month4	69	76	99	79	61
Month5	77	94	76	70	70
Month6	89	64	92	63	60

Cumulative revenue of group1 over the first N months:

In [76]:

 
              group1_income = income['group1']

group1_income.cumsum()

Out[76]:

Month1     97
Month2    165
Month3    242
Month4    311
Month5    388
Month6    477
Name: group1, dtype: int64

A plot makes this easier to see:

In [77]:

 
              group1_income.cumsum().plot(figsize=(8, 5))

plt.show()

Likewise, the running maximum of group1's monthly revenue:

In [78]:

 
              group1_income.cummax().plot(figsize=(8, 5))

plt.show()

Other operations

Revenue of the first three departments over the past six months:

In [79]:

 
               income[['group1', 'group2', 'group3']]

Out[79]:

	group1	group2	group3
Month1	97	89	62
Month2	68	69	82
Month3	77	87	66
Month4	69	76	99
Month5	77	94	76
Month6	89	64	92

In [80]:

 
# Range (max minus min) across the first three departments, per month
income[['group1', 'group2', 'group3']].apply(lambda x: x.max() - x.min(), axis=1)

Out[80]:

Month1    35
Month2    14
Month3    21
Month4    30
Month5    18
Month6    28
dtype: int64

In [81]:

 
# Range (max minus min) over the six months, per department
income[['group1', 'group2', 'group3']].apply(lambda x: x.max() - x.min(), axis=0)

Out[81]:

group1    29
group2    30
group3    37
dtype: int64

File I/O

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.to_csv(). Below is a table containing available readers and writers.

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	Fixed-Width Text File	read_fwf
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	ORC Format	read_orc
binary	Msgpack	read_msgpack	to_msgpack
binary	Stata	read_stata	to_stata
binary	SAS	read_sas
binary	SPSS	read_spss
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq

CSV

The CSV data from the following page is used for testing:
https://www.stats.govt.nz/large-datasets/csv-files-for-download/

In [82]:

 
               path = "https://www.stats.govt.nz/assets/Uploads/Employment-indicators/Employment-indicators-Weekly-as-at-24-May-2021/Download-data/Employment-indicators-weekly-paid-jobs-20-days-as-at-24-May-2021.csv"
# Read CSV
data = pd.read_csv(path, sep=',', usecols=['Week_end', 'High_industry', 'Value'])
data

Out[82]:

	Week_end	High_industry	Value
0	2019-05-05	Total	1828160.00
1	2019-05-05	A Primary	79880.00
2	2019-05-05	B Goods Producing	344320.00
3	2019-05-05	C Services	1389220.00
4	2019-05-05	Z No Match	14730.00
...	...	...	...
2095	2021-05-02	Total	700.94
2096	2021-05-02	A Primary	680.20
2097	2021-05-02	B Goods Producing	916.42
2098	2021-05-02	C Services	649.92
2099	2021-05-02	Z No Match	425.65

2100 rows × 3 columns

In [83]:

 
# Write CSV
data[:20].to_csv("./test.csv", columns=['Week_end', 'Value'],
                 header=True, index=False, mode='w')

The output file starts as follows:

Week_end,Value
2019-05-05,1828160.0
2019-05-05,79880.0
2019-05-05,344320.0
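To try the CSV round trip without network access or files on disk, an in-memory buffer works as well. This is only a sketch with two made-up rows in the same column layout; parse_dates is an existing read_csv parameter that converts Week_end to datetimes on the way back in.

```python
import io

import pandas as pd

data = pd.DataFrame({"Week_end": ["2019-05-05", "2019-05-12"],
                     "Value": [1828160.0, 79880.0]})

buffer = io.StringIO()
data.to_csv(buffer, index=False)  # write CSV text into the buffer
csv_text = buffer.getvalue()

buffer.seek(0)                    # rewind before reading
restored = pd.read_csv(buffer, parse_dates=["Week_end"])
```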

JSON

For more sample JSON data, search Google for site:api.androidhive.info

In [84]:

 
               path = "https://api.androidhive.info/json/movies.json"
# Read JSON
data = pd.read_json(path)
data = data.loc[:2, ['title', 'rating']]
data
 
              

Out[84]:

	title	rating
0	Dawn of the Planet of the Apes	8.3
1	District 9	8.0
2	Transformers: Age of Extinction	6.3

records

In [85]:

 
              data.to_json("./test.json", orient='records', lines=True)

Output:

{"title":"Dawn of the Planet of the Apes","rating":8.3}
{"title":"District 9","rating":8.0}
{"title":"Transformers: Age of Extinction","rating":6.3}

With lines=False:

In [86]:

 
              data.to_json("./test.json", orient='records', lines=False)

Output (pretty-printed here for readability):

[
    {
        "title":"Dawn of the Planet of the Apes",
        "rating":8.3
    },
    {
        "title":"District 9",
        "rating":8.0
    },
    {
        "title":"Transformers: Age of Extinction",
        "rating":6.3
    }
]

columns

In [87]:

 
              data.to_json("./test.json", orient='columns')

Output:

{
    "title":{
        "0":"Dawn of the Planet of the Apes",
        "1":"District 9",
        "2":"Transformers: Age of Extinction"
    },
    "rating":{
        "0":8.3,
        "1":8.0,
        "2":6.3
    }
}

index

In [88]:

 
              data.to_json("./test.json", orient='index')

Output:

{
    "0":{
        "title":"Dawn of the Planet of the Apes",
        "rating":8.3
    },
    "1":{
        "title":"District 9",
        "rating":8.0
    },
    "2":{
        "title":"Transformers: Age of Extinction",
        "rating":6.3
    }
}

split

In [89]:

 
              data.to_json("./test.json", orient='split')

Output:

{
    "columns":[
        "title",
        "rating"
    ],
    "index":[
        0,
        1,
        2
    ],
    "data":[
        [
            "Dawn of the Planet of the Apes",
            8.3
        ],
        [
            "District 9",
            8.0
        ],
        [
            "Transformers: Age of Extinction",
            6.3
        ]
    ]
}

values

In [90]:

 
              data.to_json("./test.json", orient='values')

Output:

[
    [
        "Dawn of the Planet of the Apes",
        8.3
    ],
    [
        "District 9",
        8.0
    ],
    [
        "Transformers: Age of Extinction",
        6.3
    ]
]
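Each orient can be read back with read_json as long as the same orient (and lines setting) is passed. The sketch below round-trips two of them through in-memory strings; the two-row frame is invented for the example.

```python
import io

import pandas as pd

data = pd.DataFrame({"title": ["District 9", "Up"], "rating": [8.0, 8.3]})

as_lines = data.to_json(orient="records", lines=True)  # one JSON object per line
as_split = data.to_json(orient="split")                # columns / index / data

from_lines = pd.read_json(io.StringIO(as_lines), orient="records", lines=True)
from_split = pd.read_json(io.StringIO(as_split), orient="split")
```

The records form does not store the index, so it suits data where the row labels carry no meaning; split keeps index and columns explicitly.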

Excel

In [91]:

 
# Read Excel
team = pd.read_excel('https://www.gairuo.com/file/data/dataset/team.xlsx')

team.head(5)

Out[91]:

	name	team	Q1	Q2	Q3	Q4
0	Liver	E	89	21	24	64
1	Arry	C	36	37	37	57
2	Ack	A	57	60	18	84
3	Eorge	C	93	96	71	78
4	Oah	D	65	49	61	86

In [92]:

 
# Append a column named sum holding Q1 + Q2 + Q3 + Q4
team['sum'] = team['Q1'] + team['Q2'] + team['Q3'] + team['Q4']

team.head(5)

Out[92]:

	name	team	Q1	Q2	Q3	Q4	sum
0	Liver	E	89	21	24	64	198
1	Arry	C	36	37	37	57	167
2	Ack	A	57	60	18	84	219
3	Eorge	C	93	96	71	78	338
4	Oah	D	65	49	61	86	261

In [93]:

 
# Write Excel
team.to_excel('test.xlsx', index=False)

HDF5

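HDF5 (Hierarchical Data Format version 5) is a storage format well suited to large volumes of numerical data.

Advantages:

HDF5 supports compression when storing data, which improves disk utilisation and saves space
HDF5 is cross-platform and can be migrated to Hadoop easily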

In [94]:

 
               score = pd.DataFrame(np.random.randint(50, 100, (100000, 10)))

score.head()

Out[94]:

	0	1	2	3	4	5	6	7	8	9
0	76	53	51	94	84	77	56	82	56	50
1	93	59	84	83	77	67	52	52	53	62
2	96	98	88	72	96	64	58	67	89	95
3	57	75	89	73	72	73	58	93	72	92
4	50	50	52	57	72	76	78	52	90	93

In [95]:

 
# Write HDF5
score.to_hdf("./score.h5", key="score", complevel=9, mode='w')

In [96]:

 
# Read HDF5
new_score = pd.read_hdf("./score.h5", key="score")

new_score.head()

Out[96]:

	0	1	2	3	4	5	6	7	8	9
0	76	53	51	94	84	77	56	82	56	50
1	93	59	84	83	77	67	52	52	53	62
2	96	98	88	72	96	64	58	67	89	95
3	57	75	89	73	72	73	58	93	72	92
4	50	50	52	57	72	76	78	52	90	93

Note: reading and writing an HDF5 file requires a key, which identifies the stored DataFrame inside the file.

In [97]:

 
              score.to_csv("./score.csv", mode='w')

For comparison, the disk space used by the HDF5 file and the CSV file:

-rw-r--r--  1 wind  staff   3.4M  6 13 10:24 score.csv
-rw-r--r--  1 wind  staff   763K  6 13 10:23 score.h5

Other file formats are read and written in much the same way and are not covered here.

Plotting

For guidance on when to use each chart type, see:

https://antv-2018.alipay.com/zh-cn/vis/chart/index.html

http://tuzhidian.com/

pandas DataFrame and Series wrap matplotlib in a simple plot method, which makes it quick and convenient to visualise data while processing it.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html

Line chart (line)

A line chart shows how something changes over time or across ordered categories.

As a quick start, plot sin(x) over one period (0, 2π):

In [98]:

 
# Font that can render Chinese text
plt.rc('font', family='Arial Unicode MS')
plt.rc('axes', unicode_minus=False)

x = np.arange(0, 6.29, 0.01)
y = np.sin(x)

s = pd.Series(y, index=x)
s.plot(kind='line', title='sin(x)圖像',
       style='--g', grid=True, figsize=(8, 6))

plt.show()
 
             

Note: if Chinese text does not render correctly, the following code lists the fonts available on the system.

In [99]:

 
              from matplotlib.font_manager import FontManager
# fonts = set([x.name for x in FontManager().ttflist])
# print(fonts)

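To plot two columns of a DataFrame against each other, pass x='col1' and y='col2'; the result is a line chart with 'col1' on the x-axis and 'col2' on the y-axis. If x is omitted, the index is used for the x-axis. For example: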

In [100]:

 
              df = pd.DataFrame({'x': x, 'y': y})
df.plot(x='x', y='y', kind='line', ylabel='y=sin(x)',
        style='--g', grid=True, figsize=(8, 6))

plt.show()

Now a more realistic example: midday temperatures in Beijing, Shanghai and Hefei on one day.

In [101]:

 
               x = [f'12點{i}分' for i in range(60)]

y_beijing = np.random.uniform(18, 23, len(x))
y_shanghai = np.random.uniform(23, 26, len(x))
y_hefei = np.random.uniform(21, 28, len(x))

df = pd.DataFrame({'x': x, 'beijing': y_beijing, 'shanghai': y_shanghai, 'hefei': y_hefei})

df.head(5)

Out[101]:

	x	beijing	shanghai	hefei
0	12點0分	21.516230	24.627902	25.232748
1	12點1分	22.601501	25.982331	26.403320
2	12點2分	19.739455	23.602787	25.569943
3	12點3分	21.741182	25.102164	24.400619
4	12點4分	19.888968	23.995114	22.879671

Temperature over time for the three cities:

In [102]:

 
              df.plot(x='x', y=['beijing', 'shanghai', 'hefei'], kind='line',
        figsize=(12, 6), xlabel='時間', ylabel='溫度')
plt.show()

Passing subplots=True draws each column in its own subplot:

In [103]:

 
              df.plot(x='x', y=['beijing', 'shanghai', 'hefei'], kind='line',
        subplots=True, figsize=(12, 6), xlabel='時間', ylabel='溫度')
plt.show()

In addition, layout=(m, n) sets the number of subplot rows and columns; m*n must be at least the number of subplots:

In [104]:

 
              df.plot(x='x', y=['beijing', 'shanghai', 'hefei'], kind='line',
        subplots=True, layout=(2, 2), figsize=(12, 6), xlabel='時間', ylabel='溫度')
plt.show()

Bar chart (bar)

Bar charts are best suited to comparing categorical data.

At a martial arts tournament each hero fights 10 rounds. The wins, draws and losses are:

In [105]:

 
               columns = ['hero', 'wins', 'draws', 'losses']

score = [
    ['李尋歡', 6, 1, 3],
    ['令狐沖', 5, 4, 1],
    ['張無忌', 5, 3, 2],
    ['郭靖', 4, 5, 1],
    ['花無缺', 5, 2, 3]
]

df = pd.DataFrame(score, columns=columns)

df
 
              

Out[105]:

	hero	wins	draws	losses
0	李尋歡	6	1	3
1	令狐沖	5	4	1
2	張無忌	5	3	2
3	郭靖	4	5	1
4	花無缺	5	2	3

In [106]:

 
              df.plot(kind='bar', x='hero', y=['wins', 'draws', 'losses'],
        rot=0, title='柱狀圖', xlabel='', figsize=(10, 6))
plt.show()

Passing stacked=True draws a stacked bar chart:

In [107]:

 
              df.plot(kind='bar', x='hero', y=['wins', 'draws', 'losses'],
        stacked=True, rot=0, title='堆疊柱狀圖', xlabel='', figsize=(10, 6))
plt.show()

plot(kind='barh') or plot.barh() draws a horizontal bar chart. Below, rot sets the rotation of the tick labels, alpha the transparency and align the bar alignment.

In [108]:

 
              df.plot(kind='barh', x='hero', y=['wins'], xlabel='',
        align='center', alpha=0.8, rot=0, title='水平柱狀圖', figsize=(10, 6))
plt.show()

Again, subplots=True produces one subplot per column:

In [109]:

 
              df.plot(kind='bar', x='hero', y=['wins', 'draws', 'losses'],
        subplots=True, rot=0, xlabel='', figsize=(10, 8))
plt.show()

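Pie chart (pie)

The main strength of a pie chart is showing proportions.

Whenever an organisation or platform publishes a programming language ranking or market share figures, many people in the industry naturally take a look.

Here is the share of backend languages used by the R&D department of an internet company: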

In [110]:

 
              colors = ['#FF6600', '#0099FF', '#FFFF00', '#FF0066', '#339900']
language = ['Java', 'Python', 'Golang', 'Scala', 'Others']

s = pd.Series([0.5, 0.3, 0.1, 0.06, 0.04], name='language',
              index=language)

s.plot(kind='pie', figsize=(7, 7),
       autopct="%.0f%%", colors=colors)
plt.show()

Scatter plot (scatter)

Scatter plots help to judge whether variables are related or correlated.

First generate some random data:

In [111]:

 
               x = np.random.uniform(0, 100, 100)
y = [2*n for n in x]

df = pd.DataFrame({'x': x, 'y': y})
df.head(5)

Out[111]:

	x	y
0	10.405567	20.811134
1	24.520765	49.041530
2	32.735258	65.470516
3	90.868823	181.737646
4	21.875188	43.750377

The scatter plot:

In [112]:

 
              df.plot(kind='scatter', x='x', y='y', figsize=(12, 6))
plt.show()

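The plot shows a positive correlation between y and x. Here the points fall exactly on a straight line because y was generated as 2x.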
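The visual impression can be backed up with a number. A sketch using Series.corr, which computes the Pearson correlation coefficient by default; the seeded generator is an assumption added here so the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
x = rng.uniform(0, 100, 100)
df = pd.DataFrame({"x": x, "y": 2 * x})

r = df["x"].corr(df["y"])       # Pearson correlation coefficient
```

For an exact linear relationship with positive slope, r is 1 up to floating-point error. Real data would land somewhere between -1 and 1.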

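Histogram (hist)

A histogram shows the distribution of data. The horizontal axis usually shows the value intervals and the vertical axis the frequency; the taller a bar, the more values fall into that interval.

To build a histogram, first choose the bin width and split the value range into intervals, which fixes the number of bars (for example, 0 to 100 points in steps of 20 gives 5 intervals). Then count the values that fall into each interval (say 10 people in 80-100, 20 people in 60-80, and so on). Finally draw the rectangles, with heights given by the counts.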
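The counting step can be reproduced without drawing anything. A sketch with ten made-up scores and the five 20-point intervals from the example; np.histogram returns the count per bin together with the bin edges.

```python
import numpy as np

scores = np.array([15, 35, 55, 62, 68, 75, 81, 88, 95, 100])
edges = [0, 20, 40, 60, 80, 100]  # 5 bins, each 20 points wide

# Each bin is half-open [left, right), except the last, which includes 100.
counts, bin_edges = np.histogram(scores, bins=edges)
```

The bar heights in a histogram of these scores would be exactly the values in counts.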

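Note: a histogram is not a bar chart; it is not meant for comparing discrete categories.

The earlier NumPy article briefly covered the normal distribution and plotted its histogram. Here is how to draw a histogram with pandas:

Suppose the height of adult men in some region is approximately normally distributed. The following generates 100000 samples from a normal distribution with mean 170 and standard deviation 5.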

In [113]:

 
               height = np.random.normal(170, 5, 100000)

df = pd.DataFrame({'height': height})
df.head(5)

Out[113]:

	height
0	166.547946
1	166.847060
2	166.887866
3	175.607073
4	181.527058

The histogram, with 100 bins:

In [114]:

 
              df.plot(kind='hist', bins=100, figsize=(10, 5))

plt.grid(True, linestyle='--', alpha=0.8)
plt.show()

The plot shows that most heights cluster around 170.

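Box plot (box)

Box plots are mostly used for numerical summaries. They take up little canvas space, which makes them well suited to comparing the distributions of several groups. A box plot quickly reveals key statistics such as the maximum, minimum, median and the upper and lower quartiles.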
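The statistics a box plot draws can also be computed directly. The sketch below uses nine made-up values; the 1.5 × IQR fences follow the common convention (and, as far as I know, matplotlib's default whisker setting), under which points beyond the fences are drawn as outliers instead of being covered by the whiskers.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1, median, q3 = s.quantile([0.25, 0.5, 0.75])  # box edges and centre line
iqr = q3 - q1                                   # interquartile range

# Conventional fences: whiskers stop at the last data point inside them.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
```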

The final exam scores of a class of 30 students in Chinese, maths, English, physics and chemistry:

In [115]:

 
               count = 30

chinese_score = np.random.normal(80, 10, count)
maths_score = np.random.normal(85, 20, count)
english_score = np.random.normal(70, 25, count)
physics_score = np.random.normal(65, 30, count)
chemistry_score = np.random.normal(75, 5, count)

scores = pd.DataFrame({'Chinese': chinese_score, 'Maths': maths_score, 'English': english_score,
                       'Physics': physics_score, 'Chemistry': chemistry_score})

scores.head(5)
 
              

Out[115]:

	Chinese	Maths	English	Physics	Chemistry
0	81.985202	77.139479	78.881483	98.688823	75.849040
1	71.597557	69.960533	98.784664	72.140422	73.179419
2	80.581071	75.501743	99.491803	18.579709	67.963091
3	82.396994	113.018430	83.224544	38.406359	75.713590
4	83.153492	103.378598	59.399535	78.381277	77.020693

The box plot:

In [116]:

 
              scores.plot(kind='box', figsize=(10, 6))

plt.grid(True, linestyle='--', alpha=1)
plt.show()

Passing vert=False draws the boxes horizontally:

In [117]:

 
              scores.plot(kind='box', vert=False, figsize=(10, 5))

plt.grid(True, linestyle='--', alpha=1)
plt.show()

Passing subplots=True produces one subplot per column:

In [118]:

 
              scores.plot(kind='box', y=['Chinese', 'Maths', 'English', 'Physics'],
            subplots=True, layout=(2, 2), figsize=(10, 8))

plt.show()

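Area chart (area)

An area chart shows how values change along an ordered variable, much like a line chart. Its distinguishing feature is that the region between the line and the independent-variable axis is filled with colour or texture.

When to use it:

Trends of one or more series over a continuous independent variable and comparisons between them, together with the trend of the overall total.

For example, displacement = speed × time, that is s = v*t. With time t on the x-axis and the speed v at each moment on the y-axis, an area chart shows how the speed changes over time and, through the size of the filled area, how the distance covered grows.

On a straight section of Mount Akina, the speeds of an AE86 and a GC8 over 60 seconds are as follows: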

In [119]:

 
               t = list(range(60))

v_AE86 = np.random.uniform(180, 210, len(t))
v_GC8 = np.random.uniform(190, 230, len(t))

v = pd.DataFrame({'t': t, 'AE86': v_AE86, 'GC8': v_GC8})

v.head(5)

Out[119]:

	t	AE86	GC8
0	0	198.183060	215.830409
1	1	190.186453	195.316343
2	2	180.464073	210.641824
3	3	194.842767	219.794681
4	4	194.620050	204.215492

Area charts are stacked by default.

In [120]:

 
              v.plot(kind='area', x='t', y=['AE86', 'GC8'], figsize=(10, 5))
plt.show()

Pass stacked=False for an unstacked chart:

In [121]:

 
              v.plot(kind='area', x='t', y=['AE86', 'GC8'],
       stacked=False, figsize=(10, 5), alpha=0.4)
plt.show()

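Judging by the area between each curve and the x-axis, the GC8 is slightly ahead of the AE86 over these 60 seconds. A friendly reminder: drive responsibly and do not race!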
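The area under a speed curve can also be estimated numerically instead of by eye. A minimal sketch using the trapezoidal rule on five made-up samples; the units (seconds and metres per second) are an assumption for the example, since the article does not state them.

```python
import numpy as np

t = np.arange(0, 5)                           # time samples in seconds (assumed)
v = np.array([10.0, 12.0, 14.0, 12.0, 10.0])  # speed in m/s (hypothetical)

# Trapezoidal rule: average speed over each interval times its duration.
distance = float(np.sum((v[:-1] + v[1:]) / 2 * np.diff(t)))
```

Applying the same calculation to both cars' speed columns would give two distances that can be compared directly.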

Again, subplots=True produces one subplot per column:

In [122]:

 
              v.plot(kind='area', x='t', y=['AE86', 'GC8'],
       subplots=True, figsize=(10, 5), rot=0)
plt.show()

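Summary

This is the first part of a quick introduction to pandas. It began with why pandas is worth using, then covered the common data structures Series and DataFrame together with a range of operations and computations, then tried reading and writing files with pandas, and finally showed how to visualise data quickly and conveniently with pandas.

The second part will cover data cleaning, merging, grouping and aggregation, pivoting and text processing with pandas.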
