4-Pandas data preprocessing: sorting (df.sort_index(), df.sort_values(), random reordering, random sampling)

Reposted article, 2020-07-30 16:46, category: data analysis

Sorting is a common operation on indexed data and one of Pandas' important built-in operations. It mainly covers the following 3 methods:

Sorting method	Description
sort_values()	Sort by the values of a column
sort_index()	Sort by the index
Random reordering	See the random reordering section below

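This section uses part of a COVID-19 dataset as its example: 5 columns are read from "today_world_2020_04_18.csv", namely country name, last update time, cumulative confirmed cases, cumulative recoveries, and cumulative deaths.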

1. sort_values()

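Note: by default sort_values() sorts in ascending order; ascending=False sorts in descending order.

sort_values() also handles missing values. By default they are placed last, whichever direction is used, but setting na_position='first' places them first.

Example (the column 'avg_cur_bal' is not one of the columns loaded below; substitute a column from your own data): >>> df.sort_values('avg_cur_bal', ascending=True, na_position='first')[:5]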
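The effect of ascending and na_position is easier to see on a tiny table. The following is a minimal sketch on made-up data; the DataFrame `toy` and its `score` column are hypothetical and are not part of the COVID-19 dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (illustration only).
toy = pd.DataFrame({"name": ["a", "b", "c", "d"],
                    "score": [3.0, np.nan, 1.0, 2.0]})

# Default: ascending order, NaN placed last.
default_sorted = toy.sort_values("score")

# Descending order; NaN still goes last unless na_position says otherwise.
desc_sorted = toy.sort_values("score", ascending=False)

# Ascending order with NaN moved to the front.
nan_first = toy.sort_values("score", ascending=True, na_position="first")

print(default_sorted["name"].tolist())  # ['c', 'd', 'a', 'b']
print(desc_sorted["name"].tolist())     # ['a', 'd', 'c', 'b']
print(nan_first["name"].tolist())       # ['b', 'c', 'd', 'a']
```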

import pandas as pd
import numpy as np
df = pd.read_csv('.input/today_world_2020_04_18.csv',
                 usecols=['name','lastUpdateTime','total_confirm','total_dead','total_heal'],encoding='gbk')
# Use sort_values() to sort by cumulative confirmed cases in descending order and list the top 10 countries
df.sort_values('total_confirm',ascending=False)[:10]

    name       lastUpdateTime  total_confirm  total_dead  total_heal
9     美國  2020-04-17 15:01:45         677146       34641       56159
168  西班牙  2020-04-18 00:00:31         188068       19478       74797
160  意大利  2020-04-18 03:16:25         172434       22745       40164
155   法國  2020-04-18 07:35:22         147969       18681       34420
8     德國  2020-04-18 07:23:10         140886        4326       83114
159   英國  2020-04-18 00:00:31         108692       14576         622
2     中國  2020-04-18 08:03:46          84176        4642       77723
14    伊朗  2020-04-18 00:00:31          79494        4958       54064
143  土耳其  2020-04-18 03:38:28          78546        1769        8631
178  比利時  2020-04-18 00:00:31          36138        5163        7961
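In the output above, the leftmost column still shows the original row labels (9, 168, 160, ...), because sort_values() returns a new, reordered DataFrame and each row keeps its label. The sketch below illustrates this on hypothetical toy data and shows one possible way to renumber the rows afterwards with reset_index(drop=True).

```python
import pandas as pd

# Hypothetical toy data (illustration only).
toy = pd.DataFrame({"name": ["a", "b", "c"],
                    "total_confirm": [10, 300, 20]})

# Sorting returns a new DataFrame; row labels travel with their rows.
ranked = toy.sort_values("total_confirm", ascending=False)
print(ranked.index.tolist())      # [1, 2, 0]

# Optionally discard the old labels and number the rows 0..n-1 again.
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# The original DataFrame is left unchanged.
print(toy["total_confirm"].tolist())  # [10, 300, 20]
```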

2. sort_index()

>>> df.sort_index(ascending=False)[:5]
      name       lastUpdateTime  total_confirm  total_dead  total_heal
203  馬達加斯加  2020-04-18 07:57:30            117           0          33
202  列支敦士登  2020-04-12 00:00:31             79           1          55
201     阿曼  2020-04-18 03:28:30           1069           6         176
200   羅馬尼亞  2020-04-18 03:52:56           8067         411        1508
199   格恩西島  2020-03-27 11:33:37              1           0           0
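A typical use of sort_index() is to restore the original row order after sorting by value. The following is a minimal sketch on hypothetical toy data, assuming the default integer index.

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2.
toy = pd.DataFrame({"value": [30, 10, 20]})

# Sorting by value reorders the index labels to 1, 2, 0.
by_value = toy.sort_values("value")

# sort_index() (ascending by default) brings the rows back to 0, 1, 2.
restored = by_value.sort_index()

# ascending=False lists the rows from the largest label to the smallest.
reversed_rows = toy.sort_index(ascending=False)

print(by_value.index.tolist())       # [1, 2, 0]
print(restored.index.tolist())       # [0, 1, 2]
print(reversed_rows.index.tolist())  # [2, 1, 0]
```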

3. Random reordering

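sort_values() and sort_index() can only arrange a DataFrame in ascending or descending order. To shuffle the rows into a random order (random reordering), proceed as follows:

Step 1: use numpy.random.permutation() to generate a randomly ordered integer array. [Note: numpy.random.permutation randomly permutes a sequence; given an integer n, it returns a shuffled copy of np.arange(n), which can serve as a set of row positions.]

Step 2: use .iloc[] or take() to obtain the reordered Pandas object.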

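# Step 1: generate a random sequence of positions (5 is used here for brevity; use len(df) to shuffle every row)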
>>> import numpy as np
>>> sampler = np.random.permutation(5)
>>> sampler
array([1, 2, 3, 4, 0])

# Step 2: use the random sequence from step 1 as positions and select those rows
# Using take()
>>> df.take(sampler)
   name       lastUpdateTime  total_confirm  total_dead  total_heal
1  塞爾維亞  2020-04-18 00:00:31           5690         110         534
2    中國  2020-04-18 08:03:46          84176        4642       77723
3    日本  2020-04-18 00:00:31          10535         210        1657
4    泰國  2020-04-18 00:00:31           2700          47        1689
0   突尼斯  2020-04-18 08:09:13            864          37          43

# Using iloc
>>> df.iloc[sampler]
   name       lastUpdateTime  total_confirm  total_dead  total_heal
1  塞爾維亞  2020-04-18 00:00:31           5690         110         534
2    中國  2020-04-18 08:03:46          84176        4642       77723
3    日本  2020-04-18 00:00:31          10535         210        1657
4    泰國  2020-04-18 00:00:31           2700          47        1689
0   突尼斯  2020-04-18 08:09:13            864          37          43
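The example above permutes only the first 5 rows. To shuffle an entire DataFrame, the permutation needs to cover len(df) positions. The sketch below assumes hypothetical toy data and uses NumPy's seeded Generator (np.random.default_rng) instead of the legacy np.random.permutation so that the result is repeatable; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustration only).
toy = pd.DataFrame({"name": list("abcdef"), "value": range(6)})

# Step 1: a random permutation of all row positions 0..len(toy)-1.
rng = np.random.default_rng(42)  # arbitrary seed, for repeatability
order = rng.permutation(len(toy))

# Step 2: select rows by position; take() and iloc give the same result.
shuffled_take = toy.take(order)
shuffled_iloc = toy.iloc[order]

print(shuffled_take.equals(shuffled_iloc))  # True
```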

4. Random sampling

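Use sample() for random sampling. The sample size is set with the parameter n or frac (the two cannot be combined): n draws the given number of items, and frac draws the given fraction of the items.

Note:

    (1) sample() samples without replacement by default (each row can appear at most once); set replace=True to sample with replacement:

        Example: >>> df.sample(n=5, replace=True)

    (2) To reproduce the result of a sampling run, pass the same value for random_state. The value itself has no meaning; using the same value simply makes the two calls return the same sample:

        Example: >>> df.sample(n=5, random_state=1)

    (3) sample() can also sample columns at random by setting axis=1:

        Example: >>> df.sample(n=2, axis=1)[:5]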
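The options described above can be tried together on a small table. This is a minimal sketch on hypothetical toy data; the column names and the random_state value are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data: 10 rows, 3 columns (illustration only).
toy = pd.DataFrame({"a": range(10),
                    "b": range(10, 20),
                    "c": range(20, 30)})

# n: draw a fixed number of rows; frac: draw a fraction of the rows.
by_count = toy.sample(n=3, random_state=1)
by_fraction = toy.sample(frac=0.3, random_state=1)  # 30% of 10 rows

# The same random_state reproduces the same sample.
repeat = toy.sample(n=3, random_state=1)

# With replacement, rows may repeat, so n may exceed the number of rows.
with_replacement = toy.sample(n=15, replace=True, random_state=1)

# axis=1 samples columns instead of rows.
two_columns = toy.sample(n=2, axis=1, random_state=1)

print(by_count.equals(repeat))  # True
print(len(by_fraction))         # 3
print(two_columns.shape)        # (10, 2)
```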

>>> df.sample(3)
        name       lastUpdateTime  total_confirm  total_dead  total_heal
163       芬蘭  2020-04-18 02:51:06           3489          82        1700
144  巴布亞新幾內亞  2020-04-08 00:00:31              2           0           0
94       索馬里  2020-04-16 07:33:23             80           5           2

sample() can also be used to shuffle all the rows:

>>> df.sample(len(df))[:5]
       name       lastUpdateTime  total_confirm  total_dead  total_heal
99       馬里  2020-04-18 03:33:59            190          13          34
12    聖巴泰勒米  2020-03-27 11:18:38              3           0           0
117  吉爾吉斯斯坦  2020-04-18 00:00:31            489           5         114
88     斯威士蘭  2020-04-18 07:55:57             19           1           8
178     比利時  2020-04-18 00:00:31          36138        5163        7961
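An equivalent way to write the same shuffle is sample(frac=1), which draws 100% of the rows without replacement. The sketch below uses hypothetical toy data and an arbitrary random_state; the reset_index(drop=True) step is optional and only renumbers the shuffled rows.

```python
import pandas as pd

# Hypothetical toy data (illustration only).
toy = pd.DataFrame({"name": list("abcde"), "value": range(5)})

# frac=1 returns every row exactly once, in random order.
shuffled = toy.sample(frac=1, random_state=0)

# Optionally drop the old labels so the rows are numbered 0..n-1 again.
shuffled_clean = shuffled.reset_index(drop=True)

print(len(shuffled) == len(toy))      # True
print(shuffled_clean.index.tolist())  # [0, 1, 2, 3, 4]
```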
