Common Methods of the pandas DataFrame Data Structure

Reposted article, 2019-04-24 12:02. Tags: pandas, python

The examples in this summary come from the housing price prediction project in Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow.

import pandas as pd

housing = pd.read_csv("housing.csv")

The head() method

Shows the first 5 rows of the dataset.

print(housing.head())

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY
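head() also accepts a row count, and tail() is its counterpart for the end of the table. The following is a minimal sketch on a small made-up DataFrame (the data is hypothetical, not the housing dataset):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate head() and tail()
df = pd.DataFrame({"age": [12, 13, 14, 15, 16, 19],
                   "height": [155, 160, 178, 142, 190, 183]})

first_two = df.head(2)  # first 2 rows instead of the default 5
last_two = df.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```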

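The info() method

Gives a quick description of the data: the total number of rows, each attribute's type, and its number of non-null values. info() prints the report itself and returns None, so it is called directly rather than wrapped in print().

housing.info()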
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
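In the report above, total_bedrooms has 20433 non-null values out of 20640 rows, so 207 values are missing. One way to get the missing count per column directly is isna().sum(). This is a sketch on hypothetical toy data with NaN values placed by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four total_bedrooms values are missing
df = pd.DataFrame({"total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
                   "total_bedrooms": [129.0, np.nan, 190.0, np.nan]})

# isna() marks missing cells as True; sum() counts them column by column
missing_per_column = df.isna().sum()
print(missing_per_column)
```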

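The value_counts() method

Shows which categories a column contains and how many rows belong to each. Applied to the ocean_proximity column, it shows 5 categories in total; <1H OCEAN is the largest with 9136 rows, followed by INLAND with 6551.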

print(housing["ocean_proximity"].value_counts())
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
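If proportions are more useful than raw counts, value_counts() takes normalize=True. A small sketch, using a hypothetical hand-written Series rather than the real column:

```python
import pandas as pd

# Hypothetical sample of category labels
s = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND", "INLAND", "NEAR BAY"],
              name="ocean_proximity")

counts = s.value_counts()                # rows per category, most frequent first
shares = s.value_counts(normalize=True)  # same categories as fractions of the total

print(counts)
print(shares)
```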

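The describe() method

Shows a summary of the numerical attributes. count is the number of non-null values (which is why total_bedrooms shows 20433 rather than 20640); mean, min, and max are the average, minimum, and maximum; std is the standard deviation, which measures how dispersed the values are; 25%, 50%, and 75% are the corresponding percentiles. For example, 25% of the districts have a housing_median_age below 18, and 50% are below 29.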

print(housing.describe())
          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
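The percentile rows of describe() can be cross-checked against quantile(). A minimal sketch on a hypothetical five-value Series:

```python
import pandas as pd

# Hypothetical values, chosen so the percentiles are easy to verify by hand
ages = pd.Series([10, 20, 30, 40, 50], name="housing_median_age")

summary = ages.describe()    # count, mean, std, min, 25%, 50%, 75%, max
q25 = ages.quantile(0.25)    # the same value as the "25%" row
median = ages.quantile(0.5)  # the same value as the "50%" row

print(summary)
print(q25, median)
```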

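The reset_index() method

Generates a new index. When DataFrames are concatenated, each row keeps its original index label, so labels can repeat. reset_index() produces a fresh sequential index, and the old index becomes a column named index. If the drop parameter is set to True, the old index is discarded instead of being kept as a column.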

import pandas as pd
rsi1 = pd.DataFrame({'age':[12,13,14,15,16], 'height':[155,160,178,142,190]})
rsi2 = pd.DataFrame({'age':[19], 'height':[183]})
rsi = [rsi1, rsi2]
result1 = pd.concat(rsi)
print(result1)
   age  height
0   12     155
1   13     160
2   14     178
3   15     142
4   16     190
0   19     183


result2 = result1.reset_index()
print(result2)
   index  age  height
0      0   12     155
1      1   13     160
2      2   14     178
3      3   15     142
4      4   16     190
5      0   19     183


result3 = result1.reset_index(drop=True)
print(result3)
   age  height
0   12     155
1   13     160
2   14     178
3   15     142
4   16     190
5   19     183
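When the old labels are not needed at all, pd.concat() can renumber the rows in one step with ignore_index=True. This sketch reuses the two small frames from the example above and is offered as an alternative, not as the approach the original example takes:

```python
import pandas as pd

rsi1 = pd.DataFrame({'age': [12, 13, 14, 15, 16], 'height': [155, 160, 178, 142, 190]})
rsi2 = pd.DataFrame({'age': [19], 'height': [183]})

# ignore_index=True discards the original labels and numbers the rows 0..n-1
combined = pd.concat([rsi1, rsi2], ignore_index=True)
print(combined)
```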

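Finally, a note on visualizing the data. Plotting relies on the matplotlib library; histograms are drawn with the hist() method, which pandas provides on DataFrame and which uses matplotlib underneath. bins is the number of bins (bars along the x axis) in each histogram, and figsize is the figure size in inches.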

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,15))
plt.show()
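To see what bins controls, the sketch below draws a histogram of a single hypothetical column and counts the bars. It assumes matplotlib is installed and selects the non-interactive Agg backend so it can run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no display is available
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy column
df = pd.DataFrame({"age": [12, 13, 14, 15, 16, 19, 21, 25, 30, 41]})

# DataFrame.hist returns an array of matplotlib Axes, one per numeric column
axes = df.hist(bins=5, figsize=(6, 4))
ax = axes.ravel()[0]

n_bars = len(ax.patches)  # one rectangle per bin
print(n_bars)
plt.close("all")
```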
