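shape
Returns the number of rows and columns of an array or DataFrame as a tuple. Note that shape is an attribute, not a method, so it is written without parentheses.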
import numpy as np
x = np.array([[1,2,5],[2,3,5],[3,4,5],[2,3,6]])
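# print the number of rows and columns of the array
print(x.shape)  # result: (4, 3)
# print only the number of rows
print(x.shape[0])  # result: 4
# print only the number of columns
print(x.shape[1])  # result: 3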
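So it can be used to iterate over rows or columns
# compute the mean of each column (the list brackets matter: np.array() on a bare generator does not build a numeric array)
ex = np.array([np.mean(x[:, i]) for i in range(x.shape[1])])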
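The column loop can also be handed to NumPy directly. A minimal sketch, reusing the array x from above, that checks the loop against x.mean(axis=0) and shows shape on a DataFrame (the DataFrame df and its column names a, b, c are made up for illustration):

```python
import numpy as np
import pandas as pd

x = np.array([[1, 2, 5], [2, 3, 5], [3, 4, 5], [2, 3, 6]])

# Column means via an explicit loop over the x.shape[1] columns
loop_means = np.array([np.mean(x[:, i]) for i in range(x.shape[1])])

# Same result, letting NumPy reduce along axis 0 (down the rows)
axis_means = x.mean(axis=0)
print(axis_means)

# shape works the same way on a DataFrame (hypothetical column names)
df = pd.DataFrame(x, columns=["a", "b", "c"])
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```

For plain column means the axis=0 form is usually the shorter choice; the loop is still handy when each column needs different handling.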
reshape()
reshape() is a method of NumPy arrays; it reorganizes the same data into a new shape
a = np.array([[1,2,3,4],[5,6,7,8]])  # 2-D array
print(a.shape[0])  # 2: the outer array has 2 elements, and each of them is itself an array (a row)
print(a.shape[1])  # 4: each inner array has 4 elements
b = np.array([1,2,3,4,5,6,7,8])
b = b.reshape(2,4)  # reshape returns a new array and leaves b unchanged, so assign the result
print(b)
# [[1 2 3 4]
#  [5 6 7 8]]
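A small sketch of the two things that are easy to trip over with reshape(): it does not change the original array, and one dimension can be given as -1 so NumPy works it out from the total size. The names c and d are just for illustration.

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# reshape returns a new 2x4 array; b itself keeps its 1-D shape
c = b.reshape(2, 4)
print(b.shape)  # (8,)
print(c.shape)  # (2, 4)

# -1 asks NumPy to infer that dimension: 8 elements / 2 columns = 4 rows
d = b.reshape(-1, 2)
print(d.shape)  # (4, 2)
```

The total number of elements has to stay the same, so something like b.reshape(3, 3) raises a ValueError here.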
pd.DataFrame.columns
Returns the column labels of the DataFrame (as an Index)
pd.DataFrame.columns.values
Returns the column labels as a NumPy array
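To see the difference between the two, here is a minimal sketch on a tiny made-up DataFrame (the data and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-column DataFrame
df = pd.DataFrame({"HP": [45, 60], "Attack": [49, 62]})

cols = df.columns               # pandas Index holding the labels
print(cols)

col_array = df.columns.values   # the same labels as a NumPy array
print(col_array)

col_list = df.columns.tolist()  # the same labels as a plain Python list
print(col_list)
```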
[[]]
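I once wanted to pull out two columns, haha, and was stuck for ages until I looked at a demo a classmate shared
Just write df[["col_a", "col_b"]], where df is the DataFrame and the inner list holds the column names
I clearly haven't seen enough code yet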
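What the double brackets actually do, as a small sketch on made-up data: the outer [] is the indexing operation, the inner [] is an ordinary Python list of column names. Single brackets with one label give a Series, a list of labels gives a DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
})

one_col = df["run1"]              # one label: a Series
two_cols = df[["run1", "run2"]]   # list of labels: a DataFrame
one_col_frame = df[["run1"]]      # one-element list: still a DataFrame

print(type(one_col).__name__)        # Series
print(type(two_cols).__name__)       # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```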
_
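This is just a common naming convention: _ is a throwaway name for a value you do not plan to use.
In the example it receives the second return value of kmeans, i.e. the loss (called the distortion, if this is scipy.cluster.vq.kmeans)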
# Create centroids with kmeans for 2 clusters
cluster_centers,_ = kmeans(fifa[scaled_features], 2)
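The same pattern without any clustering library, as a minimal sketch. The function fit below is a made-up stand-in that returns a (center, loss) pair; it is not the real kmeans.

```python
def fit(points):
    """Hypothetical stand-in for a function that returns (center, loss)."""
    center = sum(points) / len(points)
    loss = sum(abs(p - center) for p in points) / len(points)
    return center, loss

# Keep the first return value, discard the second by binding it to _
center, _ = fit([1.0, 2.0, 3.0])
print(center)  # 2.0

# _ is also common as a loop variable that is never read
repeated = [0 for _ in range(3)]
print(repeated)  # [0, 0, 0]
```

Python does not treat _ specially here; it is an ordinary variable name that readers understand as "ignored".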
unique()
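De-duplication function. On a pandas Series it returns the distinct values in order of first appearance; np.unique flattens the array by default and returns the sorted distinct values (pass axis=0 to de-duplicate whole rows)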
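A short sketch of the pandas and NumPy behaviours side by side, on made-up data:

```python
import numpy as np
import pandas as pd

# pandas: distinct values of a Series, in order of first appearance
s = pd.Series(["Grass", "Fire", "Grass", "Water", "Fire"])
series_unique = list(s.unique())
print(series_unique)  # ['Grass', 'Fire', 'Water']

arr = np.array([[1, 2], [3, 4], [1, 2]])

# NumPy default: the array is flattened, result is sorted
flat_unique = np.unique(arr)
print(flat_unique)  # [1 2 3 4]

# NumPy with axis=0: duplicate rows are removed instead
row_unique = np.unique(arr, axis=0)
print(row_unique)
```

To drop duplicate rows of a DataFrame, the usual tool is drop_duplicates() rather than unique().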
[]
# Leave this list as is
number_cols = ['HP', 'Attack', 'Defense']
# Remove the feature without variance from this list
non_number_cols = ['Name', 'Type', 'Legendary']
# Create a new dataframe by subselecting the chosen features
df_selected = pokemon_df[number_cols + non_number_cols]
<script.py> output:
HP Attack Defense Name Type Legendary
0 45 49 49 Bulbasaur Grass False
1 60 62 63 Ivysaur Grass False
2 80 82 83 Venusaur Grass False
3 80 100 123 VenusaurMega Venusaur Grass False
4 39 52 43 Charmander Fire False
As in this example, a list of column names inside [] can be used to extract a sub-DataFrame
format
print("{} rows in test set vs. {} in training set. {} Features.".format(X_test.shape[0], X_train.shape[0], X_test.shape[1]))
Print the result as a percentage with one decimal place
print("{0:.1%} accuracy on test set.".format(acc))
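A runnable version of the percentage format with a made-up accuracy value (acc = 0.8762 is an assumption for illustration). The .1% spec multiplies by 100, keeps one decimal and appends the percent sign; an f-string accepts the same spec.

```python
acc = 0.8762  # hypothetical accuracy, as a fraction between 0 and 1

msg = "{0:.1%} accuracy on test set.".format(acc)
print(msg)  # 87.6% accuracy on test set.

# Same format spec inside an f-string
f_msg = f"{acc:.1%} accuracy on test set."
print(f_msg)
```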
isnull()
Checks for missing values
Returns booleans (True where a value is missing), in the same shape as the input
.sum()
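Besides adding up numbers, it can also count: on a boolean result True counts as 1, so the sum is the number of True values (it works like a count)
df.isnull().sum()  # df is your DataFrame; pd.isnull(df).sum() is equivalent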
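Putting isnull() and sum() together on a small made-up DataFrame (the data and names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
})

mask = df.isnull()              # DataFrame of booleans, same shape as df
print(mask)

missing_per_col = mask.sum()    # True counts as 1: missing values per column
print(missing_per_col)

total_missing = int(missing_per_col.sum())  # missing values in the whole table
print(total_missing)  # 3
```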
.dtypes
DataFrame.dtypes
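Returns the dtypes in the DataFrame
This returns a Series with the data type of each column. The result's index is the original DataFrame's columns. Columns with mixed types are stored with the object dtype
1. type() returns the data type of its argument
2. dtype returns the data type of the elements in an array
3. astype() converts the data to another type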
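The three items above, tried out on a small array and a made-up DataFrame (column names n and s are assumptions). The exact integer dtype depends on the platform, so it is only printed here, not relied on.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

print(type(arr))   # type of the object itself: numpy.ndarray
print(arr.dtype)   # type of the elements: an integer dtype (platform dependent)

as_float = arr.astype(float)   # returns a converted copy
print(as_float.dtype)          # float64

# Hypothetical DataFrame: dtypes gives one entry per column
df = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})
print(df.dtypes)

n_as_float = df["n"].astype(float)  # astype works on a column as well
print(n_as_float.dtype)             # float64
```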
value_counts()
value_counts() is a quick way to see which distinct values a column contains and how many times each of them occurs.
In other words, it is a frequency count
In [3]: volunteer["category_desc"].value_counts()
Out[3]:
Strengthening Communities 307
Helping Neighbors in Need 119
Education 92
Health 52
Environment 32
Emergency Preparedness 15
Name: category_desc, dtype: int64
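A self-contained sketch on a made-up Series, since the volunteer data is not included here. It also shows normalize=True, which returns shares instead of counts.

```python
import pandas as pd

# Hypothetical category column
s = pd.Series(["Education", "Health", "Education",
               "Environment", "Education", "Health"])

counts = s.value_counts()   # sorted from most to least frequent
print(counts)

shares = s.value_counts(normalize=True)  # fractions that add up to 1
print(shares)
```

By default missing values are left out of the count; passing dropna=False includes them.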
apply
I'll put an example here first and add more on this function later; it feels really handy for simple processing
# Create a list of the columns to average
run_columns = ["run1", "run2", "run3", "run4", "run5"]
# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)
# Take a look at the results
print(running_times_5k)
<script.py> output:
name run1 run2 run3 run4 run5 mean
0 Sue 20.1 18.5 19.6 20.3 18.3 19.36
1 Mark 16.5 17.1 16.9 17.6 17.3 17.08
2 Sean 23.5 25.1 25.2 24.6 23.9 24.46
3 Erin 21.7 21.1 20.9 22.1 22.2 21.60
4 Jenny 25.8 27.1 26.1 26.7 26.9 26.52
5 Russell 30.9 29.6 31.4 30.4 29.9 30.44
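A runnable sketch of the same idea, using the first two rows of the output above as data. With axis=1 the lambda is called once per row. For a plain mean, the built-in mean(axis=1) should give the same numbers without a Python-level loop; the column names mean_apply and mean_vectorized are made up for the comparison.

```python
import pandas as pd

# First two rows of the table above
running_times_5k = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Row-wise apply: the lambda receives each row as a Series
running_times_5k["mean_apply"] = running_times_5k.apply(
    lambda row: row[run_columns].mean(), axis=1
)

# Vectorized alternative for the same calculation
running_times_5k["mean_vectorized"] = running_times_5k[run_columns].mean(axis=1)

print(running_times_5k[["name", "mean_apply", "mean_vectorized"]])
```

apply earns its keep when the per-row logic is something pandas has no built-in for; for simple aggregates the vectorized form is usually faster.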