Python Data Analysis Study Notes

This article is a repost. 2019-06-17 19:36, category: Python

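The book Python for Data Analysis introduces the Python language and libraries for solving a wide range of data analysis problems efficiently. Drawing on that book and other learning resources, these notes summarize the key points of the Python libraries used for data analysis.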

Data analysis libraries

(1) NumPy

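NumPy (Numerical Python) is the foundational package for scientific computing in Python. It supports large multidimensional arrays and matrices and provides a large library of mathematical functions that operate on them. In other words, NumPy is a very fast math library whose main features include:

A fast and efficient multidimensional array object, ndarray
Functions for element-wise computation on arrays and for mathematical operations applied directly to whole arrays
Tools for reading and writing array-based datasets on disk
Linear algebra, Fourier transforms, and random number generation
Tools for integrating C, C++, and Fortran code into Python

NumPy's most important feature is its N-dimensional array object (ndarray), a fast and flexible container for large datasets. The common syntax for working with array objects is worth mastering.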

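import numpy as np
data = np.array([1, 2, 3, 4, 5])
print(data)
# Output:
# [1 2 3 4 5]

print(data.shape)  # shape: tuple giving the size of each dimension
# Output (a one-dimensional array):
# (5,)

print(data.ndim)  # ndim: number of dimensions
# Output:
# 1
print(data.size)  # size: total number of elements
# Output:
# 5
print(data.dtype)  # dtype: data type of the array elements
# Output (platform-dependent; int64 on many systems):
# int32
print(np.zeros((2, 3)))  # create an all-zeros array of the given shape
# Output:
# [[0. 0. 0.]
#  [0. 0. 0.]]
print(np.ones((3, 6)))  # create an all-ones array of the given shape
# Output:
# [[1. 1. 1. 1. 1. 1.]
#  [1. 1. 1. 1. 1. 1.]
#  [1. 1. 1. 1. 1. 1.]]
print(np.empty((3, 2, 3)))  # allocate an array without initializing its values
# Output (arbitrary: whatever happened to be in memory):
# [[[1. 1. 1.]
#   [1. 1. 1.]]
#
#  [[1. 1. 1.]
#   [1. 1. 1.]]
#
#  [[1. 1. 1.]
#   [1. 1. 1.]]]
print(np.arange(0, 10, 2))  # arange is the array version of Python's built-in range
# Output:
# [0 2 4 6 8]
print(np.arange(15).reshape(3, 5))  # reshape into a 3x5 matrix
# Output:
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]
print(np.arange(10)[5:8])  # slicing
# Output:
# [5 6 7]
print(np.random.random((2, 3)))  # numpy.random supplements Python's built-in random module
# Output (random; values differ on every run):
# [[0.63545712 0.36970827 0.27986446]
#  [0.49481143 0.76131889 0.65610538]]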

a = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])

a
Out[8]: 
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

a[0], a[-1]     # 2-D array indexing: first row and last row
Out[9]: (array([1, 2, 3]), array([7, 8, 9]))

a[:, 1]    # 2-D array slicing: second column
Out[10]: array([2, 5, 8])

a[1:3, :]    # 2-D array slicing: second and third rows
Out[11]: 
array([[4, 5, 6],
       [7, 8, 9]])

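a    # a has been rebound to a 3x2 array of random floats (the assignment is not shown)
Out[15]: 
array([[ 0.34927643,  0.56167914],
       [ 0.53429451,  0.38356559],
       [ 0.37718082,  0.32356081]])

a.ravel()     # flatten the array to one dimension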
Out[16]: 
array([ 0.34927643,  0.56167914,  0.53429451,  0.38356559,  0.37718082,
        0.32356081])

a = np.random.randint(10, size=(3,3))

b = np.random.randint(10, size=(3,3))

a, b
Out[19]: 
(array([[0, 5, 6],
        [3, 1, 5],
        [5, 2, 1]]), array([[8, 3, 4],
        [6, 1, 1],
        [8, 5, 5]]))

np.vstack((a, b))    # stack the arrays vertically (row-wise)
Out[20]: 
array([[0, 5, 6],
       [3, 1, 5],
       [5, 2, 1],
       [8, 3, 4],
       [6, 1, 1],
       [8, 5, 5]])

np.hstack((a, b))  # stack the arrays horizontally (column-wise)
Out[21]: 
array([[0, 5, 6, 8, 3, 4],
       [3, 1, 5, 6, 1, 1],
       [5, 2, 1, 8, 5, 5]])

np.hsplit(a, 3)   # split the array horizontally (column-wise) into 3 parts
Out[22]: 
[array([[0],
        [3],
        [5]]), array([[5],
        [1],
        [2]]), array([[6],
        [5],
        [1]])]

np.vsplit(a, 3)   # split the array vertically (row-wise) into 3 parts
Out[23]: [array([[0, 5, 6]]), array([[3, 1, 5]]), array([[5, 2, 1]])]

a = np.array(([1, 4, 3], [6, 2, 9], [4, 7, 2]))

a
Out[25]: 
array([[1, 4, 3],
       [6, 2, 9],
       [4, 7, 2]])

np.min(a, axis=1)    # minimum of each row
Out[26]: array([1, 2, 2])

np.max(a, axis = 0)   # maximum of each column
Out[27]: array([6, 7, 9])

np.argmax(a, axis=0)  # index of the maximum in each column
Out[28]: array([1, 2, 1], dtype=int64)

np.argmin(a, axis=1)  # index of the minimum in each row
Out[29]: array([0, 1, 2], dtype=int64)

# Array statistics

np.median(a, axis=0)    # median of each column
Out[30]: array([ 4.,  4.,  3.])

np.mean(a, axis=1)     # arithmetic mean of each row
Out[31]: array([ 2.66666667,  5.66666667,  4.33333333])

np.average(a, axis=0)   # average of each column (weighted only if weights= is passed; here it equals the mean)
Out[32]: array([ 3.66666667,  4.33333333,  4.66666667])

np.var(a, axis=1)      # variance of each row
Out[33]: array([ 1.55555556,  8.22222222,  4.22222222])
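The np.average call above passes no weights, so it returns the plain mean. As a minimal sketch of the weighted case, the weights below (1, 2, 1, one per row) are hypothetical values chosen only for illustration:

```python
import numpy as np

a = np.array([[1, 4, 3], [6, 2, 9], [4, 7, 2]])

# Without weights, np.average gives the same result as np.mean.
plain = np.average(a, axis=0)

# Hypothetical weights, one per row: the middle row counts twice as much.
w = np.array([1, 2, 1])
weighted = np.average(a, axis=0, weights=w)

print(plain)
print(weighted)   # [4.25 3.75 5.75]
```

Each weighted column value is sum(w * column) / sum(w), for example (1*1 + 2*6 + 1*4) / 4 = 4.25 for the first column.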

Standardize data with the Z-Score algorithm. The Z-Score standardization formula is:

\[Z = \frac{X-\mathrm{mean}(X)}{\mathrm{sd}(X)} \]

# Z-Score standardization
import numpy as np
# Define the function according to the formula
def zscore(x, axis=None):
    xmean = x.mean(axis=axis, keepdims=True)
    xstd = np.std(x, axis=axis, keepdims=True)
    return (x - xmean) / xstd

# Generate random data
Z = np.random.randint(10, size=(5, 5))
print(Z)
print(zscore(Z))
# Output (random; values differ on every run):
# [[1 2 2 6 2]
#  [0 0 4 0 1]
#  [4 0 9 2 1]
#  [3 7 1 5 3]
#  [4 2 4 5 2]]
# [[-0.78935222 -0.35082321 -0.35082321  1.40329283 -0.35082321]
#  [-1.22788123 -1.22788123  0.52623481 -1.22788123 -0.78935222]
#  [ 0.52623481 -1.22788123  2.71887986 -0.35082321 -0.78935222]
#  [ 0.0877058   1.84182184 -0.78935222  0.96476382  0.0877058 ]
#  [ 0.52623481 -0.35082321  0.52623481  0.96476382 -0.35082321]]
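With axis=None the whole matrix is standardized against one global mean and standard deviation. In practice each column (feature) is often standardized separately; the sketch below assumes that use case and reuses the sample matrix printed above as fixed input so the result is reproducible:

```python
import numpy as np

def zscore(x, axis=None):
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    return (x - mean) / std

Z = np.array([[1, 2, 2, 6, 2],
              [0, 0, 4, 0, 1],
              [4, 0, 9, 2, 1],
              [3, 7, 1, 5, 3],
              [4, 2, 4, 5, 2]])

# axis=0 standardizes every column independently.
by_column = zscore(Z, axis=0)

# Each column should now have mean ~0 and standard deviation ~1.
print(by_column.mean(axis=0))
print(by_column.std(axis=0))
```

Because keepdims=True keeps the reduced axis with length 1, the mean and standard deviation broadcast correctly against x for any choice of axis.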

Standardize data with the Min-Max algorithm. The Min-Max standardization formula is:

\[Y = \frac{Z-\min(Z)}{\max(Z)-\min(Z)} \]

# Min-Max standardization
import numpy as np
def min_max(x, axis=None):
    # named lo/hi to avoid shadowing the built-in min and max
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / (hi - lo)
Z = np.random.randint(10, size=(5, 5))
print(Z)
print(min_max(Z))
# Output (random; values differ on every run):
# [[3 8 3 2 7]
#  [9 2 6 3 4]
#  [4 5 9 0 1]
#  [6 6 4 1 4]
#  [1 2 2 1 6]]
# [[ 0.33333333  0.88888889  0.33333333  0.22222222  0.77777778]
#  [ 1.          0.22222222  0.66666667  0.33333333  0.44444444]
#  [ 0.44444444  0.55555556  1.          0.          0.11111111]
#  [ 0.66666667  0.66666667  0.44444444  0.11111111  0.44444444]
#  [ 0.11111111  0.22222222  0.22222222  0.11111111  0.66666667]]
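The Min-Max formula divides by max - min, which is zero whenever the values along the chosen axis are all equal. The variant below is a sketch of one possible guard, not part of the original notes: the name min_max_safe and the choice to map a constant column to 0 are assumptions made for illustration.

```python
import numpy as np

def min_max_safe(x, axis=None):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    span = hi - lo
    # Assumption: where the range is zero, divide by 1 so the result is 0 instead of NaN.
    span = np.where(span == 0, 1, span)
    return (x - lo) / span

data = np.array([[3, 5, 1],
                 [9, 5, 4],
                 [6, 5, 7]])   # the middle column is constant

scaled = min_max_safe(data, axis=0)
print(scaled)
```

The first and last columns are scaled to the range 0 to 1, while the constant middle column becomes all zeros rather than producing a division-by-zero warning.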

Normalize data with the L2 norm. The L2 norm is computed as:

\[L_2 = \sqrt{x_1^2 + x_2^2 + \ldots + x_i^2} \]

# L2-norm normalization
import numpy as np
def l2_normalize(v, axis=-1, order=2):
    l2 = np.linalg.norm(v, ord=order, axis=axis, keepdims=True)
    l2[l2 == 0] = 1  # avoid division by zero for all-zero rows
    return v / l2
Z = np.random.randint(10, size=(5, 5))
print(Z)
print(l2_normalize(Z))
# Output (random; values differ on every run):
# [[2 0 2 0 4]
#  [8 1 9 2 1]
#  [8 6 4 2 5]
#  [4 4 7 5 5]
#  [3 6 3 1 0]]
# [[ 0.40824829  0.          0.40824829  0.          0.81649658]
#  [ 0.65103077  0.08137885  0.73240961  0.16275769  0.08137885]
#  [ 0.66436384  0.49827288  0.33218192  0.16609096  0.4152274 ]
#  [ 0.34948162  0.34948162  0.61159284  0.43685203  0.43685203]
#  [ 0.40451992  0.80903983  0.40451992  0.13483997  0.        ]]

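Summary: ndarray is a general-purpose multidimensional container for homogeneous data; that is, all of its elements must have the same type. Beyond the usage shown above, NumPy supports many other array computations, including indexing, dot products, transposition, fast element-wise array functions (abs(), sqrt(), exp(), add(), maximum()), logical operations, array statistics methods (mean(), sum(), std(), var()), sorting (sort) and set operations, and linear algebra functions (dot(), diag(), trace(), det(), eig()).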
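A few of the operations listed in the summary can be sketched as follows. The array b is a hypothetical second operand introduced only for this illustration:

```python
import numpy as np

a = np.array([[1, 4, 3], [6, 2, 9], [4, 7, 2]])
b = np.array([[2, 0, 1], [1, 3, 5], [8, 1, 1]])   # hypothetical second array

print(np.sqrt(a))            # element-wise square root
print(np.maximum(a, b))      # element-wise maximum of two arrays
print(a > 3)                 # logical operation: boolean mask
print(a[a > 3])              # boolean indexing: [4 6 9 4 7]
print(np.sort(a, axis=1))    # sort each row
print(a.T)                   # transpose
print(a.dot(b))              # matrix (dot) product
print(np.trace(a))           # sum of the diagonal: 5
print(np.linalg.det(a))      # determinant, approximately 139
```

Element-wise functions such as np.sqrt and np.maximum return a new array of the same shape, while reductions such as np.trace and np.linalg.det return a single value.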

(2) pandas

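pandas provides high-level data structures and tools that make data analysis faster and easier. It combines NumPy's high-performance array computing with the flexible data manipulation of spreadsheets and relational databases (such as SQL). It is built on NumPy and makes NumPy-centric applications simpler to write. pandas data structures: the two most widely used are Series (one-dimensional) and DataFrame (two-dimensional). Older pandas releases also offered Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (higher-dimensional), but these have been removed from current releases.

A Series is a one-dimensional labeled array that can hold any data type, including integers, strings, floating-point numbers, and Python objects. Its elements can be located by label. The basic constructor is pandas.Series(data=None, index=None).
A DataFrame is a two-dimensional labeled data structure whose data can be located by row and column labels, something plain NumPy arrays do not offer. The basic constructor is pandas.DataFrame(data=None, index=None, columns=None).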

# Creating a Series
# Method 1: create a Series from a list
import pandas as pd
obj = pd.Series([4, 7, -5, 3])
print(obj)
# Output:
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64

# Method 2: create a Series from an ndarray
import numpy as np
import pandas as pd
n = np.random.randn(5)
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(n, index=index)
print(s)
# Output (random; values differ on every run):
# a   -1.110084
# b   -0.141548
# c    0.177586
# d   -0.437167
# e   -1.348287
# dtype: float64

# Method 3: create a Series from a dictionary
import pandas as pd
d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
s = pd.Series(d)
print(s)
# Output:
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int64
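Since a Series is located by label, it may help to see label-based access and label alignment in action. This is a minimal sketch; the second Series t and its values are hypothetical and exist only to show alignment:

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

print(s['c'])            # single label: 3
print(s[['a', 'e']])     # list of labels
print(s[s > 2])          # boolean filter keeps labels c, d, e

# Hypothetical second Series with partly different labels.
t = pd.Series({'c': 10, 'd': 20, 'x': 30})

# Arithmetic aligns on labels; labels present in only one Series give NaN.
total = s + t
print(total)
```

Only the labels c and d exist in both Series, so only those two entries of total hold numbers (13 and 24); the other four are NaN.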

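A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. It has both a row index and a column index and can be thought of as a dictionary of Series that share the same index. Row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. There are many ways to build a DataFrame; the most common is to pass a dictionary of equal-length lists or NumPy arrays.

Creating a DataFrame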

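# Method 1: create a DataFrame from a dictionary of lists
import pandas as pd
data = {'name':['Jame','Lily','Noe'],'age':[21,19,17]}
frame = pd.DataFrame(data)
print(frame)
# Output (from an older pandas that sorted column names; newer versions keep the dict order: name, age):
#    age  name
# 0   21  Jame
# 1   19  Lily
# 2   17   Noe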

Converting an np.array to a pd.DataFrame

# Method 2: create a DataFrame from a NumPy array
import numpy as np
import pandas as pd
data = np.array([('Jame',21),('Lily',19),('Noe',17)])  # mixed values are stored as strings
frame = pd.DataFrame(data,index = range(1,4),columns=['name','age'])
print(frame)
# Output:
#    name age
# 1  Jame  21
# 2  Lily  19
# 3   Noe  17
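One detail worth noting about the NumPy route: an array built from mixed tuples holds only strings, so the age column is text rather than numbers. The sketch below converts it with astype and then uses label-based selection; the threshold 18 is an arbitrary illustrative value:

```python
import numpy as np
import pandas as pd

data = np.array([('Jame', 21), ('Lily', 19), ('Noe', 17)])
frame = pd.DataFrame(data, index=range(1, 4), columns=['name', 'age'])
print(frame.dtypes)                  # both columns hold strings at this point

# Convert the age column to integers so numeric operations work.
frame['age'] = frame['age'].astype(int)
print(frame['age'].mean())           # 19.0

print(frame.loc[2, 'name'])          # row label 2, column 'name': Lily
adults = frame[frame['age'] > 18]    # boolean filter on a column
print(adults)
```

Building the DataFrame from a dictionary of lists avoids the conversion step, because each column keeps its own type.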

(3) matplotlib

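Plotting is an important part of data analysis: it helps us spot outliers, identify necessary data transformations, and come up with ideas for models. Python has many visualization tools; this section focuses on plotting with Matplotlib.
To plot with Matplotlib, import the pyplot module, conventionally aliased as plt. pyplot is Matplotlib's core plotting interface, a state-based layer over its object-oriented API, and almost every kind of 2D figure can be drawn through it. For example, a single plotting call draws a 2D figure: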

from matplotlib import pyplot as plt
plt.plot([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
         [1, 2, 3, 2, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1])

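Similarly, plt.bar([1, 2, 3], [1, 2, 3]) draws a bar chart, plt.scatter() a scatter plot, plt.pie() a pie chart, and so on.
Matplotlib also provides a MATLAB-compatible API through the pylab module (the Matplotlib documentation now discourages pylab in favor of pyplot):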

import numpy as np
from matplotlib import pylab
x = np.linspace(0, 10, 20)     # generate 20 evenly spaced values from 0 to 10 with NumPy
y = x*x +2
pylab.plot(x, y, 'r')

To draw subplots, use the subplot function:

pylab.subplot(1, 2, 1)
pylab.plot(x, y, 'r--')
pylab.subplot(1, 2, 2)
pylab.plot(y, x, 'g*-')
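The same two-panel figure can also be sketched with pyplot's object-oriented style, where plt.subplots returns the figure and its axes explicitly. The Agg backend line is an assumption so that the script runs without a display, and the panel titles are illustrative:

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, so no display is needed
import numpy as np
from matplotlib import pyplot as plt

x = np.linspace(0, 10, 20)
y = x * x + 2

# One row, two columns of axes in a single figure.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].plot(x, y, 'r--')        # red dashed line
axes[0].set_title('y = x^2 + 2')

axes[1].plot(y, x, 'g*-')        # green line with star markers
axes[1].set_title('x against y')

fig.tight_layout()
# Call fig.savefig(...) or plt.show() afterwards to view the result.
```

Working with explicit Axes objects makes it clear which panel each call modifies, which becomes useful once a figure has more than a couple of subplots.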

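The examples above used Matplotlib's pyplot module to draw simple 2D figures. Matplotlib can also draw 3D figures; unlike 2D plotting, 3D plotting is done mainly through the mplot3d toolkit.
mplot3d contains four main submodules:

mpl_toolkits.mplot3d.axes3d
mpl_toolkits.mplot3d.axis3d
mpl_toolkits.mplot3d.art3d
mpl_toolkits.mplot3d.proj3d
axes3d contains the classes and methods that do the actual plotting. axis3d contains the classes and methods related to the coordinate axes. art3d contains classes and methods that convert 2D artists for use in 3D plots. proj3d contains miscellaneous helper classes and functions.
The most commonly used piece is the mpl_toolkits.mplot3d.axes3d.Axes3D class. For example, to draw a 3D scatter plot: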

import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

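# x, y, z are each 100 random samples from a normal distribution with mean 0 and standard deviation 1
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
z = np.random.normal(0, 1, 100)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches the axes to the figure in recent Matplotlib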
ax.scatter(x, y, z)

(4) SciPy

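SciPy (Scientific Python) is an open-source Python library of algorithms and mathematical tools. Its modules cover optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solvers, and other computations common in science and engineering.

Constants module
To make scientific computing more convenient, SciPy provides the scipy.constants module, which contains commonly used physical and mathematical constants and units.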

from scipy import constants

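constants.pi    # the mathematical constant pi
Out[47]: 3.141592653589793

constants.golden    # the golden ratio
Out[48]: 1.618033988749895

constants.c, constants.speed_of_light     # speed of light in vacuum (two names for the same constant)
Out[49]: (299792458.0, 299792458.0)

constants.h, constants.Planck     # the Planck constant (newer SciPy releases return 6.62607015e-34)
Out[50]: (6.62607004e-34, 6.62607004e-34)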
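Besides physical constants, scipy.constants also exposes units expressed in SI terms, which makes simple conversions easy to write. A brief sketch, where the speed of 100 km/h is an arbitrary example value:

```python
from scipy import constants

print(constants.kilo)     # SI prefix: 1000.0
print(constants.inch)     # one inch in meters: 0.0254
print(constants.hour)     # one hour in seconds: 3600.0

# Convert an example speed of 100 km/h to meters per second.
speed_kmh = 100.0
speed_ms = speed_kmh * constants.kilo / constants.hour
print(speed_ms)

# Temperature scales are converted with a helper function.
boiling_k = constants.convert_temperature(100.0, 'Celsius', 'Kelvin')
print(boiling_k)          # 373.15
```

Every unit in the module is a plain float giving its value in SI base units, so conversions are ordinary multiplication and division.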

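Linear algebra
Linear algebra is one of the most common kinds of computation in scientific computing, and SciPy offers a wide range of linear algebra functions, almost all of them in the scipy.linalg module. They fall roughly into: basic routines, eigenvalue problems, matrix decompositions, matrix functions, matrix equation solvers, and special matrix constructors.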

import numpy as np

from scipy import linalg

linalg.inv(np.array([[1, 2], [3, 4]]))   # matrix inverse via scipy.linalg.inv
Out[53]: 
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

U, s, Vh = linalg.svd(np.random.randn(5, 4))    # singular value decomposition of a random matrix via scipy.linalg.svd

U, s, Vh
Out[55]: 
(array([[-0.25343531, -0.88646496,  0.01813706,  0.3488603 ,  0.16708669],
        [-0.34041342, -0.04125014, -0.19885736, -0.65156027,  0.6467937 ],
        [ 0.57594129,  0.08312343,  0.32251328,  0.28171905,  0.69137667],
        [-0.53214652,  0.18716945,  0.82464561,  0.03584257,  0.02150823],
        [-0.45277031,  0.41296053, -0.41960887,  0.61083172,  0.27436408]]),
 array([ 2.9389141 ,  2.27087176,  1.9896549 ,  0.59718697]),
 array([[-0.15608057,  0.10167361, -0.35736611, -0.91519987],
        [ 0.46960652,  0.36611405, -0.76093007,  0.25771234],
        [ 0.28471854,  0.79846434,  0.50650073, -0.15762948],
        [-0.82100178,  0.46698787, -0.1916557 ,  0.26673301]]))
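One way to check a decomposition like this is to multiply the factors back together. In the sketch below the seed 0 is an arbitrary choice made for reproducibility, and scipy.linalg.diagsvd expands the 1-D vector of singular values into the 5x4 Sigma matrix:

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
A = rng.standard_normal((5, 4))

U, s, Vh = linalg.svd(A)
print(U.shape, s.shape, Vh.shape)       # (5, 5) (4,) (4, 4)

# s holds only the singular values; rebuild the rectangular Sigma matrix.
Sigma = linalg.diagsvd(s, 5, 4)

# A should equal U @ Sigma @ Vh up to floating-point error.
A_rebuilt = U @ Sigma @ Vh
print(np.allclose(A, A_rebuilt))        # True
```

The singular values in s are returned in descending order, which matches the output shown above.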

scipy.linalg.lstsq solves least-squares problems. Let us use it to work through a least-squares fit.
First, give the sample \(x\) and \(y\) values, and assume they follow \(y = ax^2 + b\).

import numpy as np
x = np.array([1, 2.5, 3.5, 4, 5, 7, 8.5])
y = np.array([0.3, 1.1, 1.5, 2.0, 3.2, 6.6, 8.6])

Then compute \(x^2\) and add a column of ones for the intercept term:

M = x[:, np.newaxis]**[0, 2]
print(M)
# Output: the columns are x^0 (all ones, for the intercept) and x^2
# [[  1.     1.  ]
#  [  1.     6.25]
#  [  1.    12.25]
#  ..., 
#  [  1.    25.  ]
#  [  1.    49.  ]
#  [  1.    72.25]]

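Next, run the least-squares computation with linalg.lstsq. The first element of the returned tuple holds the fitted coefficients, here in the order \(b\), \(a\):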

from scipy import linalg
p = linalg.lstsq(M, y)[0]
print(p)
# Output: the fitted coefficients [b, a]
# [ 0.20925829  0.12013861]

Finally, plot the samples together with the fitted curve to check whether the parameters found by least squares are reasonable.

from matplotlib import pyplot as plt
plt.scatter(x, y)
xx = np.linspace(0, 10, 100)
yy = p[0] + p[1]*xx**2
plt.plot(xx, yy)
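Alongside the visual check, the fit can be verified numerically. The sketch below is one possible cross-check, not part of the original workflow: it solves the normal equations \(M^T M p = M^T y\) directly and computes the coefficient of determination \(R^2\):

```python
import numpy as np
from scipy import linalg

x = np.array([1, 2.5, 3.5, 4, 5, 7, 8.5])
y = np.array([0.3, 1.1, 1.5, 2.0, 3.2, 6.6, 8.6])
M = x[:, np.newaxis] ** [0, 2]           # columns: 1 and x^2

# lstsq returns the solution, the residues, the rank of M, and its singular values.
p, residues, rank, sv = linalg.lstsq(M, y)

# Cross-check: solve the normal equations M^T M p = M^T y.
p_normal = np.linalg.solve(M.T @ M, M.T @ y)

# Goodness of fit: R^2 = 1 - SS_res / SS_tot.
y_hat = M @ p
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(p)          # [b, a]
print(p_normal)   # should match p
print(r2)
```

An \(R^2\) close to 1 indicates that the assumed form \(y = ax^2 + b\) explains most of the variation in the samples.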

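Interpolation
Interpolation is the process, in numerical analysis, of estimating new data points within the range of a set of known discrete data points. SciPy's scipy.interpolate module contains a large number of interpolation methods.
As an example, here is linear interpolation with SciPy. First, give a set of \(x\) and \(y\) values.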

import numpy as np
from matplotlib import pyplot as plt
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
plt.scatter(x, y)

To insert one new value between each pair of neighboring points above, we can use linear interpolation.

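from scipy import interpolate
xx = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5])  # x coordinates of the points between each pair of samples
f = interpolate.interp1d(x, y)  # build the interpolation function from the original samples
yy = f(xx)  # evaluate it at the new points
plt.scatter(x, y)
plt.scatter(xx, yy, marker='*')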
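Because the samples here come from \(y = x^2\), the interpolation error can be measured exactly. The sketch below compares the default linear interpolation with the kind='cubic' option of interp1d, and also shows that np.interp gives the same linear result:

```python
import numpy as np
from scipy import interpolate

x = np.arange(10)
y = x ** 2
xx = x[:-1] + 0.5                 # midpoints 0.5, 1.5, ..., 8.5
exact = xx ** 2                   # true values of the underlying function

linear = interpolate.interp1d(x, y)(xx)
cubic = interpolate.interp1d(x, y, kind='cubic')(xx)
linear_np = np.interp(xx, x, y)   # NumPy's built-in linear interpolation

print(np.abs(linear - exact).max())   # 0.25 at every midpoint
print(np.abs(cubic - exact).max())    # much smaller than the linear error
print(np.allclose(linear, linear_np)) # True
```

Linear interpolation averages the two neighboring samples, so at each midpoint it overshoots the parabola by exactly 0.25; a smoother interpolant follows the curve far more closely.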

(5) scikit-learn

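scikit-learn, sklearn for short, is an open-source machine learning framework and an important Python package. It contains many mature algorithms, covering:

Classification
Regression
Clustering (unsupervised classification)
Dimensionality reduction
Model selection
Data preprocessing
For how to use scikit-learn, see my other post: Python Machine Learning (by Sebastian) study notes, Chapter 6: hands-on model evaluation and hyperparameter tuning (Windows, Spyder, Python 3.6).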
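As a small taste of how several of these pieces fit together, the sketch below chains data preprocessing, classification, and a simple hold-out evaluation. The dataset (iris), the classifier (logistic regression), and the split parameters are all illustrative choices, not taken from the referenced post:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Model selection: hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Preprocessing (z-score standardization) followed by a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

Wrapping the scaler and the classifier in one pipeline ensures the standardization statistics are learned from the training split only and then reused on the test split.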

Suggestions are very welcome.
This blog is mainly for learning and sharing!
