Learning pandas and Plotting with matplotlib

Reposted article, 2019-11-26 22:09. Category: data analysis + quantitative finance

Learning pandas

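1. Introduction

pandas is a powerful Python data analysis toolkit built on top of NumPy. Its arrival helped make Python one of the most widely used and capable environments for data analysis.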

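Main features of pandas:
1. Data structures with built-in alignment: DataFrame and Series
2. Integrated time series functionality
3. A rich set of mathematical operations
4. Flexible handling of missing data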

Installation: pip install pandas
Import: import pandas as pd

import pandas as pd
import numpy as np
arr1 = np.array([1,2,3,4,5])
arr2 = np.array([5,4,3,2,1])
print(arr1*arr2)
[5 8 9 8 5]

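2. Series

A Series is an object similar to a one-dimensional array. It consists of a set of data and an associated set of data labels (the index), which makes it somewhat like a Python dictionary.

2.1 Creation methods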

## Method 1
s1 = pd.Series([1,2,3,4])
print(s1)
0    1
1    2
2    3
3    4
dtype: int64
print(s1[1])
2

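This prints the index and the values of the Series, with the index on the left and the values on the right. Because no index was specified for the data, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. Values can be retrieved by index, just as with the arrays and lists covered earlier.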


## Method 2
s2 = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s2)
print(s2['a'])
a    1
b    2
c    3
d    4
e    5
dtype: int64
1
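## Custom index: index is a list of labels (strings here). Values can still be fetched by
## integer position; recent pandas versions deprecate s2[1] for this, so iloc is used
print(s2.iloc[1])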


2
### Method 3
s3 = pd.Series({"a":1,"b":2})
print(s3)
print(s3["a"])
## the dict keys are used as the index

a    1
b    2
dtype: int64
1

### Method 4

s4 = pd.Series(0,index = ['a','b','c'])
print(s4)


a    0
b    0
c    0
dtype: int64

A Series can be thought of as a fixed-length, ordered dictionary: each index label is matched with a data value by position, much like the key-value pairs of a dict.

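Missing data in a Series:
1. dropna(): filters out entries whose value is NaN
2. fillna(): fills in missing data
3. isnull(): returns a boolean array, True where a value is missing
4. notnull(): returns a boolean array, False where a value is missing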

### Step 1: create a dict and build a Series object from it

st = {"age":18,"name":"szp","gender":"male"}

obj = pd.Series(st)
print(obj)
print(obj['name'])


age         18
name       szp
gender    male
dtype: object
szp
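### Step 2: define an index variable (a list, because current pandas rejects an unordered set as an index)
a = ["gender","address","age","name"]
### Step 3
obj1 = pd.Series(st,index = a)
print(obj1)  ## pass the variable a defined in step 2 as the index

Printed result: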
gender     male
address     NaN
age          18
name        szp
dtype: object

## Because address does not appear among the keys of st, its value is missing (NaN)


## The code above gave a first look at missing values; next, see how to detect them

print(obj1) ## print the object first to take a look

gender     male
address     NaN
age          18
name        szp
dtype: object


obj1.isnull() ## returns True where a value is missing

## Output
gender     False
address     True
age        False
name       False
dtype: bool


obj1.notnull()  ### returns False where a value is missing
## Output
gender      True
address    False
age         True
name        True
dtype: bool

## Filter out missing values with boolean indexing
obj1[obj1.notnull()]
## Output: entries that are null (missing) are dropped
gender    male
age         18
name       szp
dtype: object
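
The walkthrough above exercises isnull() and notnull(); the other two methods from the list, dropna() and fillna(), can be sketched the same way. The Series `scores` and its values are hypothetical sample data, not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the entry under label "b" is missing (NaN).
scores = pd.Series([90.0, np.nan, 75.0], index=["a", "b", "c"])

# dropna() returns a new Series without the missing entries.
cleaned = scores.dropna()

# fillna() returns a new Series with missing entries replaced by the given value.
filled = scores.fillna(0)

# isnull() gives a boolean mask; summing it counts the missing entries.
missing_count = int(scores.isnull().sum())

print(cleaned)
print(filled)
print(missing_count)
```

Both methods return new objects and leave `scores` unchanged, so the result has to be assigned to a name to be kept.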

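Series features:
1. Create a Series from an ndarray: Series(arr)
2. Arithmetic with a scalar (a number): sr * 2
3. Arithmetic between two Series
4. Universal functions: np.abs(sr)
5. Boolean filtering: sr[sr>0]
6. Statistical functions: mean(), sum(), cumsum()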
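
As a minimal sketch of the six array-like features listed above, the following runs each one on a small hypothetical Series; the variable names and sample values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.array([-2, -1, 0, 1, 2]))  # 1. built from an ndarray

doubled = sr * 2        # 2. arithmetic with a scalar
summed = sr + sr        # 3. arithmetic between two Series (aligned by index)
absolute = np.abs(sr)   # 4. NumPy universal function applied element-wise
positive = sr[sr > 0]   # 5. boolean filtering keeps entries where the mask is True
total = sr.sum()        # 6. statistical functions
average = sr.mean()
running = sr.cumsum()   #    cumulative sum

print(doubled.tolist(), absolute.tolist(), positive.tolist())
print(total, average, running.tolist())
```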

Dictionary-like features:
1. Create a Series from a dict: Series(dic)
2. The in operator (membership test): 'a' in sr; iteration: for x in sr
3. Key indexing
4. Key slicing

    print(obj1)

    gender     male
    address     NaN
    age          18
    name        szp
    dtype: object

    print(obj1["gender":"age"])

    gender     male
    address     NaN
    age          18
    dtype: object
   
5. Other functions: get("a",default = 0), etc.
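
The dictionary-like features above can be sketched on a small Series. The Series `sr` and its labels are hypothetical; one thing worth noticing is that iterating over a Series yields values, whereas iterating over a dict yields keys.

```python
import pandas as pd

sr = pd.Series({"a": 1, "b": 2, "c": 3})  # 1. built from a dict

has_a = "a" in sr                  # 2. membership tests the index labels
values = [x for x in sr]           #    iteration yields the values, not the labels
b_value = sr["b"]                  # 3. key indexing
a_to_b = sr["a":"b"]               # 4. key slicing includes BOTH endpoints
fallback = sr.get("z", default=0)  # 5. get() returns the default for a missing label

print(has_a, values, b_value, a_to_b.tolist(), fallback)
```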

## Integer indexes
## The following code demonstrates integer indexes in pandas

sr = pd.Series(np.arange(10))
sr1 = sr[3:].copy()
print(sr1)
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
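 ### Everything looks fine so far, but a problem shows up when taking values with an integer key,
 ### because pandas interprets an integer key on an integer index as a label first, not as a position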


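print(sr1[3])  ## no error: 3 is interpreted as a label, so this prints the value 3 stored under label 3, not the fourth element
#   print(sr1[1]) ### this one raises a KeyError, because sr1 has no label 1

Solution

loc attribute: interprets the key as a label
iloc attribute: interprets the key as a position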


print(sr1.iloc[0])  ### interpreted as a position
print(sr1.loc[3] )  ## interpreted as a label
 ## both print the same result, 3

3
3
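
To make the failure case concrete, here is a small sketch that rebuilds the same sliced Series and catches the KeyError instead of commenting the line out. The names `by_label`, `by_position`, `last` and `lookup_failed` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Labels are 3..9 after slicing, while positions run 0..6.
sr1 = pd.Series(np.arange(10))[3:].copy()

by_label = sr1.loc[3]      # label 3 holds the value 3
by_position = sr1.iloc[0]  # the first element is also the value 3
last = sr1.iloc[-1]        # negative positions work with iloc

try:
    sr1[1]                 # 1 is treated as a label, and no such label exists
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(by_label, by_position, last, lookup_failed)
```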

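Series data alignment

When pandas performs arithmetic, it first aligns the operands by index and then computes. If the indexes differ, the index of the result is the union of the two operands' indexes.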


sr1 = pd.Series([12,23,34],index = ["c","a","d"])
sr2 = pd.Series([11,20,10],index = ['d','c','a'])
sr1+sr2

Output
a    33
c    32
d    45
dtype: int64
    ## index alignment lets two Series objects be combined directly, regardless of label order


sr3 = pd.Series([11,20,10,14],index = ['d','c','a','b'])
sr1 + sr3

## Output
a    33.0
b     NaN
c    32.0
d    45.0
dtype: float64
    ## sr1 and sr3 do not share the same index: label b exists only in sr3, so it cannot be
    ## computed and the result is NaN, a missing value

## Treat missing values as 0 when adding two Series objects

sr1 = pd.Series([12,23,34],index = ['a','b','c'])
sr3 = pd.Series([11,20,10,14],index = ['a','b','c','d'])
sr1.add(sr3,fill_value = 0)

Output:
a 23.0
b 43.0
c 44.0
d 14.0
dtype: float64
## the missing value is treated as 0, so the result for label d is 14

### There are further flexible arithmetic methods (add, sub, div, mul); they are not covered one by one here
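
For completeness, here is a brief sketch of the other flexible arithmetic methods using the same two Series as the add() example. The choice of fill_value for each call is an assumption made for illustration: 0 is neutral for subtraction, while 1 is used for multiplication and division so that the unmatched label d is not zeroed out.

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=["a", "b", "c"])
sr3 = pd.Series([11, 20, 10, 14], index=["a", "b", "c", "d"])

# Label d is missing from sr1, so fill_value stands in for it.
difference = sr1.sub(sr3, fill_value=0)  # d: 0 - 14
product = sr1.mul(sr3, fill_value=1)     # d: 1 * 14
quotient = sr1.div(sr3, fill_value=1)    # d: 1 / 14

print(difference)
print(product)
print(quotient)
```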

DataFrame

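A DataFrame is a tabular data structure, comparable to a two-dimensional array, that holds an ordered collection of columns. It can be viewed as a dictionary of Series that all share one index.

Creation: a DataFrame can be created in several ways. The most common is to pass a dictionary of equal-length lists or NumPy arrays: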


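## Method 1
pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
## The resulting DataFrame gets an index automatically; columns follow the dict's key order
## in current pandas (older versions sorted the column names)

## Output
   one  two
0    1    4
1    2    3
2    3    2
3    4    1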


## Specifying columns
## The columns parameter sets the column order

data = pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
pd.DataFrame(data,columns=['one','two'])

Output
   one  two
0    1    4
1    2    3
2    3    2
3    4    1
## the result follows the order given by the columns parameter

## Method 2
pd.DataFrame({'one':pd.Series([1,2,3,4],index=['a','b','c','d']),
              'two':pd.Series([6,7,8,9],index=['d','c','b','a'])})

## Output
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

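A passing familiarity with the creation methods above is enough: in practice the data is usually read from a file rather than created by hand.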

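Common attributes and methods for inspecting data:
index: the row index
columns: the column index
T: the transpose
values: the underlying values as an array
describe(): quick summary statistics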
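
A short sketch of these attributes, reusing the small DataFrame from the creation examples; the variable name `df` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [4, 3, 2, 1]})

print(df.index)       # row index (a default integer range here)
print(df.columns)     # column index
print(df.T)           # transpose: rows become columns
print(df.values)      # underlying data as a NumPy array
print(df.describe())  # count, mean, std, min, quartiles, max per column
```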

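Indexing and slicing

Method 1: two pairs of square brackets, selecting the column first and then the row, e.g. df['A'][0]
Method 2 (recommended): the loc/iloc attributes with a single pair of square brackets and a comma, selecting the row first and then the column
    loc attribute: interprets keys as labels
    iloc attribute: interprets keys as positions

When writing values into a DataFrame, use only method 2.
The row and column parts can each be a plain index, a slice, a boolean index, or a fancy index, in any combination. Note: when both parts are fancy indexes, the result may differ from what you expect.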
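
The following sketch illustrates method 2 on a hypothetical DataFrame with string row labels. It also shows one way the fancy-indexing caveat can surprise: with two lists, iloc returns the full block of rows and columns, whereas the same two lists on the underlying NumPy array pick element pairs.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1]},
    index=["w", "x", "y", "z"],
)

cell_by_label = df.loc["x", "A"]            # row label, column label
cell_by_position = df.iloc[1, 0]            # row position, column position
label_slice = df.loc["w":"y", "B"]          # label slices include the end label
filtered = df.loc[df["A"] > 2, ["A", "B"]]  # boolean rows with a list of columns

# Fancy indexes on both parts: pandas returns a 2x2 block,
# while NumPy pairs the lists up and returns two elements.
block = df.iloc[[0, 1], [0, 1]]
pairs = df.values[[0, 1], [0, 1]]

# Writing a value goes through loc/iloc as well.
df.loc["z", "B"] = 100

print(cell_by_label, cell_by_position)
print(label_slice.tolist(), filtered.index.tolist())
print(block.shape, pairs.tolist())
```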

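4. Working with time objects

Time series types:
Timestamp: a specific moment
Fixed period: e.g. January 2019
Time interval: from a start time to an end time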
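
The text does not say which pandas classes represent these three types. As an assumption, the sketch below maps them to pd.Timestamp, pd.Period and pd.Interval; the dates are sample values.

```python
import pandas as pd

# Timestamp: a specific moment.
stamp = pd.Timestamp("2019-11-26 22:09")

# Fixed period: the whole month of January 2019.
period = pd.Period("2019-01", freq="M")

# Time interval: bounded by a start and an end timestamp.
interval = pd.Interval(
    pd.Timestamp("2019-01-01"), pd.Timestamp("2019-01-31"), closed="both"
)

print(stamp.year, stamp.month, stamp.day)
print(period.start_time, period.end_time)
print(interval.length)
print(pd.Timestamp("2019-01-15") in interval)
```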

Python library: datetime (date, time, datetime, timedelta)

dt.strftime() and strptime() convert between time objects and strings. For flexible parsing, the dateutil package offers dateutil.parser.parse().
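
A minimal sketch of these calls follows. The date and the format string are sample values chosen for illustration; dateutil is a third-party package that is normally installed together with pandas.

```python
from datetime import datetime, timedelta

from dateutil import parser

dt = datetime(2019, 11, 26, 22, 9)

# strftime: datetime -> string, using an explicit format.
text = dt.strftime("%Y-%m-%d %H:%M")

# strptime: string -> datetime, the format must match the text exactly.
parsed = datetime.strptime(text, "%Y-%m-%d %H:%M")

# timedelta: a duration that can be added to or subtracted from a datetime.
next_day = dt + timedelta(days=1)

# dateutil guesses the format, which helps with loosely formatted input.
flexible = parser.parse("Nov 26, 2019 10:09 PM")

print(text, parsed, next_day, flexible)
```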

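import time
print(time.time())  ## call the function: seconds since the epoch
print(time.strftime("%Y-%m-%d %H:%M:%S"))  ## strftime requires a format string; this one is only an example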

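5. Data grouping and aggregation

In data analysis we sometimes need to split the data and then run a computation within each group. These operations are an important part of everyday analysis work.

5.1 Grouping: the GroupBy mechanism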

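The data in a pandas object, whether a Series, a DataFrame, or something else, is split into groups according to one or more keys that you provide. The split is performed along a particular axis of the object; a DataFrame, for example, can be grouped on its rows or on its columns. A function is then applied to each group to produce a new value, and finally all the results are combined into a result object.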
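
The split, apply and combine steps described above can be spelled out by hand and compared with groupby. This is only a sketch on hypothetical data; the manual version is written for illustration and is not how pandas implements grouping internally.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Split: one piece of the value column per distinct key.
pieces = {k: df.loc[df["key"] == k, "value"] for k in df["key"].unique()}

# Apply: run a function (here sum) on each piece.
applied = {k: piece.sum() for k, piece in pieces.items()}

# Combine: collect the per-group results into one object.
manual = pd.Series(applied)

# groupby performs the same three steps in one expression.
grouped = df.groupby("key")["value"].sum()

print(manual)
print(grouped)
```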

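Forms a grouping key can take:
A list or array with the same length as the axis being grouped
A value naming a column of the DataFrame
A dict or Series that maps the values on the axis being grouped to group names
A function applied to the axis index or to the individual labels in the index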
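
Each of the four key forms can be tried on one small table. The data, the group names and the `mapping` dict below are hypothetical and serve only to illustrate the forms.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "score": [10, 20, 30, 40]},
    index=["ann", "bob", "amy", "ben"],
)

# An array with the same length as the axis being grouped.
by_array = df["score"].groupby(np.array(["p", "p", "q", "q"])).sum()

# A column name of the DataFrame.
by_column = df.groupby("team")["score"].sum()

# A dict mapping index labels to group names.
mapping = {"ann": "g1", "bob": "g1", "amy": "g2", "ben": "g2"}
by_dict = df["score"].groupby(mapping).sum()

# A function called on each index label (here: its first letter).
by_func = df["score"].groupby(lambda name: name[0]).sum()

print(by_array, by_column, by_dict, by_func, sep="\n")
```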

For details, see: https://www.cnblogs.com/xiaoyuanqujing/articles/11646477.html

Reading data from Excel spreadsheets or CSV files

https://www.cnblogs.com/happymeng/p/10481293.html

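Excel files can be worked with directly in Jupyter Notebook for data analysis; the link above describes how.
You can also search cnblogs (博客園) for worked examples of handling Excel files with pandas.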
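
As a minimal sketch of reading tabular data, the example below parses CSV text held in memory so that it runs without any file on disk. The column names and values are hypothetical; with a real file you would pass its path to pd.read_csv, and pd.read_excel works similarly for spreadsheets but needs an Excel engine such as openpyxl installed.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real workflow would read it from a file path.
csv_text = (
    "date,price,volume\n"
    "2019-11-25,10.5,100\n"
    "2019-11-26,11.0,150\n"
)

# parse_dates converts the named column to datetime values while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.dtypes)
print(df["price"].mean())
```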

matplotlib

This module is used for plotting: it can draw data as many kinds of charts, including three-dimensional plots.
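
A minimal line-chart sketch, assuming matplotlib is installed. The price data is hypothetical, and the Agg backend is selected only so that the example runs without a display; in Jupyter Notebook you would normally skip that line and call plt.show() instead of saving.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one price per day.
prices = pd.Series([10.5, 11.0, 10.8, 11.4], index=["Mon", "Tue", "Wed", "Thu"])

fig, ax = plt.subplots()
ax.plot(list(prices.index), prices.tolist(), marker="o")
ax.set_title("Hypothetical prices")
ax.set_xlabel("Day")
ax.set_ylabel("Price")

# Render the figure to PNG bytes in memory instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

print(len(buffer.getvalue()), "bytes of PNG data")
```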

However, we usually plot with ECharts, whose charts look more polished:

https://www.echartsjs.com/examples/zh/index.html#chart-type-globe

There is also Highcharts:

https://www.highcharts.com.cn/demo/highmaps

Charts produced by these two libraries look more polished and are simple to create.
