Common pandas statements


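pandas is a powerful library: it supports SQL-like insert, delete, query, and update operations on data, ships with a rich set of data-processing functions, supports time series analysis, and handles missing data flexibly.
pandas has two basic data structures, Series and DataFrame.
A Series is a sequence, similar to a one-dimensional array.
A DataFrame is a two-dimensional table, similar to a two-dimensional array; each of its columns is a Series.
To locate elements in a Series, pandas provides the Index object. Every Series carries an Index that labels its elements. Index values need not be numbers; they can also be letters, Chinese text, and so on. The Index plays a role similar to a primary key in SQL, although pandas does not require its values to be unique.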
 
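A DataFrame is effectively a combination of several Series that share the same Index (in essence, a container of Series). Each Series has its own column header, which identifies it.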
>>> import pandas as pd 
>>> s=pd.Series([1,2,3],index=['a','b','c'])
>>> s
a    1
b    2
c    3
>>> d=pd.DataFrame([[1,2,3],[4,5,6]],columns=['a','b','c'])
>>> d.head()
   a  b  c
0  1  2  3
1  4  5  6
>>> d.describe()
             a        b        c
count  2.00000  2.00000  2.00000
mean   2.50000  3.50000  4.50000
std    2.12132  2.12132  2.12132
min    1.00000  2.00000  3.00000
25%    1.75000  2.75000  3.75000
50%    2.50000  3.50000  4.50000
75%    3.25000  4.25000  5.25000
max    4.00000  5.00000  6.00000
>>> pd.read_excel(r'C:\Users\someone\Desktop\data.xlsx','Sheet1')
               id       int  no    4       5         6   7    8
0       elec_code   varchar  no   50    電子表碼   varchar  no  100
1         user_id   varchar  no   50    用戶編號   varchar  no  100
2       user_name   varchar  no   50    用戶名稱   varchar  no  100
 
Writing to Excel
with pd.ExcelWriter('shanghai_%d.xlsx'%iii) as writer:
    for i,j in dddit:
        j.to_excel(writer,sheet_name=str(i))
# each j is a DataFrame; str(i) becomes the name of its sheet
 
Locating elements in a DataFrame
In [14]: df.head()
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
 
         
dates=['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06']


df[0:3]    # [] with a positional slice selects rows (axis=0); the range is half-open, so the end is excluded
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [25]: df['20130102':'20130104']    # a label slice includes both endpoints
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
df['A']    # selects a single column, equivalent to df.A; a single column is a Series
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
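The three forms of [] above can be tried on a small frame. This is a minimal sketch: the data is made up for readability, and the index is built with pd.date_range, which is an assumption about how the original df was created.

```python
import pandas as pd

# Hypothetical data: six daily rows with easy-to-read values
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=dates)

by_position = df[0:3]                  # rows at positions 0, 1, 2 (end excluded)
by_label = df["20130102":"20130104"]   # rows for Jan 2 to Jan 4 (both ends included)
col = df["A"]                          # a single column comes back as a Series

print(len(by_position), len(by_label), type(col).__name__)
```

Both slices return three rows, but for different reasons: the positional slice stops before position 3, while the label slice keeps its right endpoint.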


df.loc[] locates data by index value (label)
 
        
df.loc[dates[0]]
df.loc[:,['A','B']]
df.loc['20130102':'20130104',['A','B']]
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
df.loc['20130102',['A','B']]
df.loc[dates[0],'A'] Out[30]: 0.4691122
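A short sketch of the same .loc calls on made-up data follows. The frame and its values are hypothetical, and the date index again assumes pd.date_range.

```python
import pandas as pd

# Hypothetical frame with a daily DatetimeIndex
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(
    {"A": [0, 1, 2, 3, 4, 5], "B": [10, 11, 12, 13, 14, 15], "C": [20, 21, 22, 23, 24, 25]},
    index=dates,
)

row = df.loc[dates[0]]                             # one row, returned as a Series
cols = df.loc[:, ["A", "B"]]                       # every row, two columns
block = df.loc["20130102":"20130104", ["A", "B"]]  # label slice, endpoints included
cell = df.loc[dates[0], "A"]                       # a single scalar value

print(row["B"], cols.shape, block.shape, cell)
```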

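df.iloc[] locates data by integer position rather than by index value
With one argument i, it returns the row at position i. With two arguments (i, j), the first selects row positions and the second selects column positions; the result is the subset where they intersect
The colon : works as it does for lists: the range is half-open, so the end is excluded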
d.iloc[1:2,[1,2]]   
df.iloc[3:5,0:2]
df.iloc[1:3,:]
df.iloc[1,1]
df.iloc[[1,2,4],[0,2]] returns the subset at row positions 1, 2, 4 and column positions 0, 2
df.iloc[3] returns the row at position 3
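To make the positions easy to follow, the sketch below uses a hypothetical frame in which the cell at row r and column c holds 10*r + c, so each result can be read off directly.

```python
import pandas as pd

# Hypothetical frame: the value at row r, column c is 10*r + c
df = pd.DataFrame([[r * 10 + c for c in range(4)] for r in range(6)], columns=list("ABCD"))

one_row = df.iloc[3]                # row at position 3, as a Series
one_cell = df.iloc[1, 1]            # row position 1, column position 1
window = df.iloc[3:5, 0:2]          # rows 3 and 4, columns 0 and 1 (ends excluded)
picked = df.iloc[[1, 2, 4], [0, 2]] # rows 1, 2, 4 and columns 0, 2

print(one_row.tolist(), one_cell)
print(window.values.tolist())
print(picked.values.tolist())
```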



d.index returns the index
d.dtypes returns the name and dtype of each column
 
        
Filling missing values
d=d.fillna('_') replaces every NA with '_'
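A minimal sketch of fillna on a hypothetical two-column frame of strings. It also shows why the statement above reassigns d: fillna returns a new object and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical frame with two missing cells
d = pd.DataFrame({"name": ["a", None, "c"], "city": ["x", "y", None]})

filled = d.fillna("_")              # new DataFrame with every NA replaced by "_"

missing_before = int(d.isna().sum().sum())       # the original still has its NAs
missing_after = int(filled.isna().sum().sum())

print(missing_before, missing_after)
print(filled.values.tolist())
```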

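Sorting

Sort by index; by default this is the row index (axis=0), in ascending order

df.sort_index(axis=0,ascending=True)

Sort by value

df.sort_values(by,axis=0,ascending=True)

by can be a single column label or a list of column labels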
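The two sorting calls can be compared on a small example. The frame below is hypothetical, with a deliberately shuffled index so the effect of sort_index is visible.

```python
import pandas as pd

# Hypothetical frame with an out-of-order index
df = pd.DataFrame(
    {"city": ["b", "a", "b", "a"], "sales": [3, 1, 2, 4]},
    index=[2, 0, 3, 1],
)

by_index = df.sort_index(axis=0, ascending=True)        # order rows by index label
by_value = df.sort_values(by="sales", ascending=False)  # order rows by one column
by_two = df.sort_values(by=["city", "sales"])           # city first, then sales

print(by_index.index.tolist())
print(by_value["sales"].tolist())
print(by_two["sales"].tolist())
```

When by is a list, later columns only break ties left by earlier ones.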

 
Merging DataFrames
 
merge works like a SQL join between two tables
pd1=pd.DataFrame(list1,columns=['userid',])
pd2=pd.DataFrame(list2,columns=['r','userid2','filialename','username','useraddress',])
pd3=pd.merge(pd1,pd2,how='left',left_on='userid',right_on='userid2')
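how is the join type: 'left', 'right', 'inner', or 'outer'
The userid column on the left and the userid2 column on the right are used as the join keys, i.e. userid = userid2
To add a column whose values depend on the values of an existing column, put the mapping in a second DataFrame and merge the two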
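The difference between join types is easiest to see with keys that only partly overlap. The sketch below reuses the names pd1, pd2, userid, userid2, and username from the text; the values are made up.

```python
import pandas as pd

# Hypothetical data: userid 3 has no match on the right, userid2 4 has none on the left
pd1 = pd.DataFrame({"userid": [1, 2, 3]})
pd2 = pd.DataFrame({"userid2": [1, 2, 4], "username": ["ann", "bob", "dan"]})

left = pd.merge(pd1, pd2, how="left", left_on="userid", right_on="userid2")
inner = pd.merge(pd1, pd2, how="inner", left_on="userid", right_on="userid2")

print(len(left), int(left["username"].isna().sum()))
print(inner["userid"].tolist())
```

The left join keeps all three rows of pd1 and leaves username missing for userid 3; the inner join keeps only the keys present on both sides.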
 
concat concatenates DataFrames directly
dfs=[pd1,pd2,pd3]
datas=pd.concat(dfs,axis=1)
With axis=1 the frames are joined side by side, so datas.columns is ['userid','r','userid2','filialename','username','useraddress','userid','r','userid2','filialename','username','useraddress']
With axis=0 the frames are stacked vertically, like UNION ALL
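A minimal sketch of both directions, using two hypothetical single-column frames. ignore_index=True is an optional extra, not used in the text above, that renumbers the rows of the stacked result.

```python
import pandas as pd

# Two hypothetical frames with the same column name
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

stacked = pd.concat([a, b], axis=0, ignore_index=True)  # rows of b appended below a
side = pd.concat([a, b], axis=1)                        # columns of b placed next to a

print(stacked["x"].tolist())
print(list(side.columns), side.shape)
```

As with the datas.columns example, joining along axis=1 keeps duplicate column names rather than renaming them.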

 

 
Counting value frequencies
s=datas['filialename']
s.value_counts()
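For illustration, here is value_counts on a hypothetical Series of region names. The result is a Series indexed by the distinct values and sorted by count in descending order.

```python
import pandas as pd

# Hypothetical column of branch regions
s = pd.Series(["east", "west", "east", "north", "east", "west"])

counts = s.value_counts()   # distinct value -> number of occurrences

print(counts.to_dict())
print(counts.index[0])      # the most frequent value comes first
```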

 

groupby: grouping

Works like SQL GROUP BY; pass a list to group by several fields
grouped=data.groupby('Fmfiliale')
Group by several fields, then take the mean
grouped2=df.groupby(['nameid','site','recordat']).mean()
#rows are grouped by default (axis=0)
print(grouped.groups)
#Result:
#{'118.190.41.176:water': Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
#            17, 18, 19, 20, 21, 22, 23, 24],
#            dtype='int64'),
# '120.237.48.43:water': Int64Index([25], dtype='int64'),
# '222.245.76.42:water': Int64Index([26, 27, 28, 29], dtype='int64')}
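for name,group in grouped:
    print(name,group)
Iterating yields (name, DataFrame) pairs; in the example above, name is the group's Fmfiliale value.
Rows whose group key is missing (NA) are excluded by groupby automatically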
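The iteration and the NA behaviour described above can be checked with a small sketch. The column name Fmfiliale comes from the text; the values and the value column are hypothetical.

```python
import pandas as pd

# Hypothetical data: the last row has a missing group key
data = pd.DataFrame(
    {"Fmfiliale": ["a", "a", "b", None], "value": [1.0, 3.0, 5.0, 7.0]}
)

grouped = data.groupby("Fmfiliale")

names = []
for name, group in grouped:      # each item is (key value, sub-DataFrame)
    names.append(name)
    print(name, len(group))

means = grouped["value"].mean()  # one mean per group
print(means.to_dict())
```

The row with the missing key appears in neither group, so the value 7.0 does not contribute to any mean.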

df2=grouped2.reset_index()    # turn the group keys in the index back into ordinary columns

list_data=df2.values.tolist()    # convert the DataFrame to a list of rows

df_data['recordat'].apply(lambda x:x.strftime('%Y-%m-%d'))    # apply a function to every value in one column
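The three statements above chain naturally. The following sketch runs them end to end on hypothetical data; the column names site and recordat come from the text, while value is made up. Note that apply returns a new Series, so the result has to be assigned back to take effect.

```python
import pandas as pd

# Hypothetical readings: two on the same site and day, one elsewhere
df = pd.DataFrame(
    {
        "site": ["s1", "s1", "s2"],
        "recordat": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
        "value": [1.0, 3.0, 5.0],
    }
)

grouped2 = df.groupby(["site", "recordat"]).mean()  # group keys end up in the index
df2 = grouped2.reset_index()                        # keys become columns again

# Format each timestamp as text; assign the result back to keep it
df2["recordat"] = df2["recordat"].apply(lambda x: x.strftime("%Y-%m-%d"))

list_data = df2.values.tolist()                     # one inner list per row
print(list_data)
```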

 

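read_excel parameters in detail

pandas.read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, engine=None, **kwds)

This is the signature from an older pandas release; later releases renamed sheetname to sheet_name, parse_cols to usecols, and skip_footer to skipfooter.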

Read an Excel table into a pandas DataFrame

Parameters:

io : string, file-like object, pandas ExcelFile, or xlrd workbook.

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

sheetname : string, int, mixed list of strings/ints, or None, default 0  Which worksheet(s) to read; integer positions count from 0

Strings are used for sheet names, Integers are used in zero-indexed sheet positions.

Lists of strings/integers are used to request multiple sheets.

Specify None to get all sheets.

str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.

Available Cases

  • Defaults to 0 -> 1st sheet as a DataFrame
  • 1 -> 2nd sheet as a DataFrame
  • "Sheet1" -> 1st sheet as a DataFrame
  • [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
  • None -> All sheets as a dictionary of DataFrames

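header : int, list of ints, default 0  Row to use as the header, counting from 0. Counting restarts after the skipped rows: with skiprows=2 and header=2, the first two rows are dropped and the row at index 2 of what remains, i.e. the row at index 4 of the Excel sheet (counting from 0), becomes the header, that is, the columns of the DataFrame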

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex

skiprows : list-like  Rows to skip from the top; the default None is equivalent to 0. With skiprows=2, reading starts at the row whose index (counting from 0) is 2, inclusive

Rows to skip at the beginning (0-indexed)

skip_footer : int, default 0  Number of rows to skip at the end. With 2, the last two rows are skipped and the third-from-last row becomes the final row read

Rows at the end to skip (0-indexed)

index_col : int, list of ints, default None  Column to use as the index, counting from 0

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex

converters : dict, default None  Maps a column name to a function; the function is applied to each value in that column and its result is kept

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

parse_cols : int or list, default None  Which columns to parse; 'A:E' parses columns A through E, inclusive

  • If None then parse all columns,
  • If int then indicates last column to be parsed
  • If list of ints then indicates list of column numbers to be parsed
  • If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)

na_values : list-like, default None  A list of values; any cell matching one of them is read as NA

List of additional strings to recognize as NA/NaN

thousands : str, default None

Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to

verbose : boolean, default False

Indicate number of NA values placed in non-numeric columns

engine: string, default None

If io is not a buffer or path, this must be set to identify io. Acceptable values are None or xlrd

convert_float : boolean, default True

convert integral floats to int (i.e., 1.0 –> 1). If False, all numeric data will be read in as floats: Excel stores all numbers as floats internally

has_index_names : boolean, default None

DEPRECATED: for version 0.17+ index names will be automatically inferred based on index_col. To read Excel output from 0.16.2 and prior that had saved index names, use True.

Returns:

parsed : DataFrame or Dict of DataFrames

DataFrame from the passed in Excel file. See notes in sheetname argument for more information on when a Dict of Dataframes is returned.
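The interaction between skiprows and header described in the parameter notes can be checked without a workbook. The sketch below swaps in pd.read_csv on an in-memory string, on the assumption that its header and skiprows parameters count rows the same way as those of read_excel; the file contents are made up.

```python
import io
import pandas as pd

# Hypothetical file: rows 0-3 are filler, row 4 holds the real column names
text = (
    "skip,0\n"
    "skip,1\n"
    "note,2\n"
    "note,3\n"
    "id,name\n"
    "1,ann\n"
    "2,bob\n"
)

# Drop the first two rows, then use index 2 of what remains (row 4 overall) as header
df = pd.read_csv(io.StringIO(text), skiprows=2, header=2)

print(list(df.columns))
print(df.values.tolist())
```

Rows between the skipped block and the header row are discarded as well, so only the two data rows after the header are returned.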

