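1. df.loc[[index], [column]]: select data by label
loc takes two selectors, each a single label, a list, or a slice, separated by ",". The first selects rows and the second selects columns.
(1) Get the data of a specified column
df.loc[:, 'reviews'] Note: the first argument ":" means all rows; the second argument is the column name, here used to get the data of the reviews column.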
    import pandas as pd

    df = pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv', header=None, nrows=5)
    # Assign custom column names after reading the data
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviews', 'aspectflag', 'review_fenci', 'review_pos', 'review_fenci_pos']
    df.columns = columns_name
    print(df.head(3))  # print the first 3 rows
    print(df.loc[:, 'reviews'].head(3))
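As a self-contained illustration of the same selection, the sketch below replaces the CSV file with a small in-memory DataFrame. The row values are hypothetical; only the column names follow the article. It shows that a single column label returns a Series, while a one-element list keeps the result a DataFrame.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the article's CSV layout.
df = pd.DataFrame({
    'mysql_id': [201, 202, 203],
    'customername': ['guest_a', 'guest_b', 'guest_c'],
    'reviews': ['clean room', 'noisy street', 'friendly staff'],
})

# ':' selects every row; a single column label returns a Series.
reviews = df.loc[:, 'reviews']
print(type(reviews).__name__)   # Series

# Wrapping the label in a list keeps the result two-dimensional.
reviews_frame = df.loc[:, ['reviews']]
print(reviews_frame.shape)      # (3, 1)
```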
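(2) Select specified rows and columns
df.loc[[0, 2], ['customername', 'reviews', 'review_fenci']] Parameter notes: the list [0, 2] has two elements and selects the rows labeled 0 and 2; the list ['customername', 'reviews', 'review_fenci'] has three elements and selects the three columns named 'customername', 'reviews', and 'review_fenci'.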
    import pandas as pd

    df = pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv', header=None, nrows=5)
    # Assign custom column names after reading the data
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviews', 'aspectflag', 'review_fenci', 'review_pos', 'review_fenci_pos']
    df.columns = columns_name
    print(df.head(3))  # print the first 3 rows
    print(df.loc[[0, 2], ['customername', 'reviews', 'review_fenci']])
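Because loc works on labels rather than positions, the difference only becomes visible when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical DataFrame with index labels 10, 20, 30 to make that point, and also shows that label slices include both endpoints.

```python
import pandas as pd

# Hypothetical data with a non-default index so labels differ from positions.
df = pd.DataFrame(
    {
        'customername': ['guest_a', 'guest_b', 'guest_c'],
        'reviews': ['clean room', 'noisy street', 'friendly staff'],
        'review_fenci': ['clean / room', 'noisy / street', 'friendly / staff'],
    },
    index=[10, 20, 30],
)

# Lists of row labels and column labels select exactly those rows and columns.
subset = df.loc[[10, 30], ['customername', 'reviews']]
print(subset)

# Label slices include BOTH endpoints, unlike ordinary Python slices.
sliced = df.loc[10:20, 'customername':'reviews']
print(sliced.shape)
```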
2. df.iloc[[index], [column]]: select data by position
(1) Selecting one column returns it as a Series
(2) Selecting two or more columns returns them as a DataFrame
    import pandas as pd

    df = pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv', header=None, nrows=5)
    # Assign custom column names after reading the data
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviews', 'aspectflag', 'review_fenci', 'review_pos', 'review_fenci_pos']
    df.columns = columns_name
    print(df.head(3))  # print the first 3 rows
    # Rows at positions 0 and 2, columns at positions 1 and 2
    print(df.iloc[[0, 2], [1, 2]])
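Points (1) and (2) above can be checked with a short sketch. It assumes a hypothetical in-memory DataFrame in place of the CSV file, and also shows that positional slices exclude the end position.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the article's CSV layout.
df = pd.DataFrame({
    'mysql_id': [201, 202, 203],
    'hotelname': ['hotel_x', 'hotel_x', 'hotel_y'],
    'customername': ['guest_a', 'guest_b', 'guest_c'],
})

# (1) A single integer position for the column returns a Series.
one_col = df.iloc[:, 1]

# (2) A list of positions returns a DataFrame, even with only two columns.
two_cols = df.iloc[:, [1, 2]]

# Rows at positions 0 and 2, columns at positions 1 and 2.
picked = df.iloc[[0, 2], [1, 2]]

# Positional slices exclude the end position, like ordinary Python slices.
first_two = df.iloc[0:2]

print(type(one_col).__name__, type(two_cols).__name__)
```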
3. df[['column_name_1', 'column_name_2']]: select columns by name
    import pandas as pd

    df = pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv', header=None, nrows=5)
    # Assign custom column names after reading the data
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviews', 'aspectflag', 'review_fenci', 'review_pos', 'review_fenci_pos']
    df.columns = columns_name
    print(df.head(3))  # print the first 3 rows
    print(df[['customername', 'reviews']])
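The double brackets matter: the outer pair is the indexing operator and the inner pair is a Python list of column names. The sketch below uses hypothetical in-memory data to contrast single and double brackets, and shows that the result follows the order of the list, not the order in the original table.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the article's CSV layout.
df = pd.DataFrame({
    'mysql_id': [201, 202],
    'customername': ['guest_a', 'guest_b'],
    'reviews': ['clean room', 'noisy street'],
})

# Single brackets with one name return a Series.
single = df['reviews']

# Double brackets (a list of names) return a DataFrame.
multi = df[['customername', 'reviews']]

# The columns come back in the order given in the list.
reordered = df[['reviews', 'mysql_id']]
print(reordered.columns.tolist())
```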
4. Filter data by a combination of conditions on several columns
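    import pandas as pd

    df = pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv', header=None, nrows=5)
    # Assign custom column names after reading the data.
    # The other examples give this file 8 columns, so 8 names are needed here as well;
    # a shorter list raises a length-mismatch ValueError.
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviews', 'aspectflag', 'review_fenci', 'review_pos', 'review_fenci_pos']
    df.columns = columns_name
    print(df.head(5))  # print the first 5 rows
    # Each condition goes in its own parentheses; & combines them element-wise
    print(df[(df['mysql_id'] == 201) & (df['aspectflag'] == 0.0) & (df['review_pos'] == 3)])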
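The combined filter relies on element-wise boolean operators. A minimal sketch with hypothetical values shows & (and), | (or), and ~ (not). The parentheses around each comparison are required because these operators bind more tightly than ==.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the article's CSV layout.
df = pd.DataFrame({
    'mysql_id': [201, 201, 202, 203],
    'aspectflag': [0.0, 1.0, 0.0, 1.0],
    'reviews': ['clean room', 'noisy street', 'friendly staff', 'late checkout'],
})

# AND: both conditions must hold for a row to be kept.
both = df[(df['mysql_id'] == 201) & (df['aspectflag'] == 0.0)]

# OR: at least one condition must hold.
either = df[(df['mysql_id'] == 203) | (df['aspectflag'] == 1.0)]

# NOT: invert a condition.
negated = df[~(df['aspectflag'] == 1.0)]

print(both.index.tolist(), either.index.tolist(), negated.index.tolist())
```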
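5. Filter rows by a condition on one column (for example, a value greater than n) and fill the missing values of another column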
    import pandas as pd

    df = pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_15256.csv', header=None, nrows=5)
    # Assign custom column names after reading the data
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviews', 'aspectflag', 'review_fenci', 'review_pos', 'review_fenci_pos']
    df.columns = columns_name
    print(df.head(3))  # print the first 3 rows
    # Keep the rows where aspectflag == 1.0; copy() makes df1 independent of df
    df1 = df[df['aspectflag'] == 1.0].copy()
    # Fill the missing values in review_pos
    df1['review_pos'] = df1['review_pos'].fillna('n/adj')
    print(df1.head(3))
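The section title mentions values greater than n, while the code above filters on equality. A sketch of the greater-than variant follows; the sample data and the threshold n = 0 are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample rows; None marks a missing review_pos value.
df = pd.DataFrame({
    'aspectflag': [0.0, 1.0, 1.0, 2.0],
    'review_pos': ['n', None, 'adj', None],
})

n = 0  # hypothetical threshold

# Keep rows whose aspectflag is greater than n; copy() detaches the result from df.
df1 = df[df['aspectflag'] > n].copy()

# Fill the missing values in the other column.
df1['review_pos'] = df1['review_pos'].fillna('n/adj')

print(df1)
```

Because df1 is a copy, the missing values in the original df are left untouched.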
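Note:
    df1 = df[df['aspectflag'] == 1.0].copy()
Chained assignment is the combination of chained indexing and assignment.
Typical example:
    data[data.bidder == 'parakeet2004']['bidderrate'] = 100
Here, data[data.bidder == 'parakeet2004'] filters the rows of the table whose bidder column equals parakeet2004, and ['bidderrate'] then selects that column from the filtered result.
Code written this way triggers a warning, because the assignment may land on a temporary copy instead of on data itself:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Solution: split the statement into two parts, and call copy() in the first part to create an independent copy before assigning to it.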
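Two ways of avoiding the chained assignment can be sketched as follows, using a hypothetical table with the bidder and bidderrate columns from the example. The first follows the suggestion in the warning text and modifies the original table with a single .loc call; the second follows the copy() approach and modifies only the copy.

```python
import pandas as pd

# Hypothetical data with the two columns used in the example.
data = pd.DataFrame({
    'bidder': ['parakeet2004', 'other_bidder', 'parakeet2004'],
    'bidderrate': [5, 7, 3],
})

# Option 1: one .loc call selects the rows and the column together,
# so the assignment is applied to `data` itself.
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100

# Option 2: take an explicit copy first, then assign to the copy.
# `data` is not changed by this assignment.
subset = data[data.bidder == 'other_bidder'].copy()
subset['bidderrate'] = 0

print(data)
print(subset)
```

Which option fits depends on whether the original table is meant to change: .loc when it is, copy() when the filtered result is a separate working table.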
6. Add a new row to a DataFrame
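    import pandas as pd

    # Assumes df, pos and count have already been defined
    # Create an empty dictionary
    pos_dict = {}
    # Add new key/value pairs to the dictionary
    pos_dict['pos'] = pos
    pos_dict['count'] = count
    # print(pos_dict)
    # Add a new row to the DataFrame. DataFrame.append was removed in pandas 2.0,
    # so the row is wrapped in a one-row DataFrame and concatenated instead.
    df = pd.concat([df, pd.DataFrame([pos_dict])], ignore_index=True)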
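A runnable version of the same idea, under the assumption that df already has pos and count columns. The starting rows and the values of pos and count are hypothetical. The sketch uses pd.concat, which works in current pandas, and also shows row assignment through loc as an alternative for a default integer index.

```python
import pandas as pd

# Hypothetical starting table and new values.
df = pd.DataFrame({'pos': ['n', 'adj'], 'count': [12, 7]})
pos, count = 'v', 5

# Build the new row as a dictionary keyed by column name.
pos_dict = {'pos': pos, 'count': count}

# Wrap the dictionary in a one-row DataFrame and concatenate;
# ignore_index=True renumbers the rows 0, 1, 2, ...
df = pd.concat([df, pd.DataFrame([pos_dict])], ignore_index=True)

# Alternative for a default integer index: assign to the next unused label.
df.loc[len(df)] = ['adv', 3]

print(df)
```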
7. Select multiple columns from a DataFrame and insert a column at a specified position
    import pandas as pd

    # Read the first rows of the CSV file (nrows=10 here) so they can be saved as another file
    df = pd.read_csv('../csvfiles/hotelreviews_fenci_pos.csv', header=None, nrows=10)
    columns_name = ['mysql_id', 'hotelname', 'customername', 'reviewtime', 'checktime', 'reviews', 'scores', 'type', 'room', 'useful', 'likenumber', 'review_split', 'review_pos', 'review_split_pos']
    df.columns = columns_name
    # Take the specified columns from the DataFrame
    df1 = pd.DataFrame(df, columns=['mysql_id', 'hotelname', 'customername', 'reviews', 'review_split'])
    col_name = df1.columns.tolist()
    # Insert a column named keywords right after the reviews column
    col_name.insert(col_name.index('reviews') + 1, 'keywords')
    # reindex adds the new column, filled with NaN
    df2 = df1.reindex(columns=col_name)
    # header=False and index=False: write neither column names nor row labels
    df2.to_csv('../csvfiles/reviews_split_200_keywords.csv', header=False, index=False)
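The reindex approach can be compared with DataFrame.insert, which places a new column at a given position directly. The sketch below assumes a small hypothetical table in place of the CSV file and produces the same column order both ways.

```python
import pandas as pd

# Hypothetical table with a subset of the article's columns.
base = pd.DataFrame({
    'mysql_id': [1, 2],
    'reviews': ['clean room', 'noisy street'],
    'review_split': ['clean / room', 'noisy / street'],
})

# Approach from the article: build the new column order, then reindex.
col_name = base.columns.tolist()
col_name.insert(col_name.index('reviews') + 1, 'keywords')
via_reindex = base.reindex(columns=col_name)   # keywords is filled with NaN

# Alternative: DataFrame.insert(position, name, value) modifies the frame in place.
via_insert = base.copy()
position = via_insert.columns.get_loc('reviews') + 1
via_insert.insert(position, 'keywords', float('nan'))

print(via_reindex.columns.tolist())
```

reindex returns a new DataFrame, while insert changes the existing one, which is why the sketch works on a copy.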
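8. Read only specified rows
pd.read_csv(path, skiprows=number_of_rows_to_skip, nrows=number_of_rows_to_read)
For example, to read rows 10 to 20 from the middle of a file:
pd.read_csv(path, skiprows=9, nrows=10) skips the first 9 lines and then reads 10 data rows. With the default header setting, line 10 is consumed as the column header, so the data rows are lines 11 to 20; with header=None, the rows read are lines 10 to 19.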
    import pandas as pd

    def dev_csv():
        # Skip the first 10256 lines, then read the next 2683 rows
        df = pd.read_csv('../aspect_ner_csv_files/sentence_15000.csv', header=None, nrows=2683, skiprows=10256)
        columns_name = ['mysql_id', 'reviews']
        df.columns = columns_name
        review_csv_count_path = '../aspect_ner_csv_files/sentence_dev.csv'
        # header=False: do not write the column names into the CSV
        df.to_csv(review_csv_count_path, header=False, index=False)
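The interaction between skiprows, nrows, and header can be checked without a file on disk. The sketch below builds a hypothetical 30-line CSV in memory with io.StringIO, where line i starts with the number i, and reads it both with and without header=None.

```python
import io
import pandas as pd

# Hypothetical CSV text: 30 lines, line i is "i,review i".
csv_text = "\n".join(f"{i},review {i}" for i in range(1, 31))

# header=None: every line after the skipped ones is data, so lines 10..19 are read.
no_header = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['mysql_id', 'reviews'],
    skiprows=9,
    nrows=10,
)

# Default header: line 10 becomes the column header, lines 11..20 are the data.
with_header = pd.read_csv(io.StringIO(csv_text), skiprows=9, nrows=10)

print(no_header['mysql_id'].tolist())
print(with_header.columns.tolist())
```

To get lines 10 to 20 inclusive as data with header=None, nrows would be 11 rather than 10.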
References: https://blog.csdn.net/destiny_python/article/details/78675036
https://blog.csdn.net/weixin_42575020/article/details/98846427