Reading:
import csv
with open('enrollments.csv', 'rb') as f: reader = csv.reader(f)
print reader
out:<_csv.reader object at 0x00000000063DAF48>
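The reader function takes an iterable (such as an open CSV file) and returns an iterator that parses the CSV content one row at a time.
For example, either version of the code below reads the whole CSV file, row by row:
import csv
with open('enrollments.csv', 'rb') as f:
    reader = csv.reader(f)
    enrollments = list(reader)
# or, with a list comprehension instead of list():
import csv
with open('enrollments.csv', 'rb') as f:
    reader = csv.reader(f)
    enrollments = [row for row in reader]
print enrollments
# both versions return a list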
out:
[['account_key', 'status', 'join_date', 'cancel_date', 'days_to_cancel', 'is_udacity', 'is_canceled'],
['448', 'canceled', '2014-11-10', '2015-01-14', '65', 'True', 'True'],
['448', 'canceled', '2014-11-05', '2014-11-10', '5', 'True', 'True'],
['448', 'canceled', '2015-01-27', '2015-01-27', '0', 'True', 'True'],
[……]]
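The code in this post is written for Python 2 (binary 'rb' mode, the print statement). As a rough sketch of the same read under Python 3, the file is opened in text mode with newline='' instead. The sample rows are copied from the output above; the temporary file name and the read_all_rows helper are made up for illustration.

```python
import csv
import os
import tempfile

# Stand-in data: the header and the first two rows from the output above.
SAMPLE = (
    "account_key,status,join_date,cancel_date,days_to_cancel,is_udacity,is_canceled\n"
    "448,canceled,2014-11-10,2015-01-14,65,True,True\n"
    "448,canceled,2014-11-05,2014-11-10,5,True,True\n"
)

def read_all_rows(path):
    """Return every row of the CSV file at `path` as a list of lists of strings."""
    # Python 3: open in text mode and pass newline='' so the csv module
    # handles line endings itself.
    with open(path, newline='') as f:
        return list(csv.reader(f))

# Hypothetical temporary file so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'enrollments_sample.csv')
with open(sample_path, 'w', newline='') as f:
    f.write(SAMPLE)

enrollments = read_all_rows(sample_path)
print(enrollments[0])    # the header row
print(len(enrollments))  # header plus two data rows
```

Every value comes back as a string, including the numbers and the True/False flags.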
To extract a particular row, use the code below:
import csv
with open('enrollments.csv', 'rb') as csvenroll:
    reader = csv.reader(csvenroll)
    for i, rows in enumerate(reader):
        if i == 2:  # index 2 is the second data row, because index 0 is the header
            row = rows
print(row)
# returns a list
out:['448', 'canceled', '2014-11-05', '2014-11-10', '5', 'True', 'True']
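The loop above keeps reading to the end of the file even after it has found the row. One possible alternative, sketched here for Python 3 with an in-memory sample instead of enrollments.csv, uses itertools.islice to stop as soon as the row is reached. The nth_row helper is a made-up name.

```python
import csv
import io
from itertools import islice

def nth_row(lines, n):
    """Return row n (0-based, the header is row 0), or None if there are fewer rows."""
    reader = csv.reader(lines)
    # islice skips the first n rows lazily and stops right after row n.
    return next(islice(reader, n, n + 1), None)

# In-memory stand-in for the first three columns of the file.
sample = io.StringIO(
    "account_key,status,join_date\n"
    "448,canceled,2014-11-10\n"
    "448,canceled,2014-11-05\n"
)
print(nth_row(sample, 2))
```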
To extract a particular column, use the code below:
import csv
with open('enrollments.csv', 'rb') as csvenroll:
    reader = csv.reader(csvenroll)
    column = [row[2] for row in reader]  # read the third column (index 2)
print(column)
# returns a list
out:['join_date', '2014-11-10', '2014-11-05', '2015-01-27', '2014-11-10', '2015-03-10', '2015-01-14', '2015-01-27',……]
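Note that the column above starts with the header cell 'join_date'. If only the data is wanted, the header can be consumed first with next(). This is a small Python 3 sketch on an in-memory sample; column_values is a hypothetical helper.

```python
import csv
import io

def column_values(lines, index):
    """Return (header, values) for the column at 0-based position `index`."""
    reader = csv.reader(lines)
    header = next(reader)                     # consume the header row first
    values = [row[index] for row in reader]   # remaining rows are data only
    return header[index], values

# In-memory stand-in for the first three columns of the file.
sample = io.StringIO(
    "account_key,status,join_date\n"
    "448,canceled,2014-11-10\n"
    "448,canceled,2014-11-05\n"
)
name, dates = column_values(sample, 2)
print(name)
print(dates)
```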
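This approach is general, but you have to know the row or column number in advance. When you don't, the second approach, DictReader, is the better fit. Like reader, it takes an iterable and returns an iterator, but each row comes back as a dictionary: every cell is stored as a value, and its key is that cell's column header.
The code below shows what a DictReader object looks like: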
import csv
with open('enrollments.csv', 'rb') as f: reader = csv.DictReader(f)
print reader
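out:<unicodecsv.py2.DictReader instance at 0x0000000009AA07C8>
(The unicodecsv name here, and the u'' strings further down, suggest that this session had the unicodecsv package imported in place of csv. With the standard csv module the object is reported as csv.DictReader.)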
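Printing the object does not show much. A quick way to see the structure is to look at fieldnames and pull a single row, sketched here for Python 3 on a two-line in-memory sample rather than the real file.

```python
import csv
import io

# In-memory stand-in for the first three columns of the file.
sample = io.StringIO(
    "account_key,status,join_date\n"
    "448,canceled,2014-11-10\n"
)
reader = csv.DictReader(sample)
print(reader.fieldnames)   # column headers, taken from the first line
first = next(reader)       # one data row, keyed by column header
print(first['status'])
```

The header line is consumed when fieldnames is read, so the first next() call already returns a data row.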
Print all rows:
import csv
with open('enrollments.csv', 'rb') as f:
    reader = csv.DictReader(f)
    enrollments = list(reader)
# or, with a list comprehension:
import csv
with open('enrollments.csv', 'rb') as f:
    reader = csv.DictReader(f)
    enrollments = [row for row in reader]
# both return a list whose elements are dicts
out:[{u'account_key': u'448', u'cancel_date': u'2015-01-14', u'days_to_cancel': u'65', u'is_canceled': u'True', u'is_udacity': u'True', u'join_date': u'2014-11-10', u'status': u'canceled'},
{'account_key': '448', 'cancel_date': '2014-11-10', 'days_to_cancel': '5', 'is_canceled': 'True', 'is_udacity': 'True', 'join_date': '2014-11-05', 'status': 'canceled'}……]
import csv
with open('enrollments.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for line in reader:
        print line
# each line is a dict
out:
{'status': 'canceled', 'is_udacity': 'True', 'is_canceled': 'True', 'join_date': '2014-11-10', 'account_key': '448', 'cancel_date': '2015-01-14', 'days_to_cancel': '65'} {'status': 'canceled', 'is_udacity': 'True', 'is_canceled': 'True', 'join_date': '2014-11-05', 'account_key': '448', 'cancel_date': '2014-11-10', 'days_to_cancel': '5'}
……
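To extract a particular row:
As with reader, the row is picked by index, but the result differs: DictReader returns the row as a dict that maps each column header to its value.
import csv
with open('enrollments.csv', 'rb') as csvenroll:
    reader = csv.DictReader(csvenroll)
    for i, rows in enumerate(reader):
        if i == 0:  # the first data row (DictReader has already consumed the header)
            row = rows
print(row)
# returns a dict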
out:{'account_key': '448', 'cancel_date': '2015-01-14', 'days_to_cancel': '65', 'is_canceled': 'True', 'is_udacity': 'True', 'join_date': '2014-11-10', 'status': 'canceled'}
To have DictReader return only the rows that match a particular value, query by column header.
e.g. find all rows whose cancel_date is 2015-01-14:
import csv
import pprint
with open('enrollments.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for line in reader:
        if line['cancel_date'] == '2015-01-14':
            pprint.pprint(line)
# each line is a dict
{'account_key': '448', 'cancel_date': '2015-01-14', 'days_to_cancel': '65', 'is_canceled': 'True', 'is_udacity': 'True', 'join_date': '2014-11-10', 'status': 'canceled'} {'account_key': '60', 'cancel_date': '2015-01-14', 'days_to_cancel': '65', 'is_canceled': 'True', 'is_udacity': 'False', 'join_date': '2014-11-10', 'status': 'canceled'}{……}
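The same filter can be wrapped in a small reusable function. This is only a sketch for Python 3, run on an in-memory sample; rows_where is a made-up name. Since every value is read as a string, the value to match is passed as a string too.

```python
import csv
import io

def rows_where(lines, column, value):
    """Return the rows (as dicts) whose `column` equals the string `value`."""
    return [row for row in csv.DictReader(lines) if row[column] == value]

# In-memory stand-in with three of the columns.
sample = io.StringIO(
    "account_key,cancel_date,days_to_cancel\n"
    "448,2015-01-14,65\n"
    "448,2014-11-10,5\n"
    "60,2015-01-14,65\n"
)
matches = rows_where(sample, 'cancel_date', '2015-01-14')
print([row['account_key'] for row in matches])
```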
Read a column:
import csv
with open('enrollments.csv', 'rb') as f:
    reader = csv.DictReader(f)
    columns = [row['account_key'] for row in reader]  # select the column by its name; a column number does not work here
print(columns)
# returns a list
out:['448', '448', '448', '448', '448', '448', '448', '448', '448', '700', '429', '429', '60', '60'……]
3. Reading with the pandas module
import pandas as pd
data_df = pd.read_csv('enrollments.csv')
print data_df
# returns a DataFrame
out:
   account_key    status   join_date  cancel_date  days_to_cancel  \
0          448  canceled  2014-11-10   2015-01-14            65.0
1          448  canceled  2014-11-05   2014-11-10             5.0
2          448  canceled  2015-01-27   2015-01-27             0.0
3          448  canceled  2014-11-10   2014-11-10             0.0
4          448   current  2015-03-10          NaN             NaN
5          448  canceled  2015-01-14   2015-01-27            13.0
6          448  canceled  2015-01-27   2015-03-10            42.0
7          448  canceled  2015-01-27   2015-01-27             0.0
8          448  canceled  2015-01-27   2015-01-27             0.0
9          700  canceled  2014-11-10   2014-11-16             6.0

   is_udacity  is_canceled
0        True         True
1        True         True
2        True         True
3        True         True
4        True        False
5        True         True
6        True         True
7        True         True
8        True         True
9       False         True
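Read a row: use the loc indexer
data_df.loc[1]  # only the index label is needed, not the row's position
account_key              448
status              canceled
join_date         2014-11-05
cancel_date       2014-11-10
days_to_cancel             5
is_udacity              True
is_canceled             True
Read several rows:
data_df.loc[:2]  # loc slices by label and includes the end label, so this returns rows 0, 1 and 2
Read a column:
data_df['status']  # returns a Series
out:
0    canceled
1    canceled
2    canceled
3    canceled
4     current
5    canceled
6    canceled
7    canceled
8    canceled
9    canceled
Read the value at a given row and column: use the iloc indexer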
data_df.iloc[0,2]
out:'2014-11-10'
II. Excel format
1. Reading with the xlrd module
import xlrd
workbook = xlrd.open_workbook('enrollments.xls')
workbook
out:<xlrd.book.Book at 0xa8cbf98>
Print all data:
import xlrd
import pprint
# open the workbook
workbook = xlrd.open_workbook('enrollments.xls')
# select worksheet 2 (the second sheet in the workbook, index 1)
sheet = workbook.sheet_by_index(1)
# walk over every row and column and read all the data into a Python list of lists
data = [[sheet.cell_value(row, col) for col in range(sheet.ncols)]
        for row in range(sheet.nrows)]
pprint.pprint(data)
# returns a list
[[u'account_key',u'status',u'join_date',u'cancel_date',u'days_to_cancel',u'is_udacity',u'is_canceled'], [448.0, u'canceled', 41953.0, 42018.0, 65.0, 1, 1], [448.0, u'canceled', 41948.0, 41953.0, 5.0, 1, 1], [448.0, u'canceled', 42031.0, 42031.0, 0.0, 1, 1], [448.0, u'canceled', 41953.0, 41953.0, 0.0, 1, 1], [448.0, u'current', 42073.0, u'', u'', 1, 0], [448.0, u'canceled', 42018.0, 42031.0, 13.0, 1, 1], [448.0, u'canceled', 42031.0, 42073.0, 42.0, 1, 1], [448.0, u'canceled', 42031.0, 42031.0, 0.0, 1, 1], [448.0, u'canceled', 42031.0, 42031.0, 0.0, 1, 1], [700.0, u'canceled', 41953.0, 41959.0, 6.0, 0, 1], [429.0, u'canceled', 41953.0, 42073.0, 120.0, 0, 1]]
Number of rows/columns:
print sheet.nrows
print sheet.ncols
out:12
7
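Read the value at a given row and column:
# print the value at row index 3, column index 2 of the list built above
data[3][2]
# or
sheet.cell_value(3, 2)
Read a row:
sheet.row_values(1, start_colx=0, end_colx=7)  # the first data row (not counting the header); change start_colx/end_colx to get only a range of columns from that row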
out:[448.0, u'canceled', 41953.0, 42018.0, 65.0, 1, 1]
Read a column:
print sheet.col_values(2, start_rowx=0, end_rowx=7)  # the third column, row indexes 0 to 6: the header plus data rows 1-6
out:[u'join_date', 41953.0, 41948.0, 42031.0, 41953.0, 42073.0, 42018.0]
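The dates come back from xlrd as floats such as 41953.0 because Excel stores dates as serial day numbers. A minimal standard-library sketch (Python 3) for turning them back into dates follows. It assumes the workbook uses Excel's default 1900 date system, and excel_serial_to_date is a made-up helper name. xlrd also ships its own helper, xlrd.xldate_as_tuple, which takes the workbook's datemode into account.

```python
from datetime import date, timedelta

# Assumption: Excel's default 1900 date system. With this epoch the
# arithmetic is valid for serial numbers from 61 (1900-03-01) onward.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel date serial number (e.g. 41953.0) to a datetime.date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

print(excel_serial_to_date(41953.0))  # 2014-11-10
print(excel_serial_to_date(42018.0))  # 2015-01-14
```

These two values match the join_date and cancel_date of the first data row in the CSV version of the file.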
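2. Reading with the pandas module
import pandas as pd
workbook = pd.read_excel('enrollments.xls')  # reads the first sheet of the workbook by default
workbook
To read the second sheet:
import pandas as pd
workbook = pd.read_excel('enrollments.xls', sheet_name='Sheet2')  # pandas releases before 0.21 spell this keyword sheetname
workbook
Reading rows, columns and so on works the same way as before.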
III. XML format
Use the xml.etree.ElementTree module
import xml.etree.ElementTree as ET
import pprint
tree = ET.parse('exampleResearchArticle.xml')
root = tree.getroot()
print 'children of root'
# child elements
for child in root:
    print child.tag  # the tag attribute gives each child element's tag name
out:
children of root
ui
ji
fm
bdy
bm
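Get content from inside the root element:
print "Authors' email addresses are as below:"
for a in root.findall('./fm/bibl/aug/au'):  # findall returns every element that matches the XPath expression
    email = a.find('email')  # for each matched element, use find to locate its email child
    if email is not None:
        print email.text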
out:
Authors' email addresses are as below:
omer@extremegate.com
mcarmont@hotmail.com
laver17@gmail.com
nyska@internet-zahav.net
kammarh@gmail.com
gideon.mann.md@gmail.com
barns.nz@gmail.com
eukots@gmail.com
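To try findall and find without the article file, the same pattern can be run against a small inline document. The XML below is made up and only mirrors the fm/bibl/aug/au nesting used above. The sketch is written for Python 3, and author_emails is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up document that mirrors the fm/bibl/aug/au nesting used above.
SAMPLE_XML = """
<art>
  <fm>
    <bibl>
      <aug>
        <au><snm>Smith</snm><email>smith@example.com</email></au>
        <au><snm>Jones</snm></au>
        <au><snm>Lee</snm><email>lee@example.com</email></au>
      </aug>
    </bibl>
  </fm>
</art>
"""

def author_emails(root):
    """Collect the text of the <email> child of every fm/bibl/aug/au element."""
    emails = []
    for au in root.findall('./fm/bibl/aug/au'):
        email = au.find('email')
        if email is not None:   # an author without an <email> element is skipped
            emails.append(email.text)
    return emails

root = ET.fromstring(SAMPLE_XML)
print(author_emails(root))
```

The second author has no email element, which is the case the `is not None` check guards against.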
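IV. HTML format
Use the BeautifulSoup module
from bs4 import BeautifulSoup
with open('virgin_and_logan_airport.html') as f:
    soup = BeautifulSoup(f, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
data = []
carrierlist = soup.find(id='CarrierList')
for i in carrierlist.find_all('option'):  # unlike ElementTree's findall, BeautifulSoup spells it find_all
    data.append(i['value'])
print 'carrierlist:{}'.format(data)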
out:
carrierlist:['All', 'AllUS', 'AllForeign', 'AS', 'G4', 'AA', '5Y', 'DL', 'MQ', 'EV', 'F9', 'HA', 'B6', 'OO', 'WN', 'NK', 'UA', 'VX']
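If bs4 is not installed, roughly the same extraction can be sketched with the standard library's html.parser. This swaps in a different parser from the BeautifulSoup approach above, it is not a drop-in replacement. The sketch is for Python 3 and assumes CarrierList is the id of a select element. The sample HTML and the OptionValues class are made up for illustration.

```python
from html.parser import HTMLParser

class OptionValues(HTMLParser):
    """Collect the value attribute of every <option> inside <select id=select_id>."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.inside = False     # True while we are inside the wanted <select>
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('id') == self.select_id:
            self.inside = True
        elif tag == 'option' and self.inside and 'value' in attrs:
            self.values.append(attrs['value'])

    def handle_endtag(self, tag):
        if tag == 'select':
            self.inside = False

# Made-up page fragment with two drop-down lists.
SAMPLE_HTML = """
<select id="CarrierList">
  <option value="All">All</option>
  <option value="AA">American Airlines</option>
</select>
<select id="AirportList">
  <option value="BOS">Boston</option>
</select>
"""

parser = OptionValues('CarrierList')
parser.feed(SAMPLE_HTML)
print(parser.values)
```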
Writing:
1. pandas module: CSV
import pandas as pd
titanic_df = pd.read_csv('titanic_data.csv')
titanic_new = titanic_df.dropna(subset=['Age'])
titanic_new.to_csv('titanic_new.csv')  # save to the current directory
titanic_new.to_csv('C:/asavefile/titanic_new.csv')  # save to another directory
2. pandas module: Excel
Use the DataFrame's to_excel method, which is called the same way as to_csv above.
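3. Writing row by row with the csv module
1) Writing from a list
As seen earlier, reading a file with reader returns each row as a list.
import csv
# header row, usually the column names
fileHeader = ["name", "score"]
# suppose these are the two rows we want to write
d1 = ["Wang", "100"]
d2 = ["Li", "80"]
# write the data; 'wb' avoids blank lines between rows on Windows under Python 2
csvFile = open("C:/asavefile/instance.csv", "wb")
writer = csv.writer(csvFile)
# every row is passed to the writer as a list
# write one row at a time
writer.writerow(fileHeader)
writer.writerow(d1)
writer.writerow(d2)
csvFile.close()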
import csv
with open('test_writer1.csv', 'wb') as f:
    writer = csv.writer(f)
    # write the header first
    writer.writerow(['index', 'name', 'age', 'city'])
    # then write the content rows; a row can be a tuple or a list, both are written the same way
    writer.writerows([(0, 'sandra', 12, 'shanghai'),
                      [1, 'cheam', 13, 'beijing'],
                      [2, 'tom', 14, 'tianjin'],
                      [3, 'tina', 15, 'chongqing']])
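import csv
csvfile = open('C:/asavefile/test_writer2.csv', 'wb')  # the file could also be opened via the file object (file() in Python 2)
writer = csv.writer(csvfile)
data = [['name', 'age', 'telephone'],
        ('Tom', '25', '1234567'),
        ('Sandra', '18', '789456')]
# write the header and the content in one call
writer.writerows(data)
csvfile.close()
Once the principle is clear, all three versions do the same job; the last one is the simplest.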
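For Python 3, the file would be opened with open(path, 'w', newline='') rather than 'wb'. The sketch below writes into an in-memory buffer instead of a file so the result can be inspected directly. It also shows that tuple rows and list rows come out the same.

```python
import csv
import io

buffer = io.StringIO()   # stands in for open(path, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['index', 'name', 'age', 'city'])   # header row
writer.writerows([
    (0, 'sandra', 12, 'shanghai'),   # a tuple row
    [1, 'cheam', 13, 'beijing'],     # a list row
])
print(buffer.getvalue())
```

Numbers are converted to text on the way out, so 12 is written as the characters 12.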
2) Writing from a dict
Create a table of your own: the writer approach
import csv
dic = {'sandra': 123, 'he': 456, 'she': 789}
csvFile3 = open('C:/asavefile/csvFile3.csv', 'wb')
writer = csv.writer(csvFile3)
writer.writerow(['name', 'value'])
# one row per key/value pair
for key in dic:
    writer.writerow([key, dic[key]])
csvFile3.close()
Copy a table's entire content: the DictWriter approach
import csv
with open('C:/asavefile/enrollments.csv', 'rb') as f:  # first open the table to be copied
    reader = csv.DictReader(f)
    line = [row for row in reader]
    head = reader.fieldnames  # csv.reader has no fieldnames attribute
csvFile = open("C:/asavefile/enrollments_copy.csv", "wb")
# the header is passed in as a list; each element names one column
fileheader = head
dict_writer = csv.DictWriter(csvFile, fileheader)
# writing the rows right away would leave the file without column names,
# so write the column names (the header defined above) first.
# The header row can be written by hand like this:
dict_writer.writerow(dict(zip(fileheader, fileheader)))
# after that, write the dicts (column name: value) to the CSV file
dict_writer.writerows(line)
csvFile.close()
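DictWriter also has a writeheader() method (Python 2.7 and later) that writes the same header row as the dict(zip(...)) line. Here is a rough Python 3 sketch of the copy using it, on in-memory buffers instead of the real files; copy_table is a made-up helper.

```python
import csv
import io

def copy_table(source_lines, target):
    """Copy every row from a CSV source to `target`, keeping the header."""
    reader = csv.DictReader(source_lines)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()       # writes the column names as the first row
    writer.writerows(reader)   # stream the rows straight from reader to writer

# In-memory stand-ins for the input and output files.
source = io.StringIO("account_key,status\n448,canceled\n700,current\n")
target = io.StringIO()
copy_table(source, target)
print(target.getvalue())
```

Passing the reader straight to writerows avoids building the intermediate list, which helps with large files.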
Write the rows that satisfy a condition to a new table:
# pick out the rows with account_key == 448 and save them to a new csv
import csv
with open('C:/asavefile/enrollments_accout.csv', 'wb') as outfile:
    with open('enrollments.csv', 'rb') as f:
        reader = csv.DictReader(f)
        # get the header
        head = reader.fieldnames
        writer = csv.DictWriter(outfile, head)
        # write the header names
        writer.writerow(dict(zip(head, head)))
        # write the data one row at a time
        for line in reader:
            if line['account_key'] == '448':
                writer.writerow(line)