Note: this series of posts is my own write-up from working through the book Python for Data Analysis, organized so I can review the material later.
1 pandas parsing functions for reading files
read_csv reads delimited data; the default delimiter is the comma
read_table reads delimited data; the default delimiter is the tab ("\t")
read_fwf reads fixed-width column data (no delimiters)
read_clipboard reads data from the clipboard (handy for converting web pages to tables)
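As a quick sketch of the first two readers (the sample data below is made up), the different default delimiters can be seen by parsing in-memory text:

```python
import pandas as pd
from io import StringIO

csv_text = "a,b,c\n1,2,3\n4,5,6"
tsv_text = "a\tb\tc\n1\t2\t3"

df_csv = pd.read_csv(StringIO(csv_text))    # comma is the default delimiter
df_tsv = pd.read_table(StringIO(tsv_text))  # tab is the default delimiter
print(df_csv.shape)  # (2, 3)
print(df_tsv.shape)  # (1, 3)
```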
1.1 Reading Excel data
import pandas as pd

file = 'D:\\example.xls'
df = pd.read_excel(file)  # don't rebind the module name pd to the result
df

Output:
1.1.1 No header row
df = pd.read_excel(file, header=None)

Output:
1.1.2 Setting column names
df = pd.read_excel(file, names=['Year', 'Name', 'Math', 'Chinese', 'English', 'Avg'])

Output:
1.1.3 Specifying an index column
df = pd.read_excel(file, index_col='姓名')

Output:
2 Reading CSV data
import pandas as pd

df = pd.read_csv("d:\\test.csv", engine='python')
df

Output:

import pandas as pd

df = pd.read_table("d:\\test.csv", engine='python')
df

Output:

import pandas as pd

df = pd.read_fwf("d:\\test.csv")  # read_fwf always parses with the python engine, so no engine keyword is needed
df

Output:
3 Writing data out to text format
Write data out to CSV; the default delimiter is the comma
import pandas as pd

df = pd.read_fwf("d:\\test.csv")
df.to_csv("d:\\test1.csv", encoding='gbk')

Output:
4 Handling delimiter formats manually
For a file with a single-character delimiter, you can use Python's built-in csv module directly
import pandas as pd
import csv

file = 'D:\\test.csv'
df = pd.read_csv(file, engine='python')
df.to_csv("d:\\test1.csv", encoding='gbk', sep='/')

f = open("d:\\test1.csv")
reader = csv.reader(f)
for line in reader:
    print(line)
f.close()

Output:
4.1 Filling missing values
import pandas as pd
import csv

file = 'D:\\test.csv'
df = pd.read_csv(file, engine='python')
df.to_csv("d:\\test1.csv", encoding='gbk', sep='/', na_rep='NULL')  # write missing values as NULL

f = open("d:\\test1.csv")
reader = csv.reader(f)
for line in reader:
    print(line)
f.close()

Output:
4.2 JSON
4.2.1 json.loads converts a JSON string into Python objects
import json

obj = """{
  "sucess": "1",
  "header": {"version": 0, "compress": false, "times": 0},
  "data": {
    "name": "BankForQuotaTerrace",
    "attributes": {"queryfound": "1", "numfound": "1", "reffound": "1"},
    "columnmeta": {
      "a0": "DATE", "a1": "DOUBLE", "a2": "DOUBLE", "a3": "DOUBLE",
      "a4": "DOUBLE", "a5": "DOUBLE", "a6": "DATE", "a7": "DOUBLE",
      "a8": "DOUBLE", "a9": "DOUBLE", "b0": "DOUBLE", "b1": "DOUBLE",
      "b2": "DOUBLE", "b3": "DOUBLE", "b4": "DOUBLE", "b5": "DOUBLE"
    },
    "rows": [[
      "2017-10-28", 109.8408691012081, 109.85566362201733,
      0.014794520809225841, 1.0, null, "", 5.636678251676443,
      5.580869556115291, 37.846934105222246, null, null, null,
      null, null, 0.061309012867495856
    ]]
  }
}"""
result = json.loads(obj)
result

Output:
4.2.2 json.dumps converts a Python object back into a JSON string
result = json.loads(obj)
asjson=json.dumps(result)
asjson
Output:
4.2.3 Converting JSON data to a DataFrame
import json
from pandas import DataFrame

obj = """{
  "sucess": "1",
  "header": {"version": 0, "compress": false, "times": 0},
  "data": {
    "name": "BankForQuotaTerrace",
    "attributes": {"queryfound": "1", "numfound": "1", "reffound": "1"},
    "columnmeta": {
      "a0": "DATE", "a1": "DOUBLE", "a2": "DOUBLE", "a3": "DOUBLE",
      "a4": "DOUBLE", "a5": "DOUBLE", "a6": "DATE", "a7": "DOUBLE",
      "a8": "DOUBLE", "a9": "DOUBLE", "b0": "DOUBLE", "b1": "DOUBLE",
      "b2": "DOUBLE", "b3": "DOUBLE", "b4": "DOUBLE", "b5": "DOUBLE"
    },
    "rows": [[
      "2017-10-28", 109.8408691012081, 109.85566362201733,
      0.014794520809225841, 1.0, null, "", 5.636678251676443,
      5.580869556115291, 37.846934105222246, null, null, null,
      null, null, 0.061309012867495856
    ]]
  }
}"""
result = json.loads(obj)
jsondf = DataFrame(result['data'],
                   columns=['name', 'attributes', 'columnmeta'],
                   index=[1, 2, 3])  # use a list for the index, not a set: sets have no order
jsondf

Output:
Note: attributes and columnmeta are nested structures; I'll come back to that problem later.
4.3 XML and HTML
Scrape the list data from a 10jqka (同花順) page and convert it into a DataFrame.
I didn't bother scraping the paginated data here; feel free to try that yourself. My main goal was to try turning scraped data into a DataFrame.
The code:
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

url = 'http://data.10jqka.com.cn/market/longhu/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
response = requests.get(url=url, headers=headers)
html = response.content
soup = BeautifulSoup(html, 'lxml')
s = soup.find_all('div', 'yyb')

# Collect the column names the DataFrame needs
def getcol():
    col = []
    for i in s:
        for thead in i.find_all('thead'):
            for th in thead.find_all('th'):
                col.append(th.text.strip('\n'))
    return col

# Collect the row values the DataFrame needs
def getvalues():
    rows = []
    for j in s:
        for tbody in j.find_all('tbody'):
            for tr in tbody.find_all('tr'):
                tdlist = [td.text for td in tr.find_all('td')]
                rows.append(tdlist)
    return rows

if __name__ == "__main__":
    cols = getcol()
    values = getvalues()
    data = DataFrame(values, columns=cols)
    print(data)

Output:
4.4 Binary data formats
Older pandas saved objects in binary with the save method and read them back with load; in current versions these are to_pickle and read_pickle.
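A minimal round-trip sketch with the current to_pickle/read_pickle API (the file name frame.pkl is arbitrary):

```python
import os
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_pickle('frame.pkl')                # serialize the DataFrame to disk
restored = pd.read_pickle('frame.pkl')   # read it back unchanged
print(restored.equals(df))               # True
os.remove('frame.pkl')                   # clean up the temporary file
```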
4.5 The HDF5 format
HDF stands for hierarchical data format. An HDF5 file contains a filesystem-like node structure, supports multiple datasets and metadata, and can read and write data efficiently in chunks. Python has two HDF5 interfaces: PyTables and h5py.
This is worth considering for very large datasets; I have no use for it right now, so I'll leave it for later.
4.6 Using HTML and Web APIs
import requests
import pandas as pd
from pandas import DataFrame
import json
url = 'http://t.weather.sojson.com/api/weather/city/101030100'
resp = requests.get(url)
data = json.loads(resp.text)  # data is a dict
jsondf = DataFrame(data['cityInfo'], columns=['city', 'cityId', 'parent', 'updateTime'], index=[1])  # build the DataFrame
jsondf
Output:
4.7 Using databases
4.7.1 sqlite3
import sqlite3
import pandas as pd

con = sqlite3.connect('test.db')  # connect() needs a database path
pd.read_sql('select * from test', con)  # con is a connection object; the book's old read_frame is now read_sql
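A self-contained sketch against an in-memory database (the table test and its rows are invented for illustration):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')  # throwaway in-memory database
con.execute('create table test (name text, score real)')
con.executemany('insert into test values (?, ?)',
                [('Alice', 90.5), ('Bob', 82.0)])
con.commit()

df = pd.read_sql('select * from test', con)  # query results land in a DataFrame
print(df)
con.close()
```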
4.7.2 MongoDB
Not installed; setting this aside for now.