[Data Science] Extracting data from CSV and XLS files


Python has a rich set of libraries for extracting data from files. This post explains how to get the data you want out of CSV and XLS files.

Download the data file: http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip

 

This is a set of statistics on American baseball.
After unzipping the archive, we pick AwardsManagers.csv to practice on.

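#-*- coding:utf-8 -*-
import csv
DIR = 'data/'
fname = 'AwardsManagers.csv'
fpath = DIR+fname

## The `with open() as f` structure is elegant: the file is closed automatically,
## so there is no fileobj.close() to call and no try-finally to write for exception handling

## In Python 3 the csv module expects a text-mode file opened with newline=''
with open(fpath, newline='') as csvfile: 
    ## delimiter is the character that separates the fields on each line, usually a comma
    ## quotechar encloses fields that contain special characters (the default is '"')
    mreader = csv.reader(csvfile, delimiter=',', quotechar='|') 

    ## each row is read as a list
    first_row = next(mreader)
    print(first_row)
    print(type(first_row))
    ## number of lines read so far
    print(mreader.line_num)
    for row in mreader:  
        print(', '.join(row))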

## Another way to read the data is DictReader
names = ['playerID','awardID','yearID','lgID','tie','notes']
with open(fpath) as csvfile: 
    ## fieldnames gives the column names of the csv file;
    ## because it is passed explicitly, a header line in the file is returned as an ordinary row
    reader = csv.DictReader(csvfile, fieldnames=names, 
        delimiter=',', quotechar='|')     
    for row in reader:   
        ## each row is a dict
        print(row[names[0]], row[names[1]], row[names[2]])
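
If the CSV file starts with a header row, `csv.DictReader` can take the column names from that row when `fieldnames` is omitted. The following is a minimal Python 3 sketch of that variant. It reads a small in-memory sample instead of the real file; the two data rows and the helper name `read_records` are made up for illustration, and the default quote character `"` is used.

```python
import csv
import io

# Made-up sample in the same column layout as the `names` list above.
# The second record quotes its notes field because it contains a comma.
SAMPLE = (
    'playerID,awardID,yearID,lgID,tie,notes\n'
    'examp01,Example Award,1990,AL,,\n'
    'examp02,Example Award,1991,NL,Y,"shared, two winners"\n'
)

def read_records(text_stream):
    """Return every data row as a dict keyed by the header row."""
    # With fieldnames omitted, DictReader uses the first row as the keys.
    reader = csv.DictReader(text_stream)
    return [dict(row) for row in reader]

records = read_records(io.StringIO(SAMPLE))
for rec in records:
    print(rec['playerID'], rec['yearID'], rec['notes'])
```

To try this on a real file, pass an open file object instead of the `io.StringIO` wrapper, for example one opened with `open(fpath, newline='')`.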

Data obtained from professional organizations also often comes as XLS files. The Python library for reading XLS files is xlrd.

The most important functions in xlrd are:
xlrd.open_workbook
workbook.sheet_by_name
workbook.sheet_by_index
sheet.cell(row_index, col_index)
cell.value
sheet.col_values(col_index, start_row_index, end_row_index)
sheet.row_values(row_index, start_col_index, end_col_index)
sheet.col_slice(col_index, start_row_index, end_row_index)
sheet.row_slice(row_index, start_col_index, end_col_index)
Download the source data file: http://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&33010do001_2009.xls&3301.0&Data%20Cubes&861A1F351DF2D978CA2577CF000DF18E&0&2009&03.11.2010&Latest
The file contains statistics on births in Australia.

#-*- coding:utf-8 -*-
import xlrd

DIR = 'C:/Users/Lucas/Downloads/'
fname = '33010do001_2009.xls'

# First open the workbook
mworkbook = xlrd.open_workbook(DIR+fname)

# Print all sheet names
sheet_names = mworkbook.sheet_names()
print('Sheet Names', sheet_names)

# Select the second sheet by name
msheet = mworkbook.sheet_by_name(sheet_names[1])

# Or get the same sheet by index
nsheet = mworkbook.sheet_by_index(1)
print ('Sheet name: %s' % nsheet.name)

# Pull the first row by index
row = msheet.row(0)  

# Pull the fifth row by index
row = msheet.row(4) 
# Print the values in the fifth row
for cell in row:
    print(cell.value)

# Print all values, iterating through rows and columns
#
num_cols = msheet.ncols   # Number of columns
num_rows = msheet.nrows   # Number of rows
for row_idx in range(0, num_rows):    # Iterate through rows
    row_values = []
    for col_idx in range(0, num_cols):  # Iterate through columns
        row_values.append(msheet.cell(row_idx, col_idx).value)

    ## Print the data of each row
    print(row_values)

## Use col_slice to get the cells of one column
col_cells = msheet.col_slice(2, 4, num_rows)
for cell in col_cells:
    print("-"*6)
    print(cell.value)

## Use col_values to get the values of one column
col_values = msheet.col_values(2, 4, num_rows)
print(col_values)
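
The nested loop over rows and columns above can also be written with `row_values`, and the list returned by `col_values` usually needs cleaning before any arithmetic. The sketch below shows both steps under two assumptions: xlrd returns number cells as floats and empty cells as empty strings, and the helper names `sheet_to_rows` and `numeric_values` are invented for this example. The functions touch only `nrows` and `row_values`, so any xlrd sheet object should work; the sample column is made up, not taken from the workbook.

```python
def sheet_to_rows(sheet):
    """Return the sheet contents as a list of rows (each a list of values)."""
    # row_values(i) returns the plain values of row i, so no inner
    # loop over the columns is needed.
    return [sheet.row_values(i) for i in range(sheet.nrows)]

def numeric_values(values):
    """Keep only numbers, dropping text and empty cells."""
    return [v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)]

# Stand-in values shaped like the result of col_values(): a header string,
# an empty cell, two numbers and a text marker. They are made up.
sample_column = ['Header', '', 1.0, 2.5, 'n.a.']
print(numeric_values(sample_column))
```

With the workbook opened above, the calls would be `sheet_to_rows(msheet)` and `numeric_values(msheet.col_values(2, 4, num_rows))`.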

 

