Basic File Reading and Simple Data Processing in Python
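These are notes from the free courses on DataQuest (this post covers the Python basics part). Some of the material is very basic (reading CSV files, string preprocessing, and so on); I'm posting it here as a record. It covers the following six cases:
- Find the lowest crime rate (reading a CSV file, splitting strings, filtering data with for loops and if tests)
- Discover weather pattern in LA (frequency counting with for loops and if tests)
- Building a Spell Checker (word frequency counts, string preprocessing, checking text against a dictionary, tallying correctly and incorrectly spelled words)
- Analyze NFL data (importing a file with the csv module, classes, functions, simple statistics with dictionaries and lists)
- What should you name your kid if you want them to be a US Congressperson? (data preprocessing, type conversion with int(), try-except statements, counting with dictionaries, saving the data you need)
- Which airline is delayed the most?
- Appendix: reading a txt file line by line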
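Case 1 Find the lowest crime rate
(reading a CSV file, splitting strings, filtering data with for loops and if tests)
crime_rates.csv is a single-sheet file with 73 rows and 2 columns. The first column is the city name (a string) and the second is the crime count (an integer). Everything is read into Python as a string, however, so the crime count is converted from a string to an int afterwards. The split and converted rows are stored in the list full_data. A for loop then finds the city with the lowest crime count (with an if test; the minimum is known to be 130) and stores its name in the variable city.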
# We know that the lowest crime rate is 130.
# This is the second column of the data.
# We need to find the corresponding value in the first column -- the city with the lowest crime rate.

# Let's load the csv file
f = open('crime_rates.csv', 'r')
data = f.read()
f.close()
rows = data.split('\n')

full_data = []
for row in rows:
    if not row:
        # Skip blank lines, e.g. the empty string left by a trailing newline
        continue
    split_row = row.split(",")
    # Every field is read as a string; convert the crime count to an int
    split_row[1] = int(split_row[1])
    full_data.append(split_row)

city = ""
lowest_crime_rate = 10000  # not used below; the known minimum, 130, is hard-coded
for item in full_data:
    if item[1] == 130:
        city = item[0]
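The loop above relies on knowing in advance that the minimum is 130. As a rough sketch of how the minimum could be found from the data instead, the example below uses the csv module and min(). The function name lowest_crime_city and the sample rows are hypothetical stand-ins, not part of the course material, and an in-memory string takes the place of crime_rates.csv so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample standing in for crime_rates.csv (city, crime count).
sample_csv = "Albuquerque,749\nAnaheim,371\nAnchorage,828\nArlington,503\n"

def lowest_crime_city(file_obj):
    """Return (city, crime_count) for the row with the smallest crime count."""
    reader = csv.reader(file_obj)
    # Convert the second column to int while reading each row.
    rows = [(name, int(count)) for name, count in reader]
    return min(rows, key=lambda pair: pair[1])

city, lowest_crime_rate = lowest_crime_city(io.StringIO(sample_csv))
print(city, lowest_crime_rate)
```

With a real file, the same function could be called as lowest_crime_city(open('crime_rates.csv', newline='')), assuming the file has no header row and exactly two columns.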
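Case 2 Discover weather pattern in LA
(frequency counting with for loops and if tests)
The input is a two-column text file with a header row. Load la_weather.csv, split it, store the rows in the variable weather_data, and drop the header. Then use a dictionary to count how often each type of weather occurs.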
weather_data = []
f = open("la_weather.csv", 'r')
data = f.read()
f.close()
rows = data.split('\n')
for row in rows:
    split_row = row.split(",")
    weather_data.append(split_row)
print(weather_data)

# Drop the header row
weather = weather_data[1:367]

weather_counts = {}
for item in weather:
    # item is a list, which cannot be used as a dictionary key, so count on a
    # single field. The weather type is assumed here to be the second column.
    kind = item[1]
    if kind in weather_counts:
        weather_counts[kind] = weather_counts[kind] + 1
    else:
        weather_counts[kind] = 1
print(weather_counts)
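The dictionary-based tally above can also be written with collections.Counter from the standard library. The following is only a sketch: the rows are made up, and the assumption that the weather type sits in the second column comes from the two-column description, not from the file itself.

```python
from collections import Counter

# Hypothetical rows in a [day, weather type] shape, header row included.
weather_data = [
    ["Day", "Type of Weather"],
    ["1", "Sunny"],
    ["2", "Sunny"],
    ["3", "Fog"],
    ["4", "Rain"],
    ["5", "Sunny"],
]

# Skip the header, then count the value in the second column of each row.
weather_counts = Counter(row[1] for row in weather_data[1:])
print(weather_counts.most_common())
```

Counter behaves like a dictionary, so weather_counts["Sunny"] works as before, and most_common() returns the types sorted from most to least frequent.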
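Case 3 Building a Spell Checker
(word frequency counts, string preprocessing, checking text against a dictionary, tallying correctly and incorrectly spelled words)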
# Normalize a token: strip punctuation and newlines, then lowercase it
def normalize(token):
    token = token.replace(".", "")
    token = token.replace(",", "")
    token = token.replace("'", "")
    token = token.replace(";", "")
    token = token.replace("\n", "")
    token = token.lower()
    return token

# List that will hold the normalized dictionary words
normalized_dictionary_tokens = []
# Open the file read-only
f = open("dictionary.txt", "r")
raw_data = f.read()
f.close()
# Split the string on spaces into individual words
data = raw_data.split(" ")
# Normalize each word with normalize()
for token in data:
    normalized_dictionary_tokens.append(normalize(token))
print(normalized_dictionary_tokens)

# Sort the story words into correctly spelled words and potential misspellings
# by looking each one up in the dictionary word list.
# Note: normalized_story_tokens is not defined in this snippet; it is assumed to
# have been built from the story text the same way as normalized_dictionary_tokens.
potential_misspellings = []
correctly_spelled = []
for token in normalized_story_tokens:
    if token in normalized_dictionary_tokens:
        correctly_spelled.append(token)
    else:
        potential_misspellings.append(token)
print(correctly_spelled)
print(potential_misspellings)
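Since the snippet above depends on files and a story token list that are not shown, here is a small self-contained sketch of the same idea. The two text strings are hypothetical stand-ins for the dictionary file and the story. This normalize() is a variant that strips all ASCII punctuation with str.translate, which is broader than the handful of characters removed above, and the dictionary is held in a set so each lookup does not scan a whole list.

```python
import string

def normalize(token):
    """Lowercase a token and strip ASCII punctuation and surrounding whitespace."""
    return token.translate(str.maketrans("", "", string.punctuation)).strip().lower()

# Hypothetical stand-ins for the contents of dictionary.txt and the story text.
dictionary_text = "the quick brown fox jumps over lazy dog"
story_text = "The quikc brown fox, jumps over the lazy dgo."

# A set gives fast membership tests compared with scanning a list.
dictionary_words = {normalize(token) for token in dictionary_text.split()}

correctly_spelled = []
potential_misspellings = []
for token in story_text.split():
    word = normalize(token)
    if word in dictionary_words:
        correctly_spelled.append(word)
    else:
        potential_misspellings.append(word)

print(correctly_spelled)
print(potential_misspellings)
```

Calling split() with no argument splits on any whitespace, so newlines in the input do not need to be removed separately.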
Case 4 Analyze NFL data
(importing a file with the csv module, classes, functions, simple statistics with dictionaries and lists)
import csv

class Team():
    def __init__(self, name):
        self.name = name
        # Read the whole file into a list of rows (each row is a list of strings)
        f = open("nfl.csv", 'r')
        csvreader = csv.reader(f)
        self.nfl = list(csvreader)
        f.close()

    def count_total_wins(self):
        # The code treats column 2 as the winning team
        count = 0
        for row in self.nfl:
            if row[2] == self.name:
                count = count + 1
        return count

    def wins_by_years(self):
        # The code treats column 0 as the year, stored as a string
        wins = {}
        years = ["2009", "2010", "2011", "2012", "2013"]
        for year in years:
            count = 0
            for row in self.nfl:
                if row[2] == self.name and row[0] == year:
                    count += 1
            wins[year] = count
        return wins

niners = Team("San Francisco 49ers")
niners_wins_by_year = niners.wins_by_years()
print("Niners_wins_by_year: ", niners_wins_by_year)
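wins_by_years() above walks the whole file once per year. One possible alternative, sketched below, tallies every year in a single pass with collections.Counter. The function wins_by_year and the sample rows are hypothetical: the rows are not real game results, and the column layout (year, week, winner, loser) is an assumption chosen only so that index 0 is the year and index 2 the winner, matching how the class indexes nfl.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical rows, not real results. Assumed layout: year, week, winner, loser.
sample_csv = (
    "2009,1,San Francisco 49ers,Team B\n"
    "2009,2,San Francisco 49ers,Team C\n"
    "2010,1,Team C,San Francisco 49ers\n"
    "2010,5,San Francisco 49ers,Team D\n"
)

def wins_by_year(rows, team_name):
    """Count a team's wins per year in one pass over the rows."""
    return Counter(row[0] for row in rows if row[2] == team_name)

rows = list(csv.reader(io.StringIO(sample_csv)))
niners_wins = wins_by_year(rows, "San Francisco 49ers")
print(dict(niners_wins))
```

A Counter returns 0 for a year with no wins, so niners_wins["2011"] can be read without first checking whether the key exists.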
Case 5 What should you name your kid if you want them to be a US Congressperson?
(data preprocessing, type conversion with int(), try-except statements, counting with dictionaries, saving the data you need)
# legislators is a 2-D list. Each inner list (one entry) has 7 elements: last name,
# first name, date of birth, and further fields that are not identified here; the
# code below reads index 3 as gender. The goal is to preprocess this data and then
# count names. legislators is assumed to be loaded already; this snippet does not define it.
genders_list = []
unique_genders = set()
unique_genders_list = []
# Append each row's gender to the list genders_list
for row in legislators:
    genders_list.append(row[3])
# set() removes duplicates from genders_list and returns a set (not a dictionary),
# stored in unique_genders; the result is then converted back to a list and stored
# in unique_genders_list
unique_genders = set(genders_list)
unique_genders_list = list(unique_genders)
print(genders_list)
# Invalid gender values are known to appear as ""; reassign them to "M"
for row in legislators:
    if row[3] == "":
        row[3] = "M"
# Collect the birth years in the list birth_years: split the date field on "-",
# take the first piece (the year), and append it
birth_years = []
for row in legislators:
    birth_list = row[2].split("-")
    birth_years.append(birth_list[0])
# enumerate() on a list yields the index and the current row, much as .items()
# on a dictionary yields each key and its value.
# Append the year to each row of legislators as an eighth column
for i, row in enumerate(legislators):
    row.append(birth_years[i])
# Convert the eighth column (birth year) of each row from a string to an int.
# If the conversion fails, set the birth year to 0
for row in legislators:
    try:
        row[7] = int(row[7])
    except Exception:
        row[7] = 0
# Count first names with a dictionary (key: name, value: number of occurrences),
# stored in male_name_counts. The most frequent names (several names may share the
# highest count) are then stored in the list top_male_names
top_male_names = []
male_name_counts = {}
# Count names for rows where the birth year is after 1940 and the gender is male
for row in legislators:
    if row[7] > 1940 and row[3] == "M":
        if row[1] in male_name_counts:
            male_name_counts[row[1]] += 1
        else:
            male_name_counts[row[1]] = 1
# Find the highest number of occurrences, highest_value
highest_value = None
for key, value in male_name_counts.items():
    if highest_value is None or value > highest_value:
        highest_value = value
# Append every name that reaches the highest count to top_male_names
for key, value in male_name_counts.items():
    if value == highest_value:
        top_male_names.append(key)
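Because legislators comes from the course environment, the steps above cannot be run as-is. The sketch below condenses the same pipeline (fill in a blank gender, parse the birth year with try-except, count names, pick the most frequent ones) onto a few made-up rows. The four-column row shape and the helper birth_year are hypothetical simplifications, and ValueError is caught here instead of the broader Exception.

```python
# Hypothetical rows: [last name, first name, birthday, gender]. The real rows
# have more columns; only the ones used here are included.
legislators = [
    ["Smith", "John", "1950-03-01", "M"],
    ["Jones", "John", "1961-07-12", ""],
    ["Brown", "Mary", "1948-11-30", "F"],
    ["Davis", "Robert", "1955-01-20", "M"],
    ["Clark", "Robert", "", "M"],
    ["Lewis", "John", "1930-05-05", "M"],
]

def birth_year(birthday):
    """Return the year of a 'YYYY-MM-DD' string as an int, or 0 if it cannot be parsed."""
    try:
        return int(birthday.split("-")[0])
    except ValueError:
        return 0

male_name_counts = {}
for last, first, birthday, gender in legislators:
    gender = gender or "M"  # a blank gender is treated as "M", as in the code above
    if gender == "M" and birth_year(birthday) > 1940:
        # dict.get supplies 0 the first time a name is seen
        male_name_counts[first] = male_name_counts.get(first, 0) + 1

highest_value = max(male_name_counts.values())
top_male_names = [name for name, count in male_name_counts.items()
                  if count == highest_value]
print(top_male_names)
```

Note that max() raises ValueError on an empty dictionary, so a real script would need to handle the case where no row passes the filter.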
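Case 6 Which airline is delayed the most?
I went back and forth on this case for several days, and in the end most of it was done by following the reference answer... so I more or less coasted through it...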
# column_names (the header row) and flight_delays (the data rows) are not defined
# in this snippet; they are assumed to be loaded already.

def column_number_from_name(column_name):
    # Return the index of the column with the given name, or None if not found
    column_number = None
    for i, column in enumerate(column_names):
        if column == column_name:
            column_number = i
    return column_number

def find_average_delay(carrier_name=None):
    # Average delay time per delayed flight, for one carrier or for all carriers
    total_delayed_flights = 0
    total_delay_time = 0
    delay_time_column = column_number_from_name("arr_delay")
    delay_number_column = column_number_from_name("arr_del15")
    carrier_column = column_number_from_name("carrier")
    for row in flight_delays:
        if carrier_name is None or row[carrier_column] == carrier_name:
            total_delayed_flights += float(row[delay_number_column])
            total_delay_time += float(row[delay_time_column])
    return total_delay_time / total_delayed_flights

delays_by_carrier = {}
carrier_column = column_number_from_name("carrier")
carriers = [row[carrier_column] for row in flight_delays]
unique_carriers = list(set(carriers))
for carrier in unique_carriers:
    delays_by_carrier[carrier] = find_average_delay(carrier)
print(delays_by_carrier)
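The code above rescans flight_delays once per carrier and divides without checking for zero. A possible single-pass variant is sketched below on made-up data. The function average_delay_by_carrier and all values are hypothetical; only the column names carrier, arr_del15 and arr_delay are taken from the code above.

```python
# Hypothetical header and rows; the values are made up for illustration.
column_names = ["carrier", "arr_del15", "arr_delay"]
flight_delays = [
    ["AA", "10", "500"],
    ["AA", "30", "1500"],
    ["UA", "20", "1600"],
]

def average_delay_by_carrier(column_names, rows):
    """Return {carrier: total delay time / number of delayed flights}."""
    carrier_col = column_names.index("carrier")
    count_col = column_names.index("arr_del15")
    time_col = column_names.index("arr_delay")

    totals = {}  # carrier -> [delayed flights, delay time]
    for row in rows:
        entry = totals.setdefault(row[carrier_col], [0.0, 0.0])
        entry[0] += float(row[count_col])
        entry[1] += float(row[time_col])

    # Leave out carriers with no delayed flights to avoid dividing by zero.
    return {carrier: time / count
            for carrier, (count, time) in totals.items() if count}

delays_by_carrier = average_delay_by_carrier(column_names, flight_delays)
print(delays_by_carrier)
```

list.index raises ValueError when a column name is missing, which fails earlier and more clearly than indexing a row with None.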
Appendix 1 Reading a txt file line by line
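# Method 1
f = open("foo.txt")              # returns a file object
line = f.readline()              # call the file's readline() method
while line:
    # Each line already ends with a newline, so end='' avoids printing a second
    # one (in Python 2 this was written as: print line,)
    print(line, end='')
    line = f.readline()
f.close()

# Method 2
for line in open("foo.txt"):
    print(line, end='')

# Method 3
f = open("c:\\1.txt", "r")
lines = f.readlines()            # read the whole file into a list of lines
for line in lines:
    print(line, end='')
f.close()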
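A fourth option, not in the original list, is to open the file in a with statement so that it is closed automatically. The sketch below writes a small temporary file first so that it does not depend on foo.txt existing; the file contents and the name lines_read are made up for illustration.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on foo.txt existing.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

lines_read = []
# The with statement closes the file automatically, even if an error occurs.
with open(path, encoding="utf-8") as f:
    for line in f:                        # a file object yields one line at a time
        lines_read.append(line.rstrip("\n"))

os.remove(path)                           # clean up the temporary file
print(lines_read)
```

Iterating over the file object reads one line at a time, so this approach also works for files too large to load with readlines().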