Getting Started with Data Analysis in Python: A Detailed Tutorial for Beginners


 

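  This is a typical data-analysis case study. I went through it from scratch myself and value the experience, so I have written it down in detail to help newcomers get started quickly. I hope you read to the end!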

  Requirement: parse an OBO file and output it in JSON dictionary format.

  The data format is as follows:
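  As an illustration, one stanza might look like the string in the sketch below, assuming the usual OBO layout of a `[Term]` header followed by `key: value` lines. The identifiers and names in `SAMPLE` are made-up placeholders, not real GO records, and `split_pair` is a hypothetical helper that is not part of the original code.

```python
# Illustrative stanza; the IDs and names are placeholders, not real GO data.
SAMPLE = """[Term]
id: EX:0000001
name: example term
is_a: EX:0000002 ! example parent
is_a: EX:0000003 ! another parent
synonym: "sample synonym" EXACT []
"""


def split_pair(line):
    """Split 'key: value' at the first colon and trim whitespace."""
    key, _, value = line.partition(":")
    return key.strip(), value.strip()


# The header line "[Term]" has no colon, so it is skipped here.
pairs = [split_pair(line) for line in SAMPLE.splitlines() if ":" in line]
for key, value in pairs:
    print(key, "->", value)
```

  Splitting at the first colon only matters because the values themselves (the IDs) contain colons.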

  

  

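  We treat one [Term] or [Typedef] block as one stanza (one "label"), and one line as one record, that is, one key-value pair. This is the convention used below.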

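  Now let's analyze the data.

  The data set contains two kinds of stanzas, [Term] and [Typedef]. Each stanza holds several key-value pairs and a unique identifier, id. Every record ends with "\n", and a stanza can contain several pairs with the same key, for example is_a, synonym, ...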
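  The observation that some keys repeat inside a stanza can be checked mechanically. The sketch below counts the keys of a single stanza with `collections.Counter`; the sample lines and the name `repeated_keys` are assumptions made for illustration.

```python
from collections import Counter

# Lines of one stanza (placeholder values, not real GO data)
stanza_lines = [
    "id: EX:0000001",
    "name: example term",
    "is_a: EX:0000002",
    "is_a: EX:0000003",
    'synonym: "a" EXACT []',
    'synonym: "b" EXACT []',
]


def repeated_keys(lines):
    """Return the keys that occur more than once, sorted by name."""
    counts = Counter(line.split(":")[0] for line in lines)
    return sorted(key for key, n in counts.items() if n > 1)


print(repeated_keys(stanza_lines))
```

  Keys reported here are the ones that need a list rather than a single string when stored.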

 

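  Algorithm design

  1. The data set contains two kinds of stanzas, [Term] and [Typedef], so we split the data into two data sets and process them separately.

  2. Loop over the data set, extract the key of each key-value pair, count each key, and deduplicate the keys.

   (My first idea was to compare the number of id keys with the counts of the other keys in order to find which keys repeat within a stanza, such as is_a. That was overthinking it.)

  3. The records inside a stanza always appear in the same order, with id first. Stored in a dict, the order came out scrambled for me (dicts only guarantee insertion order from Python 3.7 on), which was unpleasant to read, so we store the pairs in a list instead and stay as close as possible to the original data.

  4. Handle pairs that share a key. A dict cannot hold several values under one key, so we collect those values in a list, which means small lists nested inside the big list.

  5. Traverse the data set:

    (1) Store the key of each extracted key-value pair.

    (2) Use "[" as the end marker and store each value under its key; together these make up one stanza.

    (3) Gather the collected values, then reset the dicts and lists so they are ready for the next pass.

    (4) At this point only a single stanza has been processed; we still need a master list that holds all the stanzas.

    (The algorithm is still imperfect. I hope readers of this post will share their suggestions.)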
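Steps 3 and 4 can be sketched as follows: keep the pairs of one stanza in a list so that their order is preserved, and gather values that share a key into a nested list. This is a simplified illustration under my own assumptions, not the code used later in the post; `build_record` and the sample lines are hypothetical.

```python
def build_record(lines):
    """Turn the 'key: value' lines of one stanza into an ordered list of
    single-key dicts; values of a repeated key are collected in one list."""
    record = []     # keeps the order in which keys first appear
    position = {}   # key -> index of its dict inside record
    for line in lines:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key not in position:
            position[key] = len(record)
            record.append({key: value})
        else:
            entry = record[position[key]]
            if not isinstance(entry[key], list):
                entry[key] = [entry[key]]  # promote to a list on the 2nd value
            entry[key].append(value)
    return record


print(build_record(["id: EX:1", "is_a: EX:2", "name: demo", "is_a: EX:3"]))
```

Unlike the code further down, this variant stores only the keys that actually occur in a stanza, so no empty placeholders are produced.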

Code design and the pitfalls I hit:

1. Print all the keys

Code:

'''
Print all the keys
'''
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    for line in f:  # loop over every line
        labels = []  # empty list (re-created on every pass: the bug discussed in step 2)
        label = line.split(":")[0]  # text before the first colon, i.e. the key
        print(label)
        labels.append(label)  # add the key to the list

print(labels)  # shows only the key of the last line
 

 

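2. The previous step had a problem: I had not paid attention to where the variable is created. This is how I tracked it down: looking at the printed list, it held only the last value, which suggests the earlier values were being overwritten. The list was in fact re-created on every iteration, so I moved it out of the loop so that it is created only once.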

Code:

with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    labels = []  # created once, before the loop
    for line in f:  # loop over every line
        label = line.split(":")[0]  # key name; strictly speaking the keys still need deduplication
        labels.append(label)

counts = {}
for key in labels:
    counts[key] = counts.get(key, 0) + 1  # count how often each key occurs
print(counts)
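The comments in the code mention that the keys still need deduplication. One way to do that while keeping first-seen order is `dict.fromkeys`; the sketch below runs on a small in-memory list instead of go.obo, and `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    """Drop duplicates but keep the order of first appearance."""
    return list(dict.fromkeys(items))


labels = ["id", "name", "is_a", "is_a", "id", "synonym"]
print(unique_in_order(labels))

# A set also deduplicates, but its iteration order is arbitrary, so sort it
print(sorted(set(labels)))
```

Keeping first-seen order is convenient here because id comes first in every stanza.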

 

 

 

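3. Next we write the statistics to a txt file. Here a problem appeared: the output contained only the keys of the key-value pairs, without the values. Odd, but let's keep going.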

Code:

'''
Write the dict to a txt file
'''
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    labels = []  # created once, before the loop
    for line in f:  # loop over every line
        label = line.split(":")[0]  # key name
        labels.append(label)

print("################################################")
counts = {}
for key in labels:
    counts[key] = counts.get(key, 0) + 1
print(counts)

fileObject = open('sampleList.txt', 'w')
for ip in counts:  # iterating over a dict yields its keys only
    fileObject.write(ip)
    fileObject.write('\n')
fileObject.close()

 

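4. I usually work with JSON files, mainly for MongoDB, so I tried converting the result to JSON, and the problem went away. It felt like magic at the time and I was not sure where the problem had been. In hindsight, the for loop in step 3 iterates over the dict's keys only, so the values were never written, whereas json.dumps serializes keys and values together.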

Code:

import json

with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    labels = []  # created once, before the loop
    for line in f:  # loop over every line
        label = line.split(":")[0]  # key name
        labels.append(label)

print("################################################")
counts = {}
for key in labels:
    counts[key] = counts.get(key, 0) + 1
print(counts)

jsObj = json.dumps(counts)  # serialize keys and values together

fileObject = open('jsonFile.json', 'w')
fileObject.write(jsObj)
fileObject.close()
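For comparison, a plain-text dump that keeps the values has to iterate over `dict.items()` rather than over the dict itself. A minimal sketch, writing to an in-memory buffer instead of sampleList.txt; the counts are made-up numbers and `dump_counts` is a hypothetical helper.

```python
import io


def dump_counts(counts, out):
    """Write one 'key<TAB>value' line per entry to a file-like object."""
    for key, value in counts.items():  # items() yields (key, value) pairs
        out.write(f"{key}\t{value}\n")


counts = {"id": 3, "is_a": 5}  # made-up counts for illustration
buffer = io.StringIO()
dump_counts(counts, buffer)
print(buffer.getvalue(), end="")
```

Passing an open text file instead of the `StringIO` buffer writes the same lines to disk.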

 

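5. Next I started with a simple test: take a small part of the data, three stanzas, and read just two fields (id and is_a) from each stanza.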

Code:

with open('nitian', 'r', encoding="utf-8") as f:  # open the file
    total = []   # all stanzas
    newdic = []  # the stanza currently being built

    # First initialization: every field needs two variables
    id = {}
    id_number = ""  # a field that occurs once per stanza is a string
    is_a = {}
    is_a_list = []  # a field that can occur several times is a list

    for line in f.readlines():  # loop over every line
        label = line.split(":")[0]  # key name
        if label == "id":
            id_number = line[3:]  # skip "id" plus the colon (3 characters)
        elif label == "is_a":
            is_a_list.append(line[5:].split('\n'))
        elif line[0] == "[":
            # store the collected data in newdic
            id["id"] = id_number
            newdic.append(id)

            is_a["is_a"] = is_a_list
            newdic.append(is_a)

            # store newdic in the master list
            total.append(newdic)

            # reset everything for the next stanza
            id = {}
            id_number = ""
            is_a = {}
            is_a_list = []
            newdic = []

    total.append(newdic)

print(total)

 

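6. At this point a lot of problems showed up, which means the algorithm design was flawed.

The output started with a series of empty entries, {id: ""}, {name: ""}, {}, {}, ..., that is, one extra record holding nothing but the initial values. Going back over the algorithm, I found the cause: we use "[" to decide that a stanza has ended, so the very first "[" stores a record that is still empty.

Fixes: (1) treat "[" as the start of a stanza instead, or

    (2) modify the data: remove the [Term] at the very beginning and append it to the end of the data set.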
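Fix (1) can be sketched like this: a line starting with "[" opens a new stanza, and the last stanza is flushed after the loop, so no empty record appears at the front and the data file does not need editing. `split_stanzas` and the sample text are assumptions for illustration; lines before the first header are simply skipped here.

```python
def split_stanzas(lines):
    """Group lines into stanzas; each stanza is (header, [body lines])."""
    stanzas = []
    current = None  # no stanza is open yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("["):
            if current is not None:
                stanzas.append(current)  # close the previous stanza
            current = (line, [])
        elif current is not None:
            current[1].append(line)
    if current is not None:
        stanzas.append(current)  # flush the final stanza
    return stanzas


text = "format-version: 1.2\n\n[Term]\nid: EX:1\n\n[Typedef]\nid: part_of\n"
for header, body in split_stanzas(text.splitlines()):
    print(header, body)
```

Because the header is kept with each stanza, [Term] and [Typedef] records can be separated afterwards, as step 1 of the algorithm design asks for.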

 

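7. The values kept ending with meaningless whitespace (" "). We found that we had not cleaned up what trails each key-value pair, namely the line break, so we brought in strip(). Note that strip() only works on strings: to apply it to a list, you first have to take the elements out of the list and strip each one.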
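A small sketch of that point, using made-up values: `str.strip()` is applied to each element through a list comprehension, while calling it on the list itself raises an error.

```python
raw_values = ["EX:0000001 \n", " part_of\t", "demo\n"]  # placeholder values

# str.strip() works on one string, so apply it to each element in turn
clean_values = [value.strip() for value in raw_values]
print(clean_values)

# Calling strip() on the list itself fails, because lists have no such method
try:
    raw_values.strip()
except AttributeError as err:
    print("error:", err)
```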

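8. The key def clashes with a Python keyword when it is used as a variable name. Our fix is blunt: write the variable name in upper case (DEF).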
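The clash only affects identifiers, not strings, which the standard `keyword` module makes easy to check. In the sketch below, `record` and its value are made up for illustration.

```python
import keyword

print(keyword.iskeyword("def"))  # True: 'def' cannot be used as a variable name
print(keyword.iskeyword("DEF"))  # False: keywords are case-sensitive

record = {}
record["def"] = "some definition text"  # as a string key, "def" is fine
DEF = record["def"]                     # upper-case variable name, as in the text
print(DEF)
```

So the JSON output can still use the lower-case key "def"; only the Python variable needs a different name.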

9. The complete code is as follows:

Code:

import json


class GeneOntology(object):

    def __init__(self, path):
        self.path = path
        self.total = []

    # Idea for later: use a dictionary keyed by record name to remove the
    # repeated variables and simplify the procedure.

    # Define a function to read and store data
    def read_storage_data(self):

        id = {}         # Use a dictionary to store each keyword
        id_number = ""  # Store the value of each row as a string
        is_obsolete = {}
        is_obsolete_number = ""
        is_class_level = {}
        is_class_level_number = ""
        transitive_over = {}
        transitive_over_number = ""
        # "def" is a Python keyword, so the variable is named DEF here.
        DEF = {}
        DEF_number = ""
        property_value = {}
        property_value_number = ""
        namespace = {}
        namespace_number = ""
        comment = {}
        comment_number = ""
        intersection_of = {}
        intersection_of_number = ""
        xref = {}
        xref_number = ""
        name = {}
        name_number = ""
        disjoint_from = {}
        disjoint_from_number = ""
        replaced_by = {}
        replaced_by_number = ""
        relationship = {}
        relationship_number = ""
        alt_id = {}
        alt_id_number = ""
        holds_over_chain = {}
        holds_over_chain_number = ""
        subset = {}
        subset_number = ""
        expand_assertion_to = {}
        expand_assertion_to_number = ""
        is_transitive = {}
        is_transitive_number = ""
        is_metadata_tag = {}
        is_metadata_tag_number = ""
        inverse_of = {}
        inverse_of_number = ""
        created_by = {}
        created_by_number = ""
        creation_date = {}
        creation_date_number = ""
        consider = {}
        consider_number = ""

        is_a = {}
        is_a_list = []  # A field name may have multiple values, so it is stored in a list.
        synonym = {}
        synonym_list = []

        newdic = []
        with open(self.path, 'r', encoding="utf-8") as f:
            for line in f.readlines():
                # Field name: everything before the first ":"
                label = line.split(":")[0]

                # Decide where to store the value. The slice skips the field
                # name and the colon; strip() removes surrounding whitespace
                # and the trailing line break.
                if label == "id":
                    id_number = line[3:].strip()
                elif label == "is_obsolete":
                    is_obsolete_number = line[12:].strip()
                elif label == "is_class_level":
                    is_class_level_number = line[15:].strip()
                elif label == "transitive_over":
                    transitive_over_number = line[16:].strip()
                elif label == "def":
                    DEF_number = line[4:].strip()
                elif label == "property_value":
                    property_value_number = line[15:].strip()
                elif label == "namespace":
                    namespace_number = line[10:].strip()
                elif label == "comment":
                    comment_number = line[8:].strip()
                elif label == "intersection_of":
                    intersection_of_number = line[16:].strip()
                elif label == "xref":
                    xref_number = line[5:].strip()
                elif label == "name":
                    name_number = line[5:].strip()
                elif label == "disjoint_from":
                    disjoint_from_number = line[14:].strip()
                elif label == "replaced_by":
                    replaced_by_number = line[12:].strip()
                elif label == "relationship":
                    relationship_number = line[13:].strip()
                elif label == "alt_id":
                    alt_id_number = line[7:].strip()
                elif label == "holds_over_chain":
                    holds_over_chain_number = line[17:].strip()
                elif label == "subset":
                    subset_number = line[7:].strip()
                elif label == "expand_assertion_to":
                    expand_assertion_to_number = line[20:].strip()
                elif label == "is_transitive":
                    is_transitive_number = line[14:].strip()
                elif label == "is_metadata_tag":
                    is_metadata_tag_number = line[16:].strip()
                elif label == "inverse_of":
                    inverse_of_number = line[11:].strip()
                elif label == "created_by":
                    created_by_number = line[11:].strip()
                elif label == "creation_date":
                    creation_date_number = line[14:].strip()
                elif label == "consider":
                    consider_number = line[9:].strip()
                elif label == "is_a":
                    is_a_list.append(line[5:].strip().split('\n'))
                elif label == "synonym":
                    synonym_list.append(line[8:].strip().split('\n'))

                # "[" marks the end of a stanza here, so the data file must
                # end with a stanza header (see step 6, fix (2)).
                # To use "[" as the start instead, the storage logic would
                # have to change.
                elif line[0] == "[":

                    # Assign the values and store the data in newdic
                    id["id"] = id_number
                    newdic.append(id)
                    is_obsolete["is_obsolete"] = is_obsolete_number
                    newdic.append(is_obsolete)
                    is_class_level["is_class_level"] = is_class_level_number
                    newdic.append(is_class_level)
                    transitive_over["transitive_over"] = transitive_over_number
                    newdic.append(transitive_over)
                    DEF["def"] = DEF_number
                    newdic.append(DEF)
                    property_value["property_value"] = property_value_number
                    newdic.append(property_value)
                    namespace["namespace"] = namespace_number
                    newdic.append(namespace)
                    comment["comment"] = comment_number
                    newdic.append(comment)
                    intersection_of["intersection_of"] = intersection_of_number
                    newdic.append(intersection_of)
                    xref["xref"] = xref_number
                    newdic.append(xref)
                    name["name"] = name_number
                    newdic.append(name)
                    disjoint_from["disjoint_from"] = disjoint_from_number
                    newdic.append(disjoint_from)
                    replaced_by["replaced_by"] = replaced_by_number
                    newdic.append(replaced_by)
                    relationship["relationship"] = relationship_number
                    newdic.append(relationship)
                    alt_id["alt_id"] = alt_id_number
                    newdic.append(alt_id)
                    holds_over_chain["holds_over_chain"] = holds_over_chain_number
                    newdic.append(holds_over_chain)
                    subset["subset"] = subset_number
                    newdic.append(subset)
                    expand_assertion_to["expand_assertion_to"] = expand_assertion_to_number
                    newdic.append(expand_assertion_to)
                    is_transitive["is_transitive"] = is_transitive_number
                    newdic.append(is_transitive)
                    is_metadata_tag["is_metadata_tag"] = is_metadata_tag_number
                    newdic.append(is_metadata_tag)
                    inverse_of["inverse_of"] = inverse_of_number
                    newdic.append(inverse_of)
                    created_by["created_by"] = created_by_number
                    newdic.append(created_by)
                    creation_date["creation_date"] = creation_date_number
                    newdic.append(creation_date)
                    consider["consider"] = consider_number
                    newdic.append(consider)
                    is_a["is_a"] = is_a_list
                    newdic.append(is_a)
                    synonym["synonym"] = synonym_list
                    newdic.append(synonym)

                    # Save newdic in the total data set
                    self.total.append(newdic)

                    # Re-initialize everything for the next stanza
                    id = {}
                    id_number = ""
                    is_obsolete = {}
                    is_obsolete_number = ""
                    is_class_level = {}
                    is_class_level_number = ""
                    transitive_over = {}
                    transitive_over_number = ""
                    DEF = {}
                    DEF_number = ""
                    property_value = {}
                    property_value_number = ""
                    namespace = {}
                    namespace_number = ""
                    comment = {}
                    comment_number = ""
                    intersection_of = {}
                    intersection_of_number = ""
                    xref = {}
                    xref_number = ""
                    name = {}
                    name_number = ""
                    disjoint_from = {}
                    disjoint_from_number = ""
                    replaced_by = {}
                    replaced_by_number = ""
                    relationship = {}
                    relationship_number = ""
                    alt_id = {}
                    alt_id_number = ""
                    holds_over_chain = {}
                    holds_over_chain_number = ""
                    subset = {}
                    subset_number = ""
                    expand_assertion_to = {}
                    expand_assertion_to_number = ""
                    is_transitive = {}
                    is_transitive_number = ""
                    is_metadata_tag = {}
                    is_metadata_tag_number = ""
                    inverse_of = {}
                    inverse_of_number = ""
                    created_by = {}
                    created_by_number = ""
                    creation_date = {}
                    creation_date_number = ""
                    consider = {}  # was missing from the reset, so values leaked into later stanzas
                    consider_number = ""
                    is_a = {}
                    is_a_list = []
                    synonym = {}
                    synonym_list = []

                    # Initialize newdic
                    newdic = []

        # Do not append newdic again after the loop: it is empty at this
        # point and would leave an empty [] at the end of total.


if __name__ == "__main__":
    class1 = GeneOntology('go (1).obo')
    class1.read_storage_data()
    print(class1.total)

    jsObj = json.dumps(class1.total)
    with open('jsonFile8.json', 'w') as fileObject:
        fileObject.write(jsObj)
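When writing the result, `json.dump` with `indent` and `ensure_ascii=False` produces a file that is easier to inspect by eye. The sketch below writes a small placeholder list shaped like `total` to a temporary file and reads it back to confirm nothing is lost; the data and the file name demo.json are assumptions for illustration.

```python
import json
import os
import tempfile

# Placeholder data shaped like the parser's output: a list of stanzas,
# each stanza a list of single-key dicts
total = [[{"id": "EX:1"}, {"is_a": [["EX:2"], ["EX:3"]]}]]

path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(total, fh, indent=2, ensure_ascii=False)  # pretty-printed output

with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == total)
```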
 
 
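10. Summary: working through this showed me my own weaknesses. Don't lean on the code itself; what matters is the process of thinking the problem through and expressing your logic clearly. Then, wherever something goes wrong, look up the matching solution!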

