Getting Started with Data Analysis in Python: A Detailed Tutorial to Get Beginners Up to Speed

Reposted article. 2018-12-08 00:53

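This is a typical data-analysis case study. I went through it myself, starting from nothing, and the experience was valuable enough that I have written it down in detail to help newcomers get started quickly. I hope you read to the end!

Requirement: parse an obo file and output the result in JSON (dictionary) format.

The data format is as follows:

We treat one [Term] or [Typedef] block as one stanza, and one line as one record, that is, one key-value pair. Everything below follows this convention.

Now let's analyze the data:

The data set contains two kinds of stanza, [Term] and [Typedef]. Each stanza holds several key-value pairs and a unique identifier, id. Each record ends with "\n", and a stanza may contain several key-value pairs that share the same key, for example is_a, synonym, ...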
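
To make this structure concrete, here is a minimal sketch. The sample text is made up for illustration (the ids and names are hypothetical, not taken from go.obo), and `split_record` is a helper name introduced here. It splits each record at the first colon only, which matters because values such as ids contain colons themselves.

```python
# Hypothetical sample in the layout described above (not real GO data).
SAMPLE = """[Term]
id: EX:0000001
name: example term
is_a: EX:0000002 ! parent one
is_a: EX:0000003 ! parent two

[Typedef]
id: part_of
name: part of
"""


def split_record(line):
    """Split one 'key: value' record at the first colon only."""
    key, sep, value = line.partition(":")
    if not sep:
        return None  # blank line or stanza header such as [Term]
    return key.strip(), value.strip()


records = [split_record(line) for line in SAMPLE.splitlines()]
records = [r for r in records if r is not None]
for key, value in records:
    print(key, "->", value)
```

Note that is_a shows up twice inside the same stanza, which is the repeated-key case the design below has to handle.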

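Algorithm design:

1. The data set contains two kinds of stanza, [Term] and [Typedef], so we split the data into two sets and process them separately.

2. Loop over the data set, extract the key of each key-value pair, count every key, and deduplicate the keys.

(My first idea was to compare the number of id keys with the counts of the other keys in order to find which keys, such as is_a, repeat within a stanza. That was overthinking it.)

3. The records in every stanza always come in a fixed order, with id first. Stored in a dictionary, the order came out scrambled, which was unpleasant to read, so we store them in a list instead, which keeps the result as close as possible to the original data.

4. Handle key-value pairs that share a key. A dictionary cannot hold several values under one key, so we put those values into a list, which amounts to small lists nested inside the big list.

5. Loop over the data set:

(1) Store the value taken from each key-value pair.

(2) Treat "[" as the end marker: when it appears, store each value under its corresponding key, which completes one stanza.

(3) Collect the stored values together, then reset the dictionaries and lists so they are ready for the next iteration.

(4) At this point only a single stanza has been processed; a top-level list is still needed to hold all the stanzas.

(The algorithm here is still imperfect; I hope readers of this post will offer suggestions.)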
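
Steps 3 and 4 can be sketched as follows. This is an illustration under assumptions, not the code used later in the post: `build_stanza` and `multi_keys` are names introduced here, and the input pairs are made up.

```python
def build_stanza(pairs, multi_keys=("is_a", "synonym")):
    """Turn the (key, value) pairs of one stanza into a list of one-key dicts.

    Keys listed in multi_keys may repeat, so their values are gathered
    into a list; every other key keeps a single string value.
    """
    stanza = []      # ordered list, mirrors the order of the input
    position = {}    # key -> index of its dict inside stanza
    for key, value in pairs:
        if key in position:
            entry = stanza[position[key]]
            if key in multi_keys:
                entry[key].append(value)
            else:
                entry[key] = value  # repeated single-valued key: last one wins
        else:
            position[key] = len(stanza)
            stanza.append({key: [value] if key in multi_keys else value})
    return stanza


pairs = [
    ("id", "EX:0000001"),
    ("name", "example term"),
    ("is_a", "EX:0000002"),
    ("is_a", "EX:0000003"),
]
print(build_stanza(pairs))
```

Since Python 3.7 a plain dict also preserves insertion order, so on a recent interpreter a single dict per stanza would keep id first as well; the list of one-key dicts simply matches the layout chosen in this post.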

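Code design and the pitfalls I ran into:

1. Print all the keys

Code:

'''
Print all the keys
'''
with open('go.obo','r',encoding="utf-8") as f:         # open the file

    for  line in f.readlines():                         # loop over every line of the data
        list = []  # empty list (re-created on every iteration; see step 2)
        lable = line.split(":")[0]                      # read the key name
        print(lable)
        list.append(lable)                   # use append() to add an element to the list
        # print(list)

        #print(lable)

    # lst2 = list(set(lst1))
    # print(lst2)
    print(list)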

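2. The previous step had a problem: the list was created inside the loop, so it was reset on every iteration (at the time I thought of this as not telling local and global variables apart). How I found it: I looked at the value printed for list and saw that it held only the last key, which suggests the value is being overwritten. Having found the cause, I moved list out of the loop.

Code:


with open('go.obo','r',encoding="utf-8") as f:         # open the file
    # dict = {}
    list = []  # empty list

    for  line in f.readlines():                         # loop over every line of the data
        total = []
        lable = line.split(":")[0]                      # read the key name; strictly speaking the keys should also be deduplicated afterwards
        # print(lable)
        # list.append(lable)                   # use append() to add an element to the list
        # print(list)                            done this way, list only ever holds one element
        list.append(lable)



        #print(lable)
    # lst2 = list(set(lst1))
    # print(lst2)

    # print(list)
    dict = {}
    for key in list:
        dict[key] = dict.get(key, 0) + 1     # count how often each key occurs
    print(dict)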
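
The counting loop above can also be written with `collections.Counter` from the standard library. A minimal sketch, using an in-memory list of made-up lines in place of go.obo; `key_counts` is a name introduced here:

```python
from collections import Counter

# Made-up lines standing in for the contents of go.obo.
lines = [
    "id: EX:0000001\n",
    "name: example term\n",
    "is_a: EX:0000002\n",
    "is_a: EX:0000003\n",
    "id: EX:0000004\n",
]

# Take the text before the first colon of every line and count it.
key_counts = Counter(line.split(":")[0] for line in lines)

print(key_counts)        # how often each key occurs
print(list(key_counts))  # the distinct keys, in first-seen order
```

Counting and deduplication come out of the same object here, and no variable has to shadow the built-in names `list` and `dict`.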

3. We write the counts to a txt file. Here a problem appears: the output contains only the keys and none of the values. Rather funny; let's keep going.

Code:

'''
Write dict to a txt file
'''
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    # dict = {}
    list = []  # empty list

    for line in f.readlines():  # loop over every line of the data
        total = []
        lable = line.split(":")[0]  # read the key name; strictly speaking the keys should also be deduplicated afterwards
        # print(lable)
        # list.append(lable)                   # use append() to add an element to the list
        # print(list)                            done this way, list only ever holds one element
        list.append(lable)

        # print(lable)
    print("################################################")
    # lst2 = list(set(lst1))
    # print(lst2)

    # print(list)
    dict = {}
    for key in list:
        dict[key] = dict.get(key, 0) + 1
    print(dict)

fileObject = open('sampleList.txt', 'w')

for ip in dict:                # iterating over a dict yields its keys only
   fileObject.write(ip)
   fileObject.write('\n')

fileObject.close()

4. Since I usually work with JSON files, mostly for mongo, I tried converting the result to JSON and the problem went away. That seemed rather magical, and I was not sure where the problem had been.

Code:

import json
with open('go.obo', 'r', encoding="utf-8") as f:  # open the file
    # dict = {}
    list = []  # empty list

    for line in f.readlines():  # loop over every line of the data
        total = []
        lable = line.split(":")[0]  # read the key name; strictly speaking the keys should also be deduplicated afterwards
        # print(lable)
        # list.append(lable)                   # use append() to add an element to the list
        # print(list)                            done this way, list only ever holds one element
        list.append(lable)

        # print(lable)
    print("################################################")
    # lst2 = list(set(lst1))
    # print(lst2)

    # print(list)
    dict = {}
    for key in list:
        dict[key] = dict.get(key, 0) + 1
    print(dict)

# The txt output is disabled, so the file is no longer opened either.
# fileObject = open('sampleList.txt', 'w')
#
# for ip in dict:
#  fileObject.write(ip)
#  fileObject.write('\n')
#
# fileObject.close()

jsObj = json.dumps(dict)       # serializes both keys and values

fileObject = open('jsonFile.json', 'w')
fileObject.write(jsObj)
fileObject.close()
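
The missing values in step 3 come from how iteration works: looping over a dictionary yields only its keys, whereas `json.dumps` serializes keys and values together. A minimal sketch of writing both to a text file with `items()` follows. The counts and the tab-separated layout are assumptions made for illustration, and a temporary directory is used so that no existing file is overwritten.

```python
import os
import tempfile

# Made-up counts standing in for the result of the counting step.
key_counts = {"id": 3, "name": 3, "is_a": 5}

out_path = os.path.join(tempfile.mkdtemp(), "sampleList.txt")

# items() yields (key, value) pairs; a plain "for key in key_counts"
# would yield the keys only, which is why the values went missing.
with open(out_path, "w", encoding="utf-8") as out:
    for key, count in key_counts.items():
        out.write(f"{key}\t{count}\n")

with open(out_path, encoding="utf-8") as check:
    written = check.read()
print(written)
```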

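5. Next I ran a simple test first: I extracted part of the data, three stanzas, and took two of the keys from each stanza.

Code:

with open('nitian','r',encoding="utf-8") as f:         # open the file
    # dic={}                                           # new dictionary
    total = []                                         # list of all stanzas
    newdic = []                                        # list for the current stanza


    # First initialization happens here
    # Every field needs two variables
    id = {}
    id_number = ""   # a key that occurs on one line per stanza is stored as a string
    is_a = {}
    is_a_list = []   # a key that occurs on several lines is stored as a list


    for  line in f.readlines():                         # loop over every line of the data
        lable = line.split(":")[0]                      # read the key name; strictly speaking the keys should also be deduplicated afterwards
        #print(lable)
        # Start checking the text in front of the colon
        if lable == "id":
            id_number = line[3:]   # skip "id" (two letters) plus one colon
        elif lable == "is_a":
            is_a_list.append(line[5:].split('\n'))

        elif line[0] == "[":
            # store the data in newdic[]
            id["id"] = id_number
            newdic.append(id)

            is_a["is_a"] = is_a_list
            newdic.append(is_a)

            # store newdic in the overall list
            total.append(newdic)
            # re-initialize everything for the next stanza
            id = {}
            id_number = ""
            is_a = {}
            is_a_list = []

            # re-initialize the small newdic
            newdic = []

    total.append(newdic)

print(total)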

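6. At this point a lot of problems showed up; in other words, the algorithm design itself was flawed.

The output began with a run of empty entries, {id: ""} {name: ""} {}, {} ..., because of one extra round of initialization. Going back over the algorithm, I found the cause: we use "[" to decide that a stanza has ended.

Possible fixes: (1) treat the "[" character as the start marker instead

(2) modify the data: remove the [Term] at the very beginning and append it to the end of the data set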
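
Fix (1) can be sketched as below. This is an illustration under assumptions rather than the code used in this post: `split_stanzas` and the "type" and "records" keys are names introduced here, and the sample lines are made up. Because a stanza is appended at the moment its header is seen, no empty record is produced before the first header and the last stanza is not lost.

```python
def split_stanzas(lines):
    """Group lines into stanzas, opening a new one at every '[' header."""
    stanzas = []
    current = None  # stays None until the first header, so file headers are skipped
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            current = {"type": line.strip("[]"), "records": []}
            stanzas.append(current)
        elif line and current is not None:
            current["records"].append(line)
    return stanzas


sample = [
    "format-version: 1.2\n",   # header line before any stanza
    "\n",
    "[Term]\n",
    "id: EX:0000001\n",
    "is_a: EX:0000002\n",
    "\n",
    "[Typedef]\n",
    "id: part_of\n",
]
for stanza in split_stanzas(sample):
    print(stanza["type"], stanza["records"])
```

With this approach the input file does not need to be edited, which is the drawback of fix (2).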

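7. Meaningless whitespace kept showing up at the end of the values. It turned out that we had not handled the trailing characters after each key-value pair, so we brought in strip(). However, strip() only works on strings; to apply it to a list you first have to take the elements out of the list and then operate on each of them.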
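
A minimal sketch of applying strip() to every element of a list with a list comprehension; the values are made up for illustration:

```python
raw_values = ["EX:0000002 ! parent one\n", "  EX:0000003 ! parent two \n"]

# str.strip() exists on strings only, so apply it to each element in turn.
clean_values = [value.strip() for value in raw_values]
print(clean_values)
```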

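8. The key def conflicts with a Python keyword when used as a variable name. Our fix was simple and crude: we wrote the variable name in upper case.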
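
The conflict only concerns variable names, as this small sketch with the standard `keyword` module shows; the `record` dictionary and its contents are made up for illustration:

```python
import keyword

# "def" is reserved, so it cannot be used as a variable name ...
print(keyword.iskeyword("def"))   # True
# ... but the upper-case spelling is an ordinary identifier,
print(keyword.iskeyword("DEF"))   # False

# and as a string key inside a dictionary "def" needs no renaming at all.
record = {"def": "example definition text"}
print(record["def"])
```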

9. The complete code is as follows:

Code:

import json


class GeneOntology(object):

    def __init__(self, path):
        self.path = path
        self.total = []

    # Use a dictionary to remove extra values to Simplified procedure
    # def rebuild_list(self,record_name):
    #     records = {id,is_a}
    #
    #     list = rebuile_list('HEADER'')
    #     records.get(record_name)


    # Define a function to read and store data
    def read_storage_data(self):

        id = {}         #Use a dictionary to store each keyword
        id_number = ""  # Store the value of each row as a string

        is_obsolete = {}
        is_obsolete_number = ""

        is_class_level = {}
        is_class_level_number = ""

        transitive_over = {}
        transitive_over_number = ""

        # There is a place where the keyword “def” conflicts, so I want to change the name here.
        DEF = {}
        DEF_number = ""

        property_value = {}
        property_value_number = ""

        namespace = {}
        namespace_number = ""

        comment = {}
        comment_number = ""

        intersection_of = {}
        intersection_of_number = ""

        xref = {}
        xref_number = ""

        name = {}
        name_number = ""

        disjoint_from = {}
        disjoint_from_number = ""

        replaced_by = {}
        replaced_by_number = ""

        relationship = {}
        relationship_number = ""

        alt_id = {}
        alt_id_number = ""

        holds_over_chain = {}
        holds_over_chain_number = ""

        subset = {}
        subset_number = ""

        expand_assertion_to = {}
        expand_assertion_to_number = ""

        is_transitive = {}
        is_transitive_number = ""

        is_metadata_tag = {}
        is_metadata_tag_number = ""

        inverse_of = {}
        inverse_of_number = ""

        created_by = {}
        created_by_number = ""

        creation_date = {}
        creation_date_number = ""

        consider = {}
        consider_number = ""

        is_a = {}
        is_a_list = []  # A field name may have multiple values, so it is stored in the form of a “list”.

        synonym = {}
        synonym_list = []

        newdic = []
        f = open(self.path, 'r', encoding="utf-8")
        for line in f.readlines():
            lable = line.split(":")[0]        # Read the list ‘name’, starting from the position of '0', ending with ":", reading all field names

            # View the name of the list that was read

            # print(lable)

            # Start to judge

            if lable == "id":                 # Judge the label for storage
                id_number = line[3:].strip()  # Remove the label and colon, occupy 3 positions, and strip() is used to remove the trailing spaces.

            elif lable == "is_obsolete":
                is_obsolete_number = line[12:].strip()

            elif lable == "is_class_level":
                is_class_level_number = line[15:].strip()

            elif lable == "transitive_over":
                transitive_over_number = line[16:].strip()

            elif lable == "def":
                DEF_number = line[5:].strip()

            elif lable == "property_value":
                property_value_number = line[15:].strip()

            elif lable == "namespace":
                namespace_number = line[10:].strip()

            elif lable == "comment":
                comment_number = line[8:].strip()

            elif lable == "intersection_of":
                intersection_of_number = line[16:].strip()

            elif lable == "xref":
                xref_number = line[5:].strip()

            elif lable == "name":
                name_number = line[5:].strip()

            elif lable == "disjoint_from":
                disjoint_from_number = line[14:].strip()

            elif lable == "replaced_by":
                replaced_by_number = line[12:].strip()

            elif lable == "relationship":
                relationship_number = line[13:].strip()

            elif lable == "alt_id":
                alt_id_number = line[7:].strip()

            elif lable == "holds_over_chain":
                holds_over_chain_number = line[17:].strip()

            elif lable == "subset":
                subset_number = line[7:].strip()

            elif lable == "expand_assertion_to":
                expand_assertion_to_number = line[20:].strip()

            elif lable == "is_transitive":
                is_transitive_number = line[14:].strip()

            elif lable == "is_metadata_tag":
                is_metadata_tag_number = line[16:].strip()

            elif lable == "inverse_of":
                inverse_of_number = line[11:].strip()

            elif lable == "created_by":
                created_by_number = line[11:].strip()

            elif lable == "creation_date":
                creation_date_number = line[14:].strip()

            elif lable == "consider":
                consider_number = line[9:].strip()


            elif lable == "is_a":
                is_a_list.append(line[5:].strip().split('\n'))

            elif lable == "synonym":
                synonym_list.append(line[8:].strip().split('\n'))




            # Put "[" as the end of the store.
            # If you want to "[" as the beginning of your storage, you will have to change the storage format of the data.

            elif line[0] == "[":

                # Assign values and store the data in newdic[]

                id["id"] = id_number
                newdic.append(id)

                is_obsolete["is_obsolete"] = is_obsolete_number
                newdic.append(is_obsolete)

                is_class_level["is_class_level"] = is_class_level_number
                newdic.append(is_class_level)

                transitive_over["transitive_over"] = transitive_over_number
                newdic.append(transitive_over)

                DEF["def"] = DEF_number
                newdic.append(DEF)

                property_value["property_value"] = property_value_number
                newdic.append(property_value)

                namespace["namespace"] = namespace_number
                newdic.append(namespace)

                comment["comment"] = comment_number
                newdic.append(comment)

                intersection_of["intersection_of"] = intersection_of_number
                newdic.append(intersection_of)

                xref["xref"] = xref_number
                newdic.append(xref)

                name["name"] = name_number
                newdic.append(name)

                disjoint_from["disjoint_from"] = disjoint_from_number
                newdic.append(disjoint_from)

                replaced_by["replaced_by"] = replaced_by_number
                newdic.append(replaced_by)

                relationship["relationship"] = relationship_number
                newdic.append(relationship)

                alt_id["alt_id"] = alt_id_number
                newdic.append(alt_id)

                holds_over_chain["holds_over_chain"] = holds_over_chain_number
                newdic.append(holds_over_chain)

                subset["subset"] = subset_number
                newdic.append(subset)

                expand_assertion_to["expand_assertion_to"] = expand_assertion_to_number
                newdic.append(expand_assertion_to)

                is_transitive["is_transitive"] = is_transitive_number
                newdic.append(is_transitive)

                is_metadata_tag["is_metadata_tag"] = is_metadata_tag_number
                newdic.append(is_metadata_tag)

                inverse_of["inverse_of"] = inverse_of_number
                newdic.append(inverse_of)

                created_by["created_by"] = created_by_number
                newdic.append(created_by)

                creation_date["creation_date"] = creation_date_number
                newdic.append(creation_date)

                consider["consider"] = consider_number
                newdic.append(consider)

                is_a["is_a"] = is_a_list
                newdic.append(is_a)

                synonym["synonym"] = synonym_list
                newdic.append(synonym)

                # Save newdic in the total data set
                self.total.append(newdic)

                # Initialize all new tags
                id = {}
                id_number = ""

                is_obsolete = {}
                is_obsolete_number = ""

                is_class_level = {}
                is_class_level_number = ""

                transitive_over = {}
                transitive_over_number = ""

                DEF = {}
                DEF_number = ""

                property_value = {}
                property_value_number = ""

                namespace = {}
                namespace_number = ""

                comment = {}
                comment_number = ""

                intersection_of = {}
                intersection_of_number = ""

                xref = {}
                xref_number = ""

                name = {}
                name_number = ""

                disjoint_from = {}
                disjoint_from_number = ""

                replaced_by = {}
                replaced_by_number = ""

                relationship = {}
                relationship_number = ""

                alt_id = {}
                alt_id_number = ""

                holds_over_chain = {}
                holds_over_chain_number = ""

                subset = {}
                subset_number = ""

                expand_assertion_to = {}
                expand_assertion_to_number = ""

                is_transitive = {}
                is_transitive_number = ""

                is_metadata_tag = {}
                is_metadata_tag_number = ""

                inverse_of = {}
                inverse_of_number = ""

                created_by = {}
                created_by_number = ""

                creation_date = {}
                creation_date_number = ""

                is_a = {}
                is_a_list = []

                synonym = {}
                synonym_list = []

                # Initialize newdic
                newdic = []

            # total.append(newdic)
        # self.total.append(newdic)             #You append an empty newdic, so there is an empty one behind []
        f.close()                               # Close the file that was opened before the loop


if __name__ == "__main__":
    class1 = GeneOntology('go (1).obo')
    class1.read_storage_data()
    print(class1.total)

    # Serialize the parsed stanzas and write them out as JSON
    jsObj = json.dumps(class1.total)
    with open('jsonFile8.json', 'w') as fileObject:
        fileObject.write(jsObj)
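
For comparison, the long if/elif chain can be condensed by treating the key name as data. The following is a sketch under assumptions, not a drop-in replacement for the class above: `parse_obo` and `MULTI_VALUED` are names introduced here, the sample text is made up, and the result is one dict per stanza rather than the list of one-key dicts built above.

```python
import json

# Keys assumed to repeat within a stanza; this post names is_a and synonym.
MULTI_VALUED = {"is_a", "synonym"}


def parse_obo(text):
    """Parse OBO-style text into a list of dicts, one per stanza."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):          # a header opens a new stanza
            current = {}
            stanzas.append(current)
            continue
        if current is None or not line:   # file header or blank line
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue                      # not a key-value record
        key, value = key.strip(), value.strip()
        if key in MULTI_VALUED:
            current.setdefault(key, []).append(value)
        else:
            current[key] = value
    return stanzas


sample = """format-version: 1.2

[Term]
id: EX:0000001
def: "An example definition." [EX:ref]
is_a: EX:0000002
is_a: EX:0000003
"""
result = parse_obo(sample)
print(json.dumps(result, indent=2))
```

Splitting at the first colon with `partition` removes the need for hand-counted slice offsets such as `line[12:]`, and a key that was not anticipated is still kept instead of being silently dropped.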

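10. Summary: working through this also showed me my own shortcomings. Do not lean on the code; what matters is the process of thinking the problem through and stating your own logic clearly. Then, wherever something goes wrong, look up the matching solution!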
