First Individual Programming Assignment

Reposted article; originally published 2020-09-15 01:13

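Paper Similarity Computation System

Algorithm and Flowchart

After a lot of searching and reading through blogs by various experts, I settled on a fairly simple algorithm (I'm a lazy dog, after all): segment the text directly with jieba (its usage is documented in detail on GitHub) and compute the TF-IDF value of each word (both the stop-word list and the TF-IDF corpus are the ones bundled with jieba), then build word vectors from the TF-IDF values, and finally compute the cosine similarity (the algorithm is simple, but the results seem pretty good).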

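For details on TF-IDF, see these two blog posts by Ruan Yifeng:
- Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction
- Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles

My algorithm differs slightly from the one in those posts. There, similarity is computed by first extracting the 20 keywords with the highest TF-IDF values from each of the two articles, taking their union, and then generating two word vectors from how frequently those keywords occur. Mine instead keeps every word that jieba extracts and uses the TF-IDF values themselves as the vector components.

Flowchart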
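
For comparison, the approach from the referenced blog posts can be sketched roughly as follows. This is a minimal sketch, not the blog's or this project's actual code: it assumes the texts are already tokenized and non-empty, and the helper names `top_keywords` and `frequency_cosine`, the `idf` lookup table, and the fallback IDF of 1.0 are all hypothetical.

```python
import math
from collections import Counter


def top_keywords(tokens, idf, k=20):
    """Return the k tokens with the highest TF-IDF score (hypothetical helper)."""
    counts = Counter(tokens)
    total = len(tokens)
    # TF-IDF = term frequency * inverse document frequency;
    # words missing from the idf table get 1.0 (an assumption of this sketch)
    scores = {w: (c / total) * idf.get(w, 1.0) for w, c in counts.items()}
    # sort by descending score, breaking ties alphabetically for stable output
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def frequency_cosine(tokens_a, tokens_b, idf, k=20):
    """Cosine similarity of relative-frequency vectors over the union of top keywords."""
    keywords = set(top_keywords(tokens_a, idf, k)) | set(top_keywords(tokens_b, idf, k))
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    # relative frequencies reduce the effect of differing text lengths
    vec_a = [count_a[w] / len(tokens_a) for w in keywords]
    vec_b = [count_b[w] / len(tokens_b) for w in keywords]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0
```

The difference from this project is in the vector components: the sketch above uses keyword frequencies over a top-20 union, whereas the system described here uses the TF-IDF weight of every extracted word.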

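Design and Implementation of the Computation Module Interface

(Something this headache-inducing is, of course, more pleasant to write in Python.)

Class definition

Only one class, Document, is defined. It stores the text that was read, the segmentation result, and a dictionary mapping each word to its TF-IDF value.

On initialization it reads the text from the given path, then calls self.__analyse() to analyse the text.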

    def __init__(self, path):

        self.__path = path
        self.__words = []
        self.__tfidf = {}

        # read txt file; the with-block closes it even if read() fails
        try:
            with open(path, encoding='utf-8') as file:
                self.__text = file.read()
        except (OSError, UnicodeDecodeError):
            # the message means "cannot read <path>"; re-raise the original error
            print("無法讀取 %s" % (self.__path))
            raise

        # analyse text
        self.__analyse()

self.__analyse() calls jieba.analyse.extract_tags() to segment the text and obtain the TF-IDF value of each word.

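    # analyse text (requires `import jieba.analyse` at the top of the module)
    def __analyse(self):
        # topK=0 keeps every extracted word instead of only the top K;
        # withWeight=True yields (word, TF-IDF weight) pairs
        for word, tfidf in jieba.analyse.extract_tags(self.__text, topK=0, withWeight=True):
            self.__words.append(word)
            self.__tfidf[word] = tfidf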

Two interface functions are provided.

get_words() returns the words found in the text as a list.

# get words list
def get_words(self) -> list:
    return self.__words

get_tfidf() looks up the TF-IDF value of a word in this text, returning 0 if the word does not occur in it.

# get TF-IDF
def get_tfidf(self, word) -> float:
    return self.__tfidf.get(word, 0)

Similarity function

It takes two Document objects and returns the cosine similarity of the two texts.

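import math

# calculate similarity
def caculate_similarity(doc_1: Document, doc_2: Document) -> float:

    keywords_1 = set(doc_1.get_words())
    keywords_2 = set(doc_2.get_words())
    if len(keywords_1) == 0 or len(keywords_2) == 0:
        # the message means "the text has no valid words; supply the path of a text that does"
        print("文本無有效詞匯，請輸入包含有效詞匯的文本路徑")
        # a bare `raise` has no active exception to re-raise, so raise RuntimeError explicitly
        raise RuntimeError("document contains no valid words")

    # union of both vocabularies; `keywords_1 and keywords_2` would evaluate to keywords_2 alone
    keywords = keywords_1 | keywords_2

    # build vector: one (weight in doc_1, weight in doc_2) pair per keyword
    vector = []
    for keyword in keywords:
        vector.append((doc_1.get_tfidf(keyword), doc_2.get_tfidf(keyword)))

    # calculate cosine similarity: dot product divided by the product of the norms
    v_1 = 0
    v_2 = 0
    v = 0
    for i in vector:
        v_1 += i[0] ** 2
        v_2 += i[1] ** 2
        v += (i[0] * i[1])

    similarity = v / math.sqrt(v_1 * v_2)

    # return the calculated result
    return similarity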
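
To make the arithmetic concrete, here is a small worked example. It is only a sketch: `StubDocument` is a hypothetical stand-in that exposes the same two accessors as Document but takes hand-written weights instead of running jieba, and `cosine` is a compact restatement of the cosine formula over the union of both vocabularies.

```python
import math


class StubDocument:
    """Hypothetical stand-in exposing the same two accessors as Document."""

    def __init__(self, weights):
        self._weights = dict(weights)

    def get_words(self):
        return list(self._weights)

    def get_tfidf(self, word):
        # words that do not occur in the text weigh 0
        return self._weights.get(word, 0)


def cosine(doc_1, doc_2):
    """Cosine similarity over the union of both documents' words."""
    words = set(doc_1.get_words()) | set(doc_2.get_words())
    dot = sum(doc_1.get_tfidf(w) * doc_2.get_tfidf(w) for w in words)
    norm_1 = math.sqrt(sum(doc_1.get_tfidf(w) ** 2 for w in words))
    norm_2 = math.sqrt(sum(doc_2.get_tfidf(w) ** 2 for w in words))
    return dot / (norm_1 * norm_2)


a = StubDocument({"paper": 3.0, "check": 4.0})
b = StubDocument({"paper": 3.0, "check": 4.0})
c = StubDocument({"paper": 4.0, "system": 3.0})

print(cosine(a, b))  # same weights on the same words
print(cosine(a, c))  # only "paper" overlaps: 12 / (5 * 5)
```

Because a word missing from one text contributes 0 to the dot product but still counts toward the other text's norm, words that appear in only one document lower the score.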

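Performance Improvements to the Computation Module Interface

I profiled the program with PyCharm's built-in profiler (PyCharm is forever the GOAT), which generated the image below.

Zooming in shows the following.

The whole run takes a little over 4 seconds, which is honestly rather slow.

But looking at the individual parts more closely:

jieba accounts for most of the time.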
```
jieba.analyse.extract_tags(self.__text, topK=0, withWeight=True)
```
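This function segments the text (automatically dropping punctuation, stop words and so on) and returns the TF-IDF value of each word. At first I segmented the text and built the word vectors directly from word frequencies to compute the similarity; the results may not be as good as with TF-IDF, but it was clearly much faster. (Only after finishing did I learn that gensim exists, which should be faster than this.)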
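
The earlier word-frequency variant is not shown in the post, so the following is only a guess at its shape. It assumes the texts have already been split into token lists (for example by `jieba.lcut`), and the function name `frequency_similarity` is hypothetical.

```python
import math
from collections import Counter


def frequency_similarity(tokens_1, tokens_2):
    """Cosine similarity of raw word-count vectors (no IDF weighting)."""
    count_1, count_2 = Counter(tokens_1), Counter(tokens_2)
    if not count_1 or not count_2:
        raise ValueError("both token lists must be non-empty")
    # a Counter returns 0 for missing keys, so iterating one side covers the dot product
    dot = sum(n * count_2[word] for word, n in count_1.items())
    norm_1 = math.sqrt(sum(n * n for n in count_1.values()))
    norm_2 = math.sqrt(sum(n * n for n in count_2.values()))
    return dot / (norm_1 * norm_2)


print(frequency_similarity(["a", "a", "b"], ["a", "b", "c"]))
```

Counting words needs no corpus lookup per word, which is one plausible reason this variant felt faster; the trade-off is that common words weigh as much as distinctive ones.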

The computation itself, as the profile shows, takes very little time:
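
Outside PyCharm, a similar breakdown can be obtained with the standard library's cProfile module. The sketch below is an assumption about how one might do it, not what was used for the measurements above; `profile_call` and `busy_sum` are hypothetical names, and `busy_sum` merely stands in for the real workload.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # sort by cumulative time and keep the ten most expensive entries
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def busy_sum(n):
    # hypothetical workload standing in for the similarity computation
    return sum(i * i for i in range(n))


value, report = profile_call(busy_sum, 1000)
print(report)
```

Sorting by cumulative time puts callers such as the segmentation step near the top, which makes it easy to see where the seconds go.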

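Unit Tests for the Computation Module

I wrote a few tests by copying the pattern from other people's blogs; doing this for the first time left me thoroughly confused.

In total the tests cover the nine provided texts and two exception cases: one raises RuntimeError when the text is empty, and the other raises FileNotFoundError when the path does not exist.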

C:\Users\小明\venv\Scripts\python.exe "H:/Workspace/Software Engineering/paper_check_system/ut.py"
test add
0.8601768754747804
test over!

test del
..0.8985891244539704
test over!

test dis_1
.0.9808112866546317
test over!

test dis_10
.0.9233929083371283
test over!

test dis_15
0.7122391923479036
test over!

test dis_3
..0.9613218437864147
test over!

test dis_7
..0.9511418312250954
test over!

test mix
..0.9368555273793027
test over!

test rep
0.7963167032716033
test over!

test err_1
無法讀取 123
test over!

test err_2
文本無有效詞匯，請輸入包含有效詞匯的文本路徑
test over!


.
----------------------------------------------------------------------
Ran 11 tests in 6.066s

OK


Process finished with exit code 0

The generated coverage report shows that document.py is 100% covered.
Below is the unit test code:

import unittest
import document
import logging
import jieba

class MyClassTest(unittest.TestCase):
    def setUp(self) -> None:
        jieba.setLogLevel(logging.INFO)


    def tearDown(self) -> None:
        print("test over!")


    def test_add(self):
        print('test add')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_add.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)
    def test_del(self):
        print('test del')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_del.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)

    def test_dis_1(self):
        print('test dis_1')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_dis_1.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)

    def test_dis_3(self):
        print('test dis_3')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_dis_3.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)

    def test_dis_7(self):
        print('test dis_7')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_dis_7.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)
    def test_dis_10(self):
        print('test dis_10')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_dis_10.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)
    def test_dis_15(self):
        print('test dis_15')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_dis_15.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)
    def test_mix(self):
        print('test mix')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_mix.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)
    def test_rep(self):
        print('test rep')
        doc_1 = document.Document(r"sim_0.8/orig.txt")
        doc_2 = document.Document(r"sim_0.8/orig_0.8_rep.txt")
        ans = document.caculate_similarity(doc_1, doc_2)
        print(ans)
        self.assertGreaterEqual(ans,0)
        self.assertLessEqual(ans,1)

    def test_err_1(self):
        print('test err_1')
        self.assertRaises(FileNotFoundError, document.Document, '123')

    def test_err_2(self):
        print('test err_2')
        doc_1 = document.Document(r"1.txt")
        doc_2 = document.Document(r"2.txt")
        self.assertRaises(RuntimeError, document.caculate_similarity, doc_1, doc_2)


if __name__ == '__main__':
    unittest.main()
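
The nine range checks above differ only in the file name, so they could be folded into one parametrised test with `subTest`. This is a sketch of that idea rather than the code that produced the output shown earlier; `fake_similarity` is a hypothetical stand-in so the example runs without the sample files or jieba.

```python
import unittest


def fake_similarity(path):
    # hypothetical stand-in for building two Documents and comparing them
    return 0.5


class RangeTest(unittest.TestCase):
    VARIANTS = ["add", "del", "dis_1", "dis_3", "dis_7",
                "dis_10", "dis_15", "mix", "rep"]

    def test_similarity_in_range(self):
        for variant in self.VARIANTS:
            # each variant is reported separately if it fails
            with self.subTest(variant=variant):
                ans = fake_similarity("sim_0.8/orig_0.8_%s.txt" % variant)
                self.assertGreaterEqual(ans, 0)
                self.assertLessEqual(ans, 1)


def run_tests():
    """Run RangeTest and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RangeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

One consequence of this layout is that the run counts a single test method instead of nine, so the summary line would read differently from the one above.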

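Exception Handling in the Computation Module

The computation module only raises an exception for one case, an empty text, because that would cause a division by zero when computing the cosine similarity (I couldn't think of any others).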

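    keywords_1 = set(doc_1.get_words())
    keywords_2 = set(doc_2.get_words())
    if len(keywords_1) == 0 or len(keywords_2) == 0:
        # the message means "the text has no valid words; supply the path of a text that does"
        print("文本無有效詞匯，請輸入包含有效詞匯的文本路徑")
        # a bare `raise` has no active exception to re-raise, so raise RuntimeError explicitly
        raise RuntimeError("document contains no valid words")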
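
A slightly stricter guard would check the vector norms themselves, which also catches a text whose words all carry zero weight. The following is a sketch under that assumption; `EmptyDocumentError` and `safe_cosine` are hypothetical names, and the exception subclasses RuntimeError so that a test expecting RuntimeError would still pass.

```python
import math


class EmptyDocumentError(RuntimeError):
    """Hypothetical exception for a text with no usable words."""


def safe_cosine(weights_1, weights_2):
    """Cosine similarity of two word -> weight dicts, refusing zero vectors."""
    norm_1 = math.sqrt(sum(w * w for w in weights_1.values()))
    norm_2 = math.sqrt(sum(w * w for w in weights_2.values()))
    if norm_1 == 0 or norm_2 == 0:
        # covers both an empty dict and a dict whose weights are all zero
        raise EmptyDocumentError("at least one document has no weighted words")
    # words missing from the second dict contribute 0 to the dot product
    dot = sum(w * weights_2.get(word, 0) for word, w in weights_1.items())
    return dot / (norm_1 * norm_2)


try:
    safe_cosine({}, {"paper": 1.0})
except EmptyDocumentError as error:
    print("rejected:", error)
```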

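PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning	60	40
Estimate	Estimate how much time the task will take	30	30
Development	Development	180	240
Analysis	Requirements analysis (including learning new technologies)	480	600
Design Spec	Write the design document	60	40
Design Review	Design review	30	40
Coding Standard	Coding standard (set suitable conventions for the current development)	30	15
Design	Detailed design	120	120
Coding	Actual coding	180	240
Code Review	Code review	30	30
Test	Testing (self-testing, fixing code, committing changes)	120	240
Reporting	Reporting	20	20
Test Report	Test report	30	40
Size Measurement	Measure the workload	20	20
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	30	40
	Total	1420	2025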

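Summary

To be honest, learning the algorithm and writing the code did not take that much time; it was wrestling with git, unit testing, performance profiling and the like that took ages (it was my first time using all of them, so figuring them out took a long while). To get this assignment done I used search engines like crazy and read all kinds of blogs before slowly grinding out this result, ~~pure torment~~. Honestly, I just haven't learned enough, which is why a single assignment had me patching gaps everywhere (if only I had learned more back when I had nothing to do). It was tiring, but with the power of Baidu I not only picked up a lot of knowledge I had never touched before, I also greatly improved my ability to solve problems, so it was a pretty good experience overall. I hope I stop being a lazy dog from now on, get more proactive and learn more; it will always come in handy someday.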
