First Individual Programming Assignment

Reposted article; originally published 2021-09-06 14:51

https://github.com/Jimase/Software_Engineering/tree/main/031902515

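I. PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning	30	72
· Estimate	· Estimate how long the task will take	1200	1600
Development	Development	700	700
· Analysis	· Requirements analysis (including learning new technology)	5	55
· Design Spec	· Produce the design document	5	55
· Design Review	· Design review	5	55
· Coding Standard	· Coding standard (agree on suitable conventions for this work)	30	55
· Design	· Detailed design	120	120
· Coding	· Coding	360	720
· Code Review	· Code review	50	50
· Test	· Testing (self-test, fix code, commit changes)	30	50
Reporting	Reporting	90	180
· Test Report	· Test report	30	60
· Size Measurement	· Measure the workload	10	20
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	10	100
	· Total	1200	1800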

II. Computation Module Interface

1. Design and implementation of the computation module interface

1. Command-line arguments

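if __name__ == '__main__':
    # Expect exactly three arguments after the script name.
    if len(sys.argv) != 4:
        print("Invalid arguments: pass the sensitive-word file, the file to scan, and the result file, in that order")
        sys.exit(-1)
    args = sys.argv
    main(args[1], args[2], args[3])

After weighing the options I decided to do the assignment in Python. Given the assignment requirements, I implemented the command-line interface first.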
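
For comparison, the same three positional arguments could also be parsed with the standard-library argparse module, which generates the usage message and exits on a wrong argument count. This is only a sketch: the names parse_args, words_path, org_path and ans_path are made up for illustration and do not appear in the assignment code.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional file paths (argument names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Detect sensitive words in a text file.")
    parser.add_argument("words_path", help="file with one sensitive word per line")
    parser.add_argument("org_path", help="text file to scan")
    parser.add_argument("ans_path", help="file the results are written to")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["words.txt", "org.txt", "ans.txt"])
print(args.words_path, args.org_path, args.ans_path)
```

Calling parse_args() without an argument reads sys.argv, so it could stand in for the manual length check.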

2. Third-party libraries

import sys
from pypinyin import pinyin, Style
from hanzi_chaizi.hanzi_chaizi import HanziChaizi
from langconv import Converter
from Pinyin2Hanzi.Pinyin2Hanzi import DefaultHmmParams
from Pinyin2Hanzi.Pinyin2Hanzi import viterbi

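Looking back on 2021.10.8: I made a very serious mistake here. I should not have downloaded other people's components straight into my own project. I did not think the testing requirements through: the fact that everything runs locally does not mean the test team can run it. This caused the test team a great deal of inconvenience and trouble, which is still unresolved. In a research project, code that cannot be reproduced like this could amount to academic misconduct. I am leaving this note as a reminder to myself.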
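
One way to surface this kind of problem early is to check at startup that every required module can be found and to report the missing ones by name. The sketch below is an assumption about how that could look, not something the project did. The module names are the top-level names from the import list above, and missing_modules is a hypothetical helper.

```python
import importlib.util

# Top-level module names taken from the import list above.
REQUIRED_MODULES = ["pypinyin", "hanzi_chaizi", "langconv", "Pinyin2Hanzi"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
```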

3. Selected core helper functions (some implementation code omitted)

# Check whether every character is a Chinese character
def is_all_chinese(strs):
    for _char in strs:
        if not '\u4e00' <= _char <= '\u9fa5':
            return False
    return True

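def chs_to_cht(sentence):  # accepts a list of strings or a single string
    """
    Convert Simplified Chinese to Traditional Chinese.
    :param sentence: the items (or characters) to convert
    :return: list of converted items
    """
    sentence = ",".join(sentence)
    sentence = Converter('zh-hant').convert(sentence)
    return sentence.split(",")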

def get_permutation(cstr, pstr, deepth, ans, nowlist):
    # Enumerate every way to pick, per position, the character or its pinyin.
    if deepth == len(cstr):
        ans.append(nowlist)
        return
    # Branch 1: keep the original character at this position.
    get_permutation(cstr, pstr, deepth + 1, ans, nowlist + [cstr[deepth]])
    # Branch 2: substitute the pinyin at this position.
    get_permutation(cstr, pstr, deepth + 1, ans, nowlist + [pstr[deepth]])

def get_bushouword(bushoulist, deepth, ans, noword):
    if deepth == len(bushoulist):
        ans.append(noword)
        return
    for bushou in bushoulist[deepth]:
        sigleword = ""
        for item in bushou:
            sigleword += item
        get_bushouword(bushoulist, deepth + 1, ans, noword + sigleword)

def get_fantizuhe(array1, array2, deepth, ans, noword):
    if deepth == len(array1):
        ans.append(noword)
        return
    for i in range(2):
        if i == 0:
            get_fantizuhe(array1, array2, deepth + 1, ans, noword + array1[deepth])
        elif i == 1:
            get_fantizuhe(array1, array2, deepth + 1, ans, noword + array2[deepth])

def get_fpyzu(array1, array2, deepth, ans, noword):
    if deepth == len(array1):
        ans.append(noword)
        return
    for i in range(2):
        if i == 0:
            get_fpyzu(array1, array2, deepth + 1, ans, noword + array1[deepth])
        elif i == 1:
            get_fpyzu(array1, array2, deepth + 1, ans, noword + array2[deepth])

def get_xieyinzuhe(array, deepth, ans, noword):
    if deepth == len(array):
        ans.append(noword)
        return
    for i in range(len(array[deepth])):
        get_xieyinzuhe(array, deepth + 1, ans, noword + array[deepth][i])
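
The recursive helpers above (get_permutation, get_fantizuhe, get_fpyzu) all enumerate the same thing: for each position, choose one of two alternatives. As a sketch, the same enumeration can be written with itertools.product. The name mixed_variants is made up for illustration.

```python
from itertools import product


def mixed_variants(cstr, pstr):
    """Every string built by choosing, per position, cstr[i] or pstr[i]."""
    return ["".join(choice) for choice in product(*zip(cstr, pstr))]


print(mixed_variants("你好", ["ni", "hao"]))
# ['你好', '你hao', 'ni好', 'nihao']
```

The number of variants doubles with every character, so a keyword of n characters produces 2**n strings.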

4. DFA algorithm

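# DFA algorithm
class DFAFilter(object):
    # The constructor takes the keyword file path and the result file path
    def __init__(self, senstive_path, result_path):
        # Keyword dictionary (nested dicts forming a trie)
        self.keyword_chains = {}
        # Terminal marker for the end of a keyword
        self.delimit = '\x00'
        self.parse(senstive_path)
        self.result_path = result_path
        self.total = 0
        # Write results as UTF-8 so Chinese output does not depend on the OS default
        self.rp = open(self.result_path, "w", encoding="utf-8")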

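    # Insert a keyword variant into the dictionary; rawkeyword is the original word to report
    def add(self, keyword, rawkeyword):
        # Lower-case any English letters
        chars = keyword.lower()
        if not chars: return
        level = self.keyword_chains
        # Walk through each character of the keyword
        for i in range(len(chars)):
            # If the character is already a key at this level, descend into its sub-dictionary
            if chars[i] in level:
                level = level[chars[i]]
            else:
                if not isinstance(level, dict):
                    break
                # Create the rest of the chain and mark its end
                for j in range(i, len(chars)):
                    level[chars[j]] = {}
                    last_level, last_char = level, chars[j]
                    level = level[chars[j]]
                last_level[last_char] = {self.delimit: rawkeyword}
                break
        else:
            # Every character already existed (the keyword is a prefix of an
            # earlier one), so mark the end here; otherwise it would never match
            level.setdefault(self.delimit, rawkeyword)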

    # Build the keyword dictionary from the keyword file
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f.readlines():
                ckeyword = keyword.strip()
                if is_all_chinese(ckeyword):
                    # Build pinyin variants of the keyword
                    x = pinyin(ckeyword, style=Style.NORMAL)
                    pkeyword = [item[0] for item in x]
                    ans = []
                    get_permutation(ckeyword, pkeyword, 0, ans, [])
                    for ansitem in ans:
                        tkeyword = ""
                        for item in ansitem:
                            tkeyword += item
                        self.add(tkeyword, ckeyword)

                    # Build first-letter pinyin variants
                    py = pinyin(ckeyword, style=Style.FIRST_LETTER)
                    fpy = ""
                    for item in py:
                        fpy += item[0]
                    fpzu = []
                    get_fpyzu(ckeyword, fpy, 0, fpzu, "")
                    for item in fpzu:
                        # print(item)
                        self.add(item, ckeyword)

                    # Build homophone variants
                    # First get the pinyin of each character
                    py = pinyin(ckeyword, style=Style.NORMAL)
                    hmmparams = DefaultHmmParams()
                    xieyin = []
                    for item in py:
                        result = viterbi(hmm_params=hmmparams, observations=(item))
                        xieyin.append([item2.path[0] for item2 in result])
                    xieyinzuhe = []
                    get_xieyinzuhe(xieyin, 0, xieyinzuhe, "")
                    for item in xieyinzuhe:
                        print(item)
                        self.add(item, ckeyword)

                    # Build radical (component) variants
                    hc = HanziChaizi()
                    bushou = []
                    for item in ckeyword:
                        ans = hc.query(item)
                        bushou.append(ans)
                    ans = []
                    get_bushouword(bushou, 0, ans, "")
                    for bushouword in ans:
                        self.add(bushouword, ckeyword)

                    # Build Traditional Chinese variants
                    fanti = chs_to_cht(ckeyword)
                    fjti = []
                    get_fantizuhe(fanti, ckeyword, 0, fjti, "")
                    for item in fjti:
                        self.add(item, ckeyword)
                else:
                    self.add(ckeyword, ckeyword)
            print(self.keyword_chains)

    # Scan message and report the sensitive words found via the keyword dictionary
    def filter(self, message, linenumber):
        rawmessage = message
        message = message.lower()
        start = 0
        while start < len(message):
            level = self.keyword_chains
            # Skip characters that cannot start any keyword
            if message[start] not in level:
                start += 1
                continue
            if is_all_chinese(message[start]): mode = "c"
            else: mode = "e"
            step_ins = 0
            sensitive_word = ""
            left, right = start, 0
            ok = False
            for char in message[start:]:
                if char.isdigit():
                    step_ins += 1
                    continue
                if char not in level and mode == "c" and char.encode("utf-8").isalpha():
                    step_ins += 1
                    continue
                # Special characters: anything that is neither Chinese, an English letter, nor a digit
                if not is_all_chinese(char) and not char.encode("utf-8").isalpha()\
                    and not char.isdigit():
                    step_ins += 1
                    continue
                # The character continues a chain in the keyword dictionary
                if char in level:
                    # sensitive_word += char
                    step_ins += 1
                    # No terminal marker under this character yet: descend to the next level
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        start += step_ins - 1
                        right = start
                        ok = True
                        sensitive_word = level[char][self.delimit]
                        break
                # The character breaks the chain
                else: break
            if ok:
                anstr = "Line{}: <{}> {}\n".format(linenumber, sensitive_word, rawmessage[left: right + 1])
                print(anstr, end="")
                self.rp.write(anstr)
                self.total += 1
            start += 1

    def __del__(self):
        self.rp.write(str(self.total))
        self.rp.close()
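
The class above writes the total and closes the file in __del__, and Python does not guarantee when (or whether) __del__ runs. A context manager makes the moment of writing explicit. The following is a minimal sketch of that alternative and not the code used in the assignment. ResultWriter is a hypothetical name, and it keeps the same line format and the same "total on the last line" layout as the class above.

```python
class ResultWriter:
    """Collects hit lines and writes them, followed by the total, on exit."""

    def __init__(self, path):
        self.path = path
        self.lines = []

    def add(self, line_number, keyword, matched_text):
        # Same output format as DFAFilter.filter uses.
        self.lines.append(
            "Line{}: <{}> {}\n".format(line_number, keyword, matched_text))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Everything is written in one place, at a well-defined time.
        with open(self.path, "w", encoding="utf-8") as f:
            f.writelines(self.lines)
            f.write(str(len(self.lines)))
        return False  # do not swallow exceptions
```

Usage would look like `with ResultWriter("ans.txt") as out: out.add(1, "你好", "ni好")`.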

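2. Performance improvements to the computation module interface

1. Main improvements and time spent:

Switched from a single flat word store to a trie implementation (120 min)

Replaced brute-force comparison with a lightly optimized DFA

[Figure: performance analysis]

With my limited ability, I read up on the topic; there are only about 2.5 mainstream approaches:

DFA: Deterministic Finite Automaton. At its core it builds a set of keyword trees (a trie) from the sensitive words. Given how messy the text can be, this is the approach I finally adopted. It shifts the difficulty of searching onto building the trie.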
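
To make the idea concrete, here is a stripped-down sketch of a trie built from nested dicts, where each terminal node stores the original keyword so that a variant such as "nihao" is reported as the word it stands for. It leaves out the skipping of inserted characters that the real filter method does. The names build_trie and find_all are made up for illustration.

```python
END = "\x00"  # terminal marker, the same sentinel the DFAFilter class uses


def build_trie(variant_to_raw):
    """Build a nested-dict trie; each terminal stores the raw keyword."""
    root = {}
    for variant, raw in variant_to_raw.items():
        node = root
        for ch in variant.lower():
            node = node.setdefault(ch, {})
        node[END] = raw
    return root


def find_all(trie, text):
    """Return (raw_keyword, matched_text) pairs, taking the shortest match."""
    hits = []
    # Lower-case per character so indexes still line up with the original text.
    lowered = [c.lower() for c in text]
    start = 0
    while start < len(lowered):
        node = trie
        end = start
        matched = None
        while end < len(lowered) and lowered[end] in node:
            node = node[lowered[end]]
            end += 1
            if END in node:
                matched = (node[END], text[start:end])
                break
        if matched:
            hits.append(matched)
            start = end
        else:
            start += 1
    return hits


trie = build_trie({"你好": "你好", "nihao": "你好"})
print(find_all(trie, "Say NiHao, 你好!"))
# [('你好', 'NiHao'), ('你好', '你好')]
```

Lookup cost depends on the length of the text and of the match, not on how many variants were inserted, which is why the work moves to building the trie.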

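DFA: how does the algorithm work, intuitively?

When we start to consider the transitions leaving a state, we set that state to "marked". The start state of the DFA is firstpos(root), where root is the root node of the abstract syntax tree, and the accepting states of the DFA are those that contain the position of #. (The pseudocode for this construction is not reproduced here.)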

AC automaton (Aho-Corasick): built on top of the trie, adding a KMP-style variant with failure pointers
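
For reference, a compact sketch of the Aho-Corasick idea: build the trie, then add failure links in breadth-first order so the scan never has to restart from the beginning of the text. This approach was not used in the assignment, and the names build_automaton and search are made up for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build goto/fail/output tables for the given words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: fill in failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, word) for every occurrence, in one pass."""
    goto, fail, out = automaton
    hits = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


print(search("ushers", build_automaton(["he", "she", "his", "hers"])))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```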

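Natural language processing: my blind guess is that this is the problem setter's research topic, and it is also a fairly reasonable approach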

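3. Unit tests for the computation module

1. White-box testing

The test team did not say what types the 15 hidden test points are, so I classified them as follows. To avoid being filtered, placeholder words are used as examples:
- Plain sensitive word: line 250: <你好> 你好 (√)
- Plain pinyin: line 250: <你好> nihao (√)
- Homophone: line 250: <你好> 泥豪 (√)
- Traditional Chinese: line 111: <蘇> 蘇 (√)
- Inserted English letters: line 250: <你好> 你abc好 (√)
- Inserted invalid characters: line 250: <你好> 你@@@###好 (√)
- Nested combinations: "你abc_++++好", "ni好", "N好", "你_郝"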

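# Assumes DFAFilter from the DFA listing above is available in this module.
import os
import tempfile
import unittest

class FunctionTest(unittest.TestCase):
    def check(self, word, org):
        # Helper added as an assumption: the original called an undefined
        # self.text()/self.test(). It expects exactly one hit for word in org.
        with tempfile.TemporaryDirectory() as tmp:
            words_path = os.path.join(tmp, "words.txt")
            with open(words_path, "w", encoding="utf-8") as f:
                f.write(word + "\n")
            dfa = DFAFilter(words_path, os.path.join(tmp, "ans.txt"))
            dfa.filter(org, 1)
            self.assertEqual(dfa.total, 1)
            del dfa  # lets __del__ close the result file before cleanup (CPython)

    def test_seperate(self):  # left-right radical decomposition
        self.check("你好", "亻爾女子")

    def test_pinyin(self):  # full pinyin and first-letter pinyin, mixed case
        for org in ["nihao", "Nhao"]:
            self.check("你好", org)

    def test_insert(self):  # inserted letters
        self.check("你好", "你ABC好")

    def test_homophones(self):  # inserted special characters
        self.check("你好", "你！@#￥%……*（））————好")

    def test_english(self):  # case changes plus inserted digits and symbols
        # The expected matched span is "Hel8!lo"
        self.check("hello", "Hel8!lo13")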

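2. Sample test (the provided sample covers all potential sensitive words):

The ans.txt document has 504 lines in total. (The test team released three versions and stated explicitly that problems remained, but, to test everyone's ability to process data by hand, it would not be updated further.) We assume ans.txt has 504 lines.

sensitive_dec0: 399 lines

sensitive_dec2: 470 lines, output still flawed

sensitive_dec3: 475 lines, still short of what ans.txt requires. Of course, I do not know what ans.txt actually looks like, only that it has about 500 lines.

sensitive_dec4.7 (2021.9.14): basic profiling with cProfile gave some odd results:

1. Why are there 516 result lines, when the reference ans.txt has only 504?
2. Why does it take only 1.462 seconds? (It should not be this fast. No false modesty; it really should not.)
3. Why were 437103 function calls made? Where did I write that many?

Answers to the above:
1. Mine may not be right, but the ans.txt from the test team clearly has problems: the so-called ~~white-box testing~~ blind-box testing (I'm sure of it).
2. No idea. Spooky.
3. Sorting the cProfile output by time and looking at the top functions, most of them are library initialization functions, although a large share is also my own detection code. Only the top entries by time are shown (three extra cases were added for testing, hence 519).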
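
A listing sorted by time can be produced with the standard-library cProfile and pstats modules alone. The sketch below profiles a stand-in function. In the assignment the profiled call would be main(...) with the three file paths. profile_call and busy are hypothetical names.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return the top entries by own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


def busy(n):
    # Stand-in workload for the example.
    return sum(i * i for i in range(n))


print(profile_call(busy, 10000))
```

The first line of the report gives the total number of function calls. That count includes every call made inside imported libraries, which is how it can grow far beyond the number of functions written by hand.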

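The most expensive part of the program, in both time and memory, is without doubt the DFA implementation:

A lot of time went into initializing the pinyin and radical libraries, and within my abilities there was nothing I could do about it.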
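
One thing that might reduce this cost: in the parse listing, HanziChaizi() and DefaultHmmParams() are constructed again for every keyword line. Building such objects once and reusing them is a common remedy, though whether it is the dominant cost here has not been measured. The sketch uses a stand-in class, SlowResource, instead of the real libraries. get_resource is a hypothetical name.

```python
from functools import lru_cache


class SlowResource:
    """Stand-in for an object that is expensive to construct."""
    instances = 0

    def __init__(self):
        SlowResource.instances += 1


@lru_cache(maxsize=None)
def get_resource():
    """Build the resource on first use and reuse it afterwards."""
    return SlowResource()


for _ in range(1000):
    get_resource()
print(SlowResource.instances)  # 1
```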

I finished the final changes on the last day, exhausted. sensitive_dec7.0 outputs 516 lines, which should be right.

4. Exception handling in the computation module

1. Command-line error handling:

if len(sys.argv) != 4:
    print("Invalid arguments: pass the sensitive-word file, the file to scan, and the result file, in that order")
    sys.exit(-1)
args = sys.argv
main(args[1], args[2], args[3])
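
The check above only covers the argument count. A possible extension, sketched here as an assumption and not part of the submitted code, is to verify the paths themselves before any processing starts. check_paths and exit_on_problems are hypothetical helpers.

```python
import os
import sys


def check_paths(words_path, org_path, ans_path):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for label, path in (("sensitive-word file", words_path),
                        ("file to scan", org_path)):
        if not os.path.isfile(path):
            problems.append("{} not found: {}".format(label, path))
    # The result file need not exist yet, but its directory must.
    ans_dir = os.path.dirname(os.path.abspath(ans_path))
    if not os.path.isdir(ans_dir):
        problems.append("result directory does not exist: {}".format(ans_dir))
    return problems


def exit_on_problems(problems):
    """Print each problem to stderr and exit with -1 if there are any."""
    if problems:
        for problem in problems:
            print(problem, file=sys.stderr)
        sys.exit(-1)
```

It would be called as `exit_on_problems(check_paths(args[1], args[2], args[3]))` right before main.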

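2. While extending the dictionary, the extra dictionary entries have to respect the required output format

# Scan message and report the sensitive words found via the keyword dictionary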
def filter(self, message, linenumber):
    rawmessage = message
    message = message.lower()
    start = 0
    while start < len(message):
        level = self.keyword_chains
        # Skip characters that cannot start any keyword
        if message[start] not in level:
            start += 1
            continue
        if is_all_chinese(message[start]): mode = "c"
        else: mode = "e"
        step_ins = 0
        sensitive_word = ""
        left, right = start, 0
        ok = False
        for char in message[start:]:
            if char.isdigit():
                step_ins += 1
                continue
            if char not in level and mode == "c" and char.encode("utf-8").isalpha():
                step_ins += 1
                continue
            # Special characters: anything that is neither Chinese, an English letter, nor a digit
            if not is_all_chinese(char) and not char.encode("utf-8").isalpha()\
                and not char.isdigit():
                step_ins += 1
                continue
            # The character continues a chain in the keyword dictionary
            if char in level:
                # sensitive_word += char
                step_ins += 1
                # No terminal marker under this character yet: descend to the next level
                if self.delimit not in level[char]:
                    level = level[char]
                else:
                    start += step_ins - 1
                    right = start
                    ok = True
                    sensitive_word = level[char][self.delimit]
                    break
            # The character breaks the chain
            else: break
        if ok:
            anstr = "Line{}: <{}> {}\n".format(linenumber, sensitive_word, rawmessage[left: right + 1])
            print(anstr, end="")
            self.rp.write(anstr)
            self.total += 1
        start += 1

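[Figure: part of the trie]

III. Reflections

1. Some notes:

Strip irrelevant characters from the input text, including punctuation, spaces, and assorted odd characters
Convert the input text to pinyin? Use the AC automaton algorithm? (Radical handling is put aside for now)
Ready-made libraries exist for both radicals and pinyin, but they work differently.
The core difficulty is reconstructing the original text, which is genuinely hard. Given the time limit, it has to be handled dynamically
DFA?
Updating words.txt and ans.txt was a terrible experience: they were changed back and forth and still did not agree. Why could this not be done properly from the start?

2. Summary (written after calming down a bit)

I crammed Python for this (I could hardly write it in C++!). My learning was not very systematic, but it is clear that Python is far more convenient than C++ in many respects.
It took me a long time to learn the git commands before I managed to push to GitHub.
Generating a performance graph from cProfile needs a separate visualization tool, which took a long time to set up. I should modularize more, and unit testing should be designed in from the start.
The algorithms were really hard to understand and took a lot of effort. I should spend more of my regular time on meaningful study; today's mainstream AI and machine-learning graduate programs demand strong algorithm skills.
I have nothing but gratitude in my heart.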
