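Modern Software Engineering individual assignment: word frequency statistics (character count, line count, word count, high-frequency words and phrases)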

Reposted article, originally published 2018-03-29 15:50. Tags: C++ implementation / vector / map / word frequency statistics

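I did quite poorly on the first individual assignment of the Modern Software Engineering course, and it made me clearly aware of the gap between me and my classmates.

In this post I will show how I ended up putting in twice the effort for half the result, and analyze what went wrong, in the hope that you will not repeat my mistakes.

First, the assignment itself. The detailed requirements are on instructor Deng Hongping's blog: First Individual Assignment: Word Frequency Statistics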

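The main features of this word frequency program are:

1. Count the characters in a file (only ASCII needs to be counted; Chinese characters, newlines and '\0' can be ignored; only ASCII codes in [32, 126] count).

2. Count the total number of words in the file.

3. Count the total number of lines in the file (a line made of any characters counts; do not just count newline characters, and watch out for a last line with no trailing newline; an empty line counts as a line).

4. Count the occurrences of each word in the file.

5. Run the statistics over all files in a given folder and, recursively, its subfolders.

6. Count how often two words appear together (a phrase) and output the 10 most frequent.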

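Notes:

a) Spaces, horizontal tabs and newlines all count as characters.

b) Definition of a word: it starts with at least 4 English letters, followed by alphanumeric characters. Words are delimited by separators and are case-insensitive.

English letters: A-Z, a-z

Alphanumeric characters: A-Z, a-z, 0-9

Separators: spaces and non-alphanumeric characters

For example, "file123" is a word and "123file" is not. file, File and FILE are the same word.

If two words differ only in their trailing digits, they are treated as the same word. For example, windows, windows95 and windows7 are the same word, and iPhone4 and IPhone5 are the same word, but windows and windows32a are different words, because they do not differ only in trailing digits. Output follows lexicographic order: for example, when windows95, windows98 and windows2000 all appear, output windows2000. Only word lengths in [4, 1024] need to be considered; anything outside this range is not counted.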

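c) Definition of a phrase: windows95 good and windows2000 good123 count as the same phrase, and it is output in lexicographic order.

Three equivalent words in a row, such as good123 good456 good789, are all the same word as good123, so the final result is that the phrase good123 good123 occurs 2 times.

Two words on separate lines can still form a phrase. Phrase counting only looks at whether the words are adjacent in sequence.

d) The input file name is passed as a command-line argument. To traverse a whole folder, pass the folder path instead.

e) Output file: result.txt

characters: number

words: number

lines: number

<word>: number

<word> is the word in a capitalization that actually appears in the file. For example, if only File and file appear in the file, the program must not output FILE. <word> is chosen by lexicographic (ASCII-based) order, so in this example the program should output File: 2

f) Decide from the command-line argument whether the path is a directory.

g) Count the words across all files and output a single overall word frequency result.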
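The word rules in b) can be sketched as two small helpers. This is only an illustration under my own assumptions, not the assignment's reference code: `isWord` and `wordKey` are hypothetical names, and the token passed in is assumed to already be a maximal run of ASCII letters and digits.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: does a letter/digit token satisfy the word rule?
// Rule: at least 4 leading letters, total length within [4, 1024].
bool isWord(const std::string& token) {
    if (token.size() < 4 || token.size() > 1024) return false;
    for (std::size_t i = 0; i < 4; ++i) {
        if (!std::isalpha(static_cast<unsigned char>(token[i]))) return false;
    }
    return true;
}

// Hypothetical helper: canonical key shared by every spelling of one word.
// Trailing digits are dropped and the remainder is lowercased.
std::string wordKey(const std::string& word) {
    std::size_t end = word.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1]))) {
        --end;
    }
    std::string key(word, 0, end);
    for (char& c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return key;
}
```

With such a key, windows, windows95 and WINDOWS7 all map to "windows", while windows32a keeps its own key because its digits are not at the end.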

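Grading criteria

1. Count the characters in a file (1 point)

2. Count the total number of words in a file (1 point)

3. Count the total number of lines in a file (1 point)

4. Count the occurrences of each word in a file (1 point)

5. Run the statistics over all files in a given folder and its recursive subfolders (2 points)

6. Count how often two words appear together (a phrase) and output the 10 most frequent (2 points)

If any of the six outputs above is wrong, the corresponding subtask scores -1. If all outputs are correct, ranking is decided by run time (sorted ascending: the fastest 30% get the full 8 points, the 30%-70% band gets 7.5 points, and the slowest 30% get 7 points).

7. Blog write-up (implementation process, performance analysis, optimization report, etc.) (2 points)

8. Performance analysis on Linux, with the process written up in the blog (bonus, 2 points)

Time allowed: within one week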

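Requirements analysis:

1. Counting characters and lines is easy to implement.

2. Counting total words: the assignment defines a word as 4 English letters followed by zero or more letters or digits, with a length in [4, 1024]. Matching strings of a given format is usually done with a regular expression and an iterator.

3. Counting the occurrences of each word and outputting the 10 most frequent: words are stored in a container. Each time a word is read, look up whether it already exists; if it does, increment its frequency, otherwise add it to the container. How to decide whether two words are equal is a key point. Once all words are collected, sort them by frequency and output the top 10.

4. Outputting the 10 most frequent phrases: two adjacent words form a phrase, which also needs duplicate detection and sorting by frequency.

5. Running the statistics over all files in a given folder and its recursive subfolders: determine whether the path is a directory or a file; for a directory, obtain the names of the files in it and then process each file.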
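Item 1 above can be sketched as a single pass over the file content. This is a minimal sketch with assumptions of my own: `Counts` and `countCharsAndLines` are hypothetical names, only bytes in [32, 126] count as characters (as in requirement 1), and a trailing newline closes the last line instead of starting a new one. The getLineNum function in the listings further down counts one more line in that last case.

```cpp
#include <string>

// Hypothetical result type for the sketch.
struct Counts {
    unsigned long chars = 0;
    unsigned long lines = 0;
};

// Count printable ASCII characters and lines in one pass over the text.
Counts countCharsAndLines(const std::string& text) {
    Counts c;
    bool lineOpen = false;  // true while the current line has content but no '\n' yet
    for (char ch : text) {
        if (ch >= 32 && ch <= 126) ++c.chars;
        if (ch == '\n') {
            ++c.lines;       // a newline closes the current line, even an empty one
            lineOpen = false;
        } else {
            lineOpen = true;
        }
    }
    if (lineOpen) ++c.lines;  // last line without a trailing newline
    return c;
}
```

Reading the file once and counting both values in the same loop also avoids opening the file a second time just for the line count.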

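Then I took my first wrong step: choosing a programming language I was not familiar with

Generally speaking, to finish a complex and difficult project in a short time, you should choose a language you know well.

I am fairly comfortable with C but had never studied C++ systematically, and I hoped to pick up C++ through the practical work in this course. So I naively spent an entire day reading C++ Primer (╯°Д°）╯ and impatiently started coding the next day. Many of you know how dense that classic is, so you can imagine how much I absorbed in one day. That time should really have gone into analyzing the requirements further, searching for targeted solutions, designing several candidate approaches with both functionality and performance in mind, and picking a reasonable one after comparison and testing. I still recommend writing this assignment in C++. It is just that, with such a tight schedule, more time should go to requirements analysis and up-front design, and new C++ knowledge can be looked up as needed.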

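Then I went wrong again: choosing an unsuitable data structure

The data structure is the most important decision and deserves great care

Without first finding out how much data there would be, I chose vector as the container, because the STL algorithms find and sort can be used on it. This kind of laziness is a very bad idea.

Design should put the requirements first, not your current programming skill or the amount of work.

I did in fact consider map, so why didn't I choose it? My initial idea was to define one class for words and one for phrases, store the word and its frequency in the class, and overload the == operator to test equality. In addition, when an equal word sorts lexicographically before the stored one, the stored one is replaced. With a map that uses the word as key and the frequency as value, this does not work directly, because a map does not allow a key to be modified: it keeps its elements ordered by key, and the key is the basis of that ordering. So I simply gave up on it.

In fact, you can build two maps that share the word's abbreviated form as the key: the value in one map is the full form of the word, and the value in the other is its frequency. (This was the teaching assistant's idea.)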
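The two-map idea might look roughly like the sketch below. I am assuming that the "abbreviated form" means the word lowercased with its trailing digits removed; `wordKey` and `WordTable` are hypothetical names, and this is not the teaching assistant's actual code.

```cpp
#include <cctype>
#include <map>
#include <string>

// Assumed key: lowercase spelling with trailing digits removed.
std::string wordKey(const std::string& word) {
    std::size_t end = word.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1]))) --end;
    std::string key(word, 0, end);
    for (char& c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return key;
}

// Two maps sharing one key: the key itself is never modified,
// only the stored spelling and the counter are.
struct WordTable {
    std::map<std::string, std::string> display;  // key -> spelling to print
    std::map<std::string, unsigned> freq;        // key -> number of occurrences

    void add(const std::string& word) {
        const std::string key = wordKey(word);
        ++freq[key];
        auto it = display.find(key);
        if (it == display.end()) {
            display.emplace(key, word);
        } else if (word < it->second) {
            it->second = word;  // keep the lexicographically smallest spelling
        }
    }
};
```

Adding windows95, Windows2000 and windows98 leaves one entry with frequency 3 whose stored spelling is Windows2000, since uppercase letters sort before lowercase ones in ASCII.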

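In addition, C++11 provides unordered_map, which is built on a hash table: lookup is O(1) on average, and it does not keep its elements sorted by key, although it tends to use more memory. To sort either kind of map by value, you generally copy its data as pairs into a sequence container (such as the vector I chose) and then call std::sort. std::sort runs in O(n log n); implementations typically use a quicksort-based hybrid (introsort).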
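As a sketch of that last step, the pairs can be copied out of an unordered_map and only the first n positions sorted. `topN` and `Entry` are hypothetical names, and breaking frequency ties by key is my own assumption, added so that the output is deterministic.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, unsigned>;

// Return the n most frequent entries, highest frequency first.
std::vector<Entry> topN(const std::unordered_map<std::string, unsigned>& freq,
                        std::size_t n) {
    std::vector<Entry> entries(freq.begin(), freq.end());  // one O(n) copy
    n = std::min(n, entries.size());
    // partial_sort only orders the first n positions, which is enough for a top-10 list
    std::partial_sort(entries.begin(), entries.begin() + n, entries.end(),
                      [](const Entry& a, const Entry& b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;  // assumed tie-break: by key
                      });
    entries.resize(n);
    return entries;
}
```

Since only 10 results are needed, std::partial_sort avoids fully sorting millions of entries, but a plain std::sort over the copied vector works as well.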

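If I had another chance, what would I pick? The answer is map. Moving the map's data into a vector does take time, but it is done only once, so it costs O(n). With a vector, find is a linear search, and every incoming word has to be checked for duplicates, so the total cost is already O(n*n). Later I wrote my own hash lookup, but combining it with on-demand memory allocation gets very complicated (I ran out of time and did not implement that (ಥ_ಥ)). All in all, map or unordered_map is the more reasonable choice.

Now for the code.

First, version one, which uses the STL algorithms find and sort on vector.

Notes: 1. This version may still contain some small bugs and some clumsy code; corrections are welcome. 05.txt is a text file used for testing.

2. The entire content of a file is read into one string, which is used to count characters and to collect new words and phrases.

3. New words are extracted from str by regular expression matching; see the getNewExpr function.

4. Before using sort you need to define a comparison function or overload the > and < operators; I chose the former.

5. The getAllFiles function determines whether the path is a directory or a file; for a directory, it collects all file names into a vector of string.

6. When reading the source, I suggest reading the class definitions first and then following the flow of execution from main.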

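#include<iostream>
#include<fstream>
#include<string>
#include<sstream>
#include<vector>
#include<algorithm>
#include<cctype>
#include<cstdlib>    // abs, system
#include<cstring>    // strcmp
#include<stdexcept>  // runtime_error
#include<regex>
#include<io.h>       // _findfirst, _findnext (Windows only)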

using namespace std;
int begFlag = 1;

typedef struct {
    unsigned int charNum;
    unsigned int lineNum;
    unsigned int wordNum;
}amount;

class word {

private:
    string wordStr;
    unsigned int freq;
public:
    word() = default;
    word(string str) {
        wordStr = str;
        freq = 1;
    }

    string getWordStr() {
        return wordStr;
    }

    unsigned int getFreq() {
        return freq;
    }

    void addFreq() {
        freq++;
    }

    void resetWordStr(string str) {
        if (str < wordStr) {
            wordStr = str;
        }
    }

    // two words are equal if they match case-insensitively once trailing digits are ignored
    bool operator == (const word &obj) const {
        const string &word1 = this->wordStr, &word2 = obj.wordStr;
        int i = (int)word1.length() - 1;
        int j = (int)word2.length() - 1;
        // skip the trailing digits of both words
        while (i >= 0 && isdigit((unsigned char)word1[i])) i--;
        while (j >= 0 && isdigit((unsigned char)word2[j])) j--;
        if (i != j) return false;
        for (int t = 0; t <= i; t++) {
            // tolower avoids the false matches of a "differs by 32" test (e.g. '0' vs 'P')
            if (tolower((unsigned char)word1[t]) != tolower((unsigned char)word2[t]))
                return false;
        }
        return true;
    }

    void printWord(ofstream &output) {
        output << wordStr << "\t" << freq << endl;
    }
};


class phrase {

private:
    //string phrStr;
    unsigned int freq;
    word part1, part2;

public:
    //lack a default constructor

    phrase(word part1, word part2) {
        this->part1 = part1;
        this->part2 = part2;
        //phrStr = str;
        freq = 1;
    }
    /*
    string getPhrStr() {
        return phrStr;
    }
    */
    word getPart1() {
        return part1;
    }

    word getPart2() {
        return part2;
    }

    unsigned int getFreq() {
        return freq;
    }

    void addFreq() {
        freq++;
    }

    void resetPhrase(phrase &obj) {

        //string objStr = obj.getPhrStr();

        word objPart1 = obj.getPart1();   
        word objPart2 = obj.getPart2();  
        // '||' is a short circuit operator
        if (objPart1.getWordStr() < this->part1.getWordStr() || objPart2.getWordStr() < this->part2.getWordStr()) {
            this->part1 = objPart1;
            this->part2 = objPart2;
        }
    }

    bool operator == (const phrase &obj) const {

        word objPart1 = obj.part1, objPart2 = obj.part2;
        return (part1 == objPart1 && part2 == objPart2);
    }

    void printPhrase(ofstream &output) {
        string word1 = part1.getWordStr(), word2 = part2.getWordStr();
        word1 < word2 ?
            output << word1 + " " + word2 << "\t" << freq << endl :
            output << word2 + " " + word1 << "\t" << freq << endl;
    }
};


bool wordCompare(word former, word latter) {
    return former.getFreq() > latter.getFreq();
}

bool phraseCompare(phrase former, phrase latter) {
    return former.getFreq() > latter.getFreq();
}



void examineNewWord(vector<word> &wvec, word &newWord) {

    vector<word>::iterator beg = wvec.begin(), end = wvec.end(), itr;
    itr = find(beg, end, newWord);    //is there any repition?

    if (itr != end) {                 // this word already exists in wvec
        itr->resetWordStr(newWord.getWordStr());
        itr->addFreq();
    }
    else {
        wvec.push_back(newWord);
    }
}

void examineNewPhr(vector<phrase> &pvec, phrase &newPhrase) {

    vector<phrase>::iterator beg = pvec.begin(), end = pvec.end(), itr;
    itr = find(beg, end, newPhrase);   ////is there any repition?

    if (itr != end) {
        itr->resetPhrase(newPhrase);
        itr->addFreq();
    }
    else {
        pvec.push_back(newPhrase);
    }
}


/* collect all expressions that match the definition of word in the parameter string */
void getNewExpr(string &str, vector<word> &wvec, vector<phrase> &pvec, amount &result) {

    word newWord;
    string wordPattern("[[:alpha:]]{4}[[:alnum:]]{0,1020}");
    regex reg(wordPattern);

    //intermediate variables in generating a new phrase
    
    //string::size_type pos1, pos2;
    string newPhrStr = "\0";
    word part1("\0"), part2("\0");
    phrase newPhrase( part1, part2);
    /*
      collect a word in advance, then pair each later word with the one
      before it to form a phrase
    */
    // local flag: the global begFlag stays 0 after the first file, so later
    // files would pair their first word with the empty placeholder part1
    bool firstWord = true;
    for (sregex_iterator it(str.begin(), str.end(), reg), end_it;
        it != end_it; it++) {

        result.wordNum++;
        newWord = word(it->str());
        examineNewWord(wvec, newWord);

        if (firstWord) {
            firstWord = false;
            part1 = newWord;
        }
        else {
            part2 = newWord;
            newPhrase = phrase(part1, part2);
            examineNewPhr(pvec, newPhrase);
            part1 = part2;
        }
    }
}

/* calculate the amount of characters with ASCII code within [32,126]*/
unsigned long getCharNum(string &str) {

    unsigned long charNum = 0;
    string::iterator end = str.end(), citr;
    for (citr = str.begin(); citr != end; citr++) {
        if (*citr >= 32 && *citr <= 126)
            charNum++;
    }
    return charNum;
}


/* calculate the number of lines in one file */
unsigned long getLineNum(string filename) {

    ifstream input(filename);
    unsigned long lines = 0;
    string str;
    while (!input.eof()) {

        getline(input, str);
        lines++;
    }

    return lines;
}



/*
  process one file, update the amount of characters and the amount of lines, 
  collect all expressions that match the word definition into wvec.
*/
void fileProcess(string filename, amount &result, vector<word> &wvec, vector<phrase> &pvec) {

    ifstream input;
    stringstream buffer;
    string srcStr;

    // report the failure directly; an eof() check right after open() can never fire
    input.open(filename);
    if (!input.is_open()) {
        cout << "cannot open the file: " << filename << endl;
        return;
    }

        buffer << input.rdbuf();
        srcStr = buffer.str();

        // update the amount of characters
        result.charNum += getCharNum(srcStr);

        //update the amount of lines
        result.lineNum += getLineNum(filename);

        //update the wvec
        getNewExpr(srcStr, wvec, pvec,result);
    

    input.close();
}


/* print the results in the required format*/
void getResult(const char* resfile, amount &result, vector<word> &wvec, vector<phrase> &pvec) {

    auto wvecSize = wvec.size();
    auto pvecSize = pvec.size();
    ofstream output(resfile);

    output << "char_number :" << result.charNum << endl;
    output << "line_number :" << result.lineNum << endl;
    output << "word_number :" << result.wordNum << endl;

    //sort wvec in descending frequency order
    vector<word>::iterator wbeg = wvec.begin(), wend = wvec.end(), witr;
    sort(wbeg, wend, wordCompare);

    output << " " << endl;
    output << "the top ten frequency of words" << endl;
    if(wvecSize){

        if (wvecSize < 10) {
            for (witr = wbeg; witr != wend; witr++) {
                witr->printWord(output);
            }
        }
        else {
            vector<word>::iterator wlast = wbeg + 10;
            for (witr = wbeg; witr != wlast; witr++) {
                witr->printWord(output);
            }
        }
    }

    //sort pvec in descending frequency order
    vector<phrase>::iterator pbeg = pvec.begin(), pend = pvec.end(), pitr;
    sort(pbeg, pend, phraseCompare);

    output << " " << endl;
    output << "the top ten frequency of phrases" << endl;
    if (pvecSize) {

        if (pvecSize < 10) {
            for (pitr = pbeg; pitr != pend; pitr++) {
                pitr->printPhrase(output);
            }
        }
        else {
            vector<phrase>::iterator plast = pbeg + 10;
            for (pitr = pbeg; pitr != plast; pitr++) {
                pitr->printPhrase(output);
            }
        }
    }
    
}


/* determine whether the given path is a directory or a file, 
   if it is a directory, push names of all the files in the 
   directory into fvec*/
int getAllFiles(string path, vector<string> &files)
{
    intptr_t hFile = 0;   // _findfirst returns intptr_t; long truncates the handle on 64-bit Windows
    int flag = -1;

    struct _finddata_t fileinfo;
    string p;

    if ((hFile = _findfirst(p.assign(path).append("\\*").c_str(), &fileinfo)) != -1)
    {
        flag = 0;
        do   // do-while, so the first entry returned by _findfirst is not skipped
        {
            if ((fileinfo.attrib &  _A_SUBDIR))  //if it is a folder
            {
                if (strcmp(fileinfo.name, ".") != 0 && strcmp(fileinfo.name, "..") != 0)
                {
                    // recurse into the subfolder
                    getAllFiles(p.assign(path).append("/").append(fileinfo.name), files);
                }
            }
            else    //it is a file
            {
                files.push_back(p.assign(path).append("/").append(fileinfo.name)); // full path of the file
            }
        } while (_findnext(hFile, &fileinfo) == 0);
        _findclose(hFile);
    }

    return flag;
}


int main(int argc, char* argv[]) {

    amount  result;
    result.charNum = 0;
    result.lineNum = 0;
    result.wordNum = 0;
    vector<word> wvec;
    vector<phrase> pvec;
    int dirFlag;
    vector<string> fvec;

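    // the assignment passes the path on the command line; 05.txt remains the test default
    string path = (argc > 1) ? argv[1] : "05.txt";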
    
    const char* resFile = "AllFiles.txt";

    dirFlag = getAllFiles(path, fvec);

    if (dirFlag == 0) {

        vector<string>::iterator end = fvec.end(), it;
        for (it = fvec.begin(); it != end; it++) {
            fileProcess(*it, result, wvec, pvec);
        }
    }
    else {
        fileProcess(path, result, wvec, pvec);
    }
    
    getResult(resFile, result, wvec, pvec);
    system("pause");
}

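Output (screenshot):

Results from the VS Performance Wizard (screenshot):

From this it is clear that the find lookup in examineNewPhr is very time-consuming, because find relies on the overloaded == operator to test equality.

So I decided to store a hash table inside the vector.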
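For comparison, the phrase lookup can also be handed to std::unordered_map by hashing a normalized key instead of calling a custom == on every stored element. This is a sketch under my own assumptions, not what version two does: `wordKey` and `countPhrases` are hypothetical names, the phrase key is simply the two word keys joined by a space (a space can never occur inside a word), and tracking which spelling to print is left out.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed key: lowercase spelling with trailing digits removed.
std::string wordKey(const std::string& word) {
    std::size_t end = word.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1]))) --end;
    std::string key(word, 0, end);
    for (char& c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return key;
}

// Count adjacent word pairs; each lookup is a hash probe, not a linear scan.
std::unordered_map<std::string, unsigned>
countPhrases(const std::vector<std::string>& words) {
    std::unordered_map<std::string, unsigned> freq;
    for (std::size_t i = 1; i < words.size(); ++i) {
        ++freq[wordKey(words[i - 1]) + ' ' + wordKey(words[i])];
    }
    return freq;
}
```

For the sequence good123 good456 good789 this yields a single key, "good good", with a count of 2, which matches the phrase rule in the assignment.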

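Version two of the code follows:

Notes: this version only implements word counting, not phrase counting. Apart from the modified word class and examineNewWord function and the added hash function, the rest is basically unchanged. newsample is the test set for this assignment, roughly 175 MB of data. Version one simply cannot get through it.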

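#include<iostream>
#include<fstream>
#include<string>
#include<sstream>
#include<vector>
#include<algorithm>
#include<cctype>
#include<cstdlib>    // abs
#include<cstring>    // strcmp
#include<stdexcept>  // runtime_error
#include<regex>
#include<io.h>       // _findfirst, _findnext (Windows only)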

using namespace std;

// no trailing semicolons: a macro is pasted verbatim into every expression that uses it
#define WORD_POOL_SIZE 18000000
#define MAX_FIGURES 20
int begFlag = 1;

typedef struct {
    unsigned int charNum;
    unsigned int lineNum;
    unsigned int wordNum;
}amount;

class word{

public:
    string wordStr;
    unsigned int freq;

    word(string str, unsigned int fre) {
        wordStr = str;
        freq = fre;
    }


    void resetWordStr(string str) {
        if (str < wordStr) {
            wordStr = str;
        }
    }

    // two words are equal if they match case-insensitively once trailing digits are ignored
    bool operator == (const word &obj) const {
        const string &word1 = this->wordStr, &word2 = obj.wordStr;
        int i = (int)word1.length() - 1;
        int j = (int)word2.length() - 1;
        // skip the trailing digits of both words
        while (i >= 0 && isdigit((unsigned char)word1[i])) i--;
        while (j >= 0 && isdigit((unsigned char)word2[j])) j--;
        if (i != j) return false;
        for (int t = 0; t <= i; t++) {
            // tolower avoids the false matches of a "differs by 32" test (e.g. '0' vs 'P')
            if (tolower((unsigned char)word1[t]) != tolower((unsigned char)word2[t]))
                return false;
        }
        return true;
    }

    void printWord(ofstream &output) {
        output << wordStr << "\t" << freq << endl;
    }
    
};

bool wordCompare(word former, word latter) {
    return former.freq > latter.freq;
}

unsigned int Hash(string str) {
    
    const char *p = str.c_str();
    unsigned int seed = 7, key;
    unsigned long long hash = 0;
    int figures=0;
    while (*p!='\0'&& figures<= 20) {
        hash = hash*seed + (*p);
        p++;
        figures++;
    }
    key = hash%WORD_POOL_SIZE;
    return(key);
}

void examineNewWord(vector<word> &wvec, word &newWord) {

    string str = newWord.wordStr;
    int i = str.length() - 1;

    while (i >= 0) {
        if (str[i] >= '0'&&str[i] <= '9') {
            str[i] = '\0';
        }
        else if (str[i]>=97&&str[i]<=122) {
            str[i] = str[i] - 32;
        }
        i--;
    }

    unsigned int key = Hash(str);
    int outOfSlot = 1;
    int open = 0;
    vector<word>::iterator beg = wvec.begin();
    vector<word>::iterator itr = beg + key;
    
    while (outOfSlot) {
        itr = beg + (itr - beg + open * 13)%WORD_POOL_SIZE;
        if (itr->wordStr == "\0") {
            itr->wordStr = newWord.wordStr;
            itr->freq++;
            outOfSlot = 0;
        }
        else if (*itr == newWord) {
            itr->resetWordStr(newWord.wordStr);
            itr->freq++;
            outOfSlot = 0;
        }
        open++;
    }

}

void getNewExpr(string &str, vector<word> &wvec, unsigned int &wordNum) {

    word newWord("\0",1);
    string wordPattern("[[:alpha:]]{4}[[:alnum:]]{0,1020}");
    regex reg(wordPattern);

    for (sregex_iterator it(str.begin(), str.end(), reg), end_it;
        it != end_it; it++) {
        wordNum++;
        newWord.wordStr = it->str();
        examineNewWord(wvec, newWord);
    }
}

/* calculate the amount of characters with ASCII code within [32,126]*/
unsigned long getCharNum(string &str) {

    unsigned long charNum = 0;
    string::iterator end = str.end(), citr;
    for (citr = str.begin(); citr != end; citr++) {
        if (*citr >= 32 && *citr <= 126)
            charNum++;
    }
    return charNum;
}


/* calculate the number of lines in one file */
unsigned long getLineNum(string filename) {

    ifstream input(filename);
    unsigned long lines = 0;
    string str;
    while (!input.eof()) {
        /*
        if (getline(input, str)) {

            lines++;
        }*/
        getline(input, str);
        lines++;
    }

    return lines;
}

void fileProcess(const char* filename, amount &result, vector<word> &wvec) {

    ifstream input;
    stringstream buffer;
    string srcStr;

    // report the failure directly; an eof() check right after open() can never fire
    input.open(filename);
    if (!input.is_open()) {
        cout << "cannot open the file: " << filename << endl;
        return;
    }

    buffer << input.rdbuf();
    srcStr = buffer.str();

    // update the amount of characters
    result.charNum += getCharNum(srcStr);

    //update the amount of lines
    result.lineNum += getLineNum(filename);

    //update the wvec
    getNewExpr(srcStr, wvec, result.wordNum);


    input.close();
}

/* print the results in the required format*/
void getResult(const char* resfile, amount &result, vector<word> &wvec) {

    auto wvecSize = wvec.size();
    //auto pvecSize = pvec.size();
    ofstream output(resfile);

    output << "char_number :" << result.charNum << endl;
    output << "line_number :" << result.lineNum << endl;
    output << "word_number :" << result.wordNum << endl;

    //sort wvec in descending frequency order
    vector<word>::iterator wbeg = wvec.begin(), wend = wvec.end(), witr;
    sort(wbeg, wend, wordCompare);

    output << " " << endl;
    output << "the top ten frequency of words" << endl;
    if (wvecSize) {

        if (wvecSize < 10) {
            for (witr = wbeg; witr != wend; witr++) {
                witr->printWord(output);
            }
        }
        else {
            vector<word>::iterator wlast = wbeg + 10;
            for (witr = wbeg; witr != wlast; witr++) {
                witr->printWord(output);
            }
        }
    }
    
}

/* determine whether the given path is a directory or a file,
if it is a directory, push names of all the files in the
directory into fvec*/
int getAllFiles(string path, vector<string> &files)
{
    intptr_t hFile = 0;   // _findfirst returns intptr_t; long truncates the handle on 64-bit Windows
    int flag = -1;

    struct _finddata_t fileinfo;
    string p;

    if ((hFile = _findfirst(p.assign(path).append("\\*").c_str(), &fileinfo)) != -1)
    {
        flag = 0;
        do   // do-while, so the first entry returned by _findfirst is not skipped
        {
            if ((fileinfo.attrib &  _A_SUBDIR))  //if it is a folder
            {
                if (strcmp(fileinfo.name, ".") != 0 && strcmp(fileinfo.name, "..") != 0)
                {
                    // recurse into the subfolder
                    getAllFiles(p.assign(path).append("/").append(fileinfo.name), files);
                }
            }
            else    //it is a file
            {
                files.push_back(p.assign(path).append("/").append(fileinfo.name)); // full path of the file
            }
        } while (_findnext(hFile, &fileinfo) == 0);
        _findclose(hFile);
    }

    return flag;
}

int main(int argc, char* argv[]) {

    amount  result;
    result.charNum = 0;
    result.lineNum = 0;
    result.wordNum = 0;
    vector<word> wvec(18000000,word("\0",0));
    int dirFlag;
    vector<string> fvec;

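    // the assignment passes the path on the command line; the test set remains the default
    const char* path = (argc > 1) ? argv[1] : "D:/Visual Studio/newsample";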

    const char* resFile = "AllFiles.txt";

    dirFlag = getAllFiles(path, fvec);

    if (dirFlag == 0) {

        vector<string>::iterator end = fvec.end(), it;
        for (it = fvec.begin(); it != end; it++) {
            fileProcess(it->c_str(), result, wvec);
        }
    }
    else {
        fileProcess(path, result, wvec);
    }

    getResult(resFile, result, wvec);
    //system("pause");
}

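Now let's look at the VS Performance Wizard report (screenshot):

Clearly the lookup became much more efficient, and the new hotspot moved to the regular expression matching.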
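One common way to take std::regex out of the hot path is a hand-written scanner. The sketch below is an alternative I did not implement or measure for this assignment; `extractWords` is a hypothetical name and ASCII input is assumed. It also treats a token such as "123file" as a whole, whereas the pattern used in getNewExpr would, as far as I can tell, match the "file" inside it.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into maximal runs of letters/digits and keep the runs that
// start with at least 4 letters and are at most 1024 characters long.
std::vector<std::string> extractWords(const std::string& text) {
    std::vector<std::string> words;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        // skip separators (anything that is not a letter or digit)
        while (i < n && !std::isalnum(static_cast<unsigned char>(text[i]))) ++i;
        const std::size_t start = i;
        std::size_t leadingLetters = 0;
        bool leading = true;  // still inside the initial run of letters
        while (i < n && std::isalnum(static_cast<unsigned char>(text[i]))) {
            if (leading && std::isalpha(static_cast<unsigned char>(text[i]))) {
                ++leadingLetters;
            } else {
                leading = false;
            }
            ++i;
        }
        const std::size_t len = i - start;
        if (leadingLetters >= 4 && len <= 1024) {
            words.emplace_back(text, start, len);
        }
    }
    return words;
}
```

The scanner visits each character once and allocates only for accepted words, so its cost is linear in the input size.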

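I have not yet looked into how to optimize this. Another glaring problem with this version is wasted space: main directly creates a vector of size 18000000, which uses an enormous amount of memory. Since there are at least twice as many phrases as words, the phrase table would need about 35000000 slots (the actual count is a little over 33000000), and the program then fails. I took this for a stack overflow, although a vector keeps its elements on the heap, so running out of memory is the more likely cause. This shows that my hash collision strategy is not reasonable: I should have picked a collision resolution method that allocates memory dynamically and uses space more sensibly.

Having read this far, you can probably understand my regret. As for where the mistakes began, it was the rushed requirements analysis and up-front design. As our instructor required, we used Teambition to make a PSP (Personal Software Process). But I could not find the export button (ಠ_ಠ), so here is a screenshot first, followed by a quickly made table.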

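Task

Estimated time

Actual time

Learn C++ basics

10h

Work out a solution for each feature

Design the overall program architecture

30min

25min

Implement counting of characters, lines, total words and word occurrences

2.5h

Implement phrase occurrence counting and output the top 10

Implement statistics over all files in a folder

1.5h

Compare with the reference results and optimize the code

12h

∞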

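Time for me to catch up on homework for my other courses......

When I have time, I will update this post with the better word frequency solutions from my class. Σ(・ω・ノ)ノ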
