Summary of the Word-Frequency Counting Coding Assignment

Reposted; originally published 2018-03-31 11:28.

Summary of My First Individual Coding Assignment

On the Self-Cultivation of a Salted Fish

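1· Project Requirements

1.1 Basic Features

Count the characters, words, lines, and word frequencies of source files (*.txt, *.cpp, *.h, *.cs, *.html, *.js, *.java, *.py, *.php, and so on; every file in a folder), write the results to a default file in the specified format, support other extended features, and process many files quickly.
Use a profiling tool to find the performance bottlenecks and improve them.
Run a code-quality analysis and eliminate all warnings.
Design 10 test cases to make sure the program runs correctly (for example: an empty file, a file containing a single word, a file with only one line, a typical file).
Use GitHub for source control.
Profile the program on Linux (bonus task).
Write a blog post.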

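1.2 Notes

a) The characters to count are those whose ASCII code lies in the range [32, 126].

b) Definition of a word: it starts with at least 4 English letters, followed by letters or digits; words are separated by delimiters (any character that is not a letter or digit); case is ignored; two words that differ only in their trailing digits are equivalent; for a group of equivalent words, the form that comes first in dictionary order is the one output; only word lengths in [4, 1024] need to be considered, and anything outside that range is not counted.

c) Definition of a phrase: two adjacent valid words in the same file form a phrase; two phrases are the same if their corresponding words are equivalent; output follows dictionary order.

d) The input file name is passed as a command-line argument. To traverse a whole folder, pass the folder's path instead.

e) The program decides from the command-line argument whether it is a directory.

f) The words of all files are counted together, and only one overall word-frequency result file, result.txt, is produced.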
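
To make rule b) concrete, here is a minimal sketch of splitting text into valid words. The names isValidWord and extractWords are hypothetical helpers for illustration, not part of the submitted program, and the sketch takes the whole text as a std::string instead of reading a file.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: true if `token` (a maximal run of letters and digits)
// is a word under rule b): at least four leading letters, length in [4, 1024].
bool isValidWord(const std::string &token)
{
    if (token.size() < 4 || token.size() > 1024)
        return false;
    for (std::size_t i = 0; i < 4; ++i)
        if (!std::isalpha(static_cast<unsigned char>(token[i])))
            return false;
    return true;
}

// Hypothetical helper: splits `text` at every character that is not an ASCII
// letter or digit and keeps only the tokens that are valid words.
std::vector<std::string> extractWords(const std::string &text)
{
    std::vector<std::string> words;
    std::string token;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        // Treat the end of the text as one final separator.
        unsigned char c = (i < text.size()) ? static_cast<unsigned char>(text[i]) : ' ';
        if (c < 128 && std::isalnum(c)) {
            token += static_cast<char>(c);
        } else {
            if (isValidWord(token))
                words.push_back(token);
            token.clear();
        }
    }
    return words;
}
```

For the input "file123 abc 123file ab1cde Hello,world2", only file123, Hello, and world2 qualify: abc is too short, and the other two tokens have a digit among their first four characters.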

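2· How My Thinking Evolved

2.1 Traversing the Folder

At first I did not know how a C/C++ program opens a directory at a given path. After searching online I learned that <io.h> provides library functions for this: the struct _finddata_t stores file information, and the three functions _findfirst, _findnext, and _findclose do the searching. I studied the meaning of each struct member and the parameters and return values of the functions, which was enough to open files; combining that with recursion handles nested folders. Details are in Section 3.2.1.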

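2.2 Reading Words

Because the number of valid characters and the number of lines also have to be counted, my first idea was to use fgetc() and read the file one character at a time after opening it. Characters are counted by their ASCII value. For lines, every newline character starts a new line after the first one, so the line count of a file is the number of newline characters plus one. Words can be analyzed, validated, and counted while the characters are being read.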
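
A minimal sketch of that character-by-character pass is shown below. The Counts struct and the function name are assumptions made for illustration. The one detail worth noting is that the value returned by fgetc is kept in an int, so that EOF cannot be confused with a valid byte.

```cpp
#include <cstdio>

// Hypothetical result type for this sketch.
struct Counts {
    long chars = 0;   // bytes whose value lies in [32, 126]
    long lines = 0;   // number of newline characters plus one
};

// Reads `fp` to the end, one character at a time.
Counts countCharsAndLines(std::FILE *fp)
{
    Counts c;
    c.lines = 1;                                // the first line needs no newline
    int ch;                                     // int, not char: EOF must stay distinguishable
    while ((ch = std::fgetc(fp)) != EOF) {
        if (ch >= 32 && ch <= 126)
            ++c.chars;
        if (ch == '\n')
            ++c.lines;
    }
    return c;
}
```

Under this rule an empty file has 0 characters and 1 line.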

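The next question was which data structure to store the words in. I first looked into tries, which are good for sorting and lookup, but they are tedious to modify and we do not need sorting, so I dropped the idea. Then I considered a hash table, which suits our needs well, but writing one seemed like a hassle because it needs a hash function and a collision-resolution scheme, so I dropped that too. Finally, inspired by the rule that the first four characters of a word must be letters, I mapped the first four letters one-to-one onto a four-digit base-26 number. I defined a struct for the word information and allocated an array of 26*26*26*26 struct pointers. Each word's slot is computed from its first four letters, and words that share the same first four letters are kept in a linked list at that slot, sorted in dictionary order, for lookup, insertion, and update. Details are in Section 3.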
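
The mapping from the first four letters to an array slot can be sketched as follows. bucketIndex and kBuckets are hypothetical names; the arithmetic is the same base-26 computation as address[0] * 17576 + address[1] * 676 + address[2] * 26 + address[3] in the program.

```cpp
#include <cctype>
#include <cstddef>

const std::size_t kBuckets = 26 * 26 * 26 * 26;   // 456976 slots

// Maps the first four letters of `word` to a slot in [0, kBuckets), reading
// them as a four-digit base-26 number and ignoring case.
// Assumes the caller has already checked that the first four characters are letters.
std::size_t bucketIndex(const char *word)
{
    std::size_t index = 0;
    for (int i = 0; i < 4; ++i) {
        int digit = std::tolower(static_cast<unsigned char>(word[i])) - 'a';
        index = index * 26 + static_cast<std::size_t>(digit);
    }
    return index;
}
```

For example, "abcd" maps to 0*17576 + 1*676 + 2*26 + 3 = 731, and "zzzz" maps to the last slot, 456975.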

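2.3 Reading Phrases

My first reaction to the phrase requirement was reluctance, so I decided to get words working first. Little did I know that the awkward part would begin only after the words were done. I ended up going through two completely different implementations of the phrase-handling code.

In the original version, the word structs were already allocated and I did not want to request a lot more storage for phrases. So I added a member to the word struct: each time a word is read, it gets a chain pointing to every word that has appeared immediately before it, and that chain stores the phrase information. This put me in an awkward spot: the code was, predictably, tedious. More awkward still, I calmed down and finished the counting anyway, the top-10 phrase results were all correct, and the program was, just as predictably, slow. By the time the code was integrated it was already Thursday noon, but I still decided to redo this part.

In the optimized version I switched to a hash table for phrases after all, because I found that the C++ standard library provides unordered_map. I had never used it before and was not even familiar with C++, so I studied classes and objects and how to overload operators. I defined a phrase struct holding pointers to its two words and the phrase count, used that struct as the hash key, wrote a custom hash function and a custom key-comparison function, and handled phrases with unordered_map.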

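2.4 Frequency Statistics

Having misjudged the "revolutionary line" at the start, I naively assumed that traversing such a large amount of data at the end would be bad, so I updated the top 10 while reading. Only later did I realize that updating in real time costs more: many words repeat, so every repeated occurrence wastes comparisons during reading, whereas a single pass after reading looks at each distinct word only once.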
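
Selecting the top entries in one pass after all counting is done can be sketched like this. The sketch assumes the counts sit in a std::unordered_map keyed by std::string and uses std::partial_sort; the submitted program instead walks its own linked lists with a small insertion buffer. topN and Entry are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// Returns the `n` most frequent entries, highest count first.
// Ties are broken by dictionary order of the word.
std::vector<Entry> topN(const std::unordered_map<std::string, int> &counts, std::size_t n)
{
    std::vector<Entry> entries(counts.begin(), counts.end());
    n = std::min(n, entries.size());
    std::partial_sort(entries.begin(), entries.begin() + n, entries.end(),
                      [](const Entry &a, const Entry &b) {
                          if (a.second != b.second)
                              return a.second > b.second;   // higher count first
                          return a.first < b.first;         // then dictionary order
                      });
    entries.resize(n);
    return entries;
}
```

Every distinct word is examined once, regardless of how many times it occurred in the input.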

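3· Implementation

3.1 Data Structures

3.1.1 Word (original version)

struct phrase;                  // defined in 3.1.2

struct wordsdata                // word information
{
    char words[1024];           // the word string
    int number;                 // occurrence count
    wordsdata *next;
    phrase *prehead;            // head of the chain of phrases that end in this word
};

Each word also has a chain pointing to all the preceding words with which it forms a phrase.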

3.1.2 Phrase (original version)

struct phrase           // phrase information
{
    int number;         // occurrence count
    char *before;       // points to the preceding word
    phrase *next;
};

Stores the address of the preceding word's string.

3.1.3 Word (optimized version)

struct wordsdata                // word information
{
    char words[1024];           // the word string
    int number;                 // occurrence count
    wordsdata *next;
};

Stores only the basic word information.

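3.1.4 Phrase (optimized version)

struct phrases
{
    char *one;          // first word of the phrase
    char *tw;           // second word of the phrase
    int num;            // occurrence count
};

Points to the strings of the two words and stores the count.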

3.1.5 Hash (optimized version)

struct phrase_cmp
{
    // Two phrases are equal when both pairs of words are equivalent
    // (wordcmp returns -1, 0, or 1 for equivalent words).
    bool operator()(const phrases &p1, const phrases &p2) const
    {
        return ((wordcmp(p1.one, p2.one) < 2) && (wordcmp(p1.tw, p2.tw) < 2));
    }
};

struct phrase_hash
{
    // Hashes only the letters, folded to 1..26, and skips digits, so that
    // phrases that compare equal under phrase_cmp also get the same hash.
    size_t operator()(const phrases &ph) const
    {
        unsigned long h = 0;
        int temp;
        size_t i;
        for (i = 0; ph.one[i]; i++)
        {
            temp = ph.one[i];
            if (temp > 64)
            {
                temp = (temp > 96) ? (temp - 96) : (temp - 64);   // keep the folded value
                h += (29 * h + temp);
                h %= 2147483647;
            }
        }
        for (i = 0; ph.tw[i]; i++)
        {
            temp = ph.tw[i];
            if (temp > 64)
            {
                temp = (temp > 96) ? (temp - 96) : (temp - 64);   // keep the folded value
                h += (29 * h + temp);
                h %= 2147483647;
            }
        }

        return size_t(h);
    }
};

typedef unordered_map<phrases, int, phrase_hash, phrase_cmp> Char_Phrase;
Char_Phrase phrasemap;

unordered_map is used with a custom key, namely the phrase struct from 3.1.4, together with a custom hash function and a custom hash comparison function.
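
unordered_map only calls the comparison function on keys whose hashes land in the same bucket, so two keys that compare equal must also hash to the same value. The sketch below is one way to guarantee that, by deriving both the hash and the comparison from a single canonical form. It uses std::string keys instead of the char pointers above, and PhraseKey, canonical, PhraseKeyHash, and PhraseKeyEqual are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type: a phrase is two adjacent words.
struct PhraseKey {
    std::string first;
    std::string second;
};

// Canonical form shared by both functors: lowercase, trailing digits removed.
static std::string canonical(const std::string &word)
{
    std::size_t end = word.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1])))
        --end;
    std::string key = word.substr(0, end);
    for (char &c : key)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return key;
}

struct PhraseKeyHash {
    std::size_t operator()(const PhraseKey &k) const
    {
        std::hash<std::string> h;
        return h(canonical(k.first)) * 31 + h(canonical(k.second));
    }
};

struct PhraseKeyEqual {
    bool operator()(const PhraseKey &a, const PhraseKey &b) const
    {
        return canonical(a.first) == canonical(b.first) &&
               canonical(a.second) == canonical(b.second);
    }
};

using PhraseCounts = std::unordered_map<PhraseKey, int, PhraseKeyHash, PhraseKeyEqual>;
```

With this map, ++counts[PhraseKey{"Hello", "World"}] and ++counts[PhraseKey{"hello", "world2"}] update the same entry. Recomputing the canonical form on every call is wasteful; it is done here only to keep the sketch short.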

3.2 Main Functions

3.2.1 Traversing the Folder

int getfiles(char *path, struct _finddata_t *fileinfo, intptr_t handle)
{
    // _findfirst returns an intptr_t; a long would truncate the handle on 64-bit Windows
    handle = _findfirst(path, fileinfo);            // open the entry named by path
    if (handle == -1)
        return -1;

    do
    {
        //printf("> %s\n", path);           // print the directory name

        if (fileinfo->attrib & _A_SUBDIR)           // the entry is a subdirectory
        {
            if (strcmp(fileinfo->name, ".") != 0 && strcmp(fileinfo->name, "..") != 0)
            {
                char temppath[1024] = "";              // path of an entry inside the subdirectory
                intptr_t temphandle = 0;
                struct _finddata_t tempfileinfo;
                strcpy(temppath, path);
                strcat(temppath, "/*");

                temphandle = _findfirst(temppath, &tempfileinfo);  // open the subdirectory
                if (temphandle == -1)
                {
                    _findclose(handle);             // do not leak the parent handle
                    return -1;
                }

                do                              // recurse into every entry of the subdirectory
                {
                    if (strcmp(tempfileinfo.name, ".") != 0 && strcmp(tempfileinfo.name, "..") != 0)
                    {
                        strcpy(temppath, path);
                        strcat(temppath, "/");
                        strcat(temppath, tempfileinfo.name);
                        getfiles(temppath, &tempfileinfo, temphandle);
                    }
                } while (_findnext(temphandle, &tempfileinfo) != -1);

                _findclose(temphandle);
            } // recursion finished

        } // subdirectory done
        else
            getwords(path, fourletter);

    } while (_findnext(handle, fileinfo) != -1);

    _findclose(handle);       // close the search handle

    return 1;
}

The path is kept in char *path on every call, and the folder is traversed recursively.

3.2.2 Reading Words and Phrases (original version)

int getwords(char *path, struct wordsdata **word)
{
    FILE *fp;
    int cmp = 0;
    int num = 0;                // slot computed from the first four letters
    int temp = 0;               // one character from fgetc; an int so that EOF stays distinguishable
    int length = 0;

    char present[1024] = "";    // the current word
    char address[4] = "";

    struct wordsdata *pre = NULL;
    struct wordsdata *q = NULL;
    struct wordsdata *newword = NULL;
    struct wordsdata *previous = NULL;
    struct phrase *preword = NULL;
    struct phrase *p = NULL;

    if ((fp = fopen(path, "r")) == NULL)
    {
        printf("Failed to open file %s!\n", path);
        return 0;
    }
    linenum++;
    while (temp != EOF)
    {
        // read the next character
        temp = fgetc(fp);
        if (temp > 31 && temp < 127)
            charnum++;
        if (temp == '\n' || temp == '\r')
            linenum++;

        while ((temp >= '0' && temp <= '9') || (temp >= 'a' && temp <= 'z') || (temp >= 'A' && temp <= 'Z'))
        {
            if (length != -1 && length < 4)
            {
                if (temp >= 'A')  // a letter
                {
                    present[length] = (char)temp;
                    address[length] = (char)(temp >= 'a' ? (temp - 'a') : (temp - 'A'));
                    length++;
                }
                else            // a digit among the first four characters: not a word
                    length = -1;
            }
            else if (length >= 4)
            {
                if (length < 1023)
                {
                    present[length] = (char)temp;
                    length++;
                }
                else            // longer than the buffer: not counted
                    length = -1;
            }
            temp = fgetc(fp);
            if (temp > 31 && temp < 127)
                charnum++;
            if (temp == '\n' || temp == '\r')
                linenum++;
        } // end while

        // decide whether the token is a word
        if (length >= 4)
        {
            present[length] = '\0';     // terminate the current word
            wordnum++;

            // slot given by the first four letters
            num = address[0] * 17576 + address[1] * 676 + address[2] * 26 + address[3];

            // insert the current word
            if (word[num] == NULL)
            {
                word[num] = new wordsdata;      // dummy head node of this slot's list
                newword = new wordsdata;
                newword->number = 1;
                newword->next = NULL;
                newword->prehead = NULL;
                strcpy(newword->words, present);
                word[num]->next = newword;

                if (previous != NULL)
                {
                    newword->prehead = new phrase;
                    newword->prehead->before = previous->words;
                    newword->prehead->next = NULL;
                    newword->prehead->number = 1;
                }
                previous = newword;
            }
            else
            {
                pre = word[num];
                q = pre->next;
                cmp = wordcmp(q->words, present);

                while (cmp == small)
                {
                    pre = q;
                    q = q->next;
                    if (q != NULL)
                        cmp = wordcmp(q->words, present);
                    else
                        break;
                }
                if (q != NULL && cmp <= 1)      // an equivalent word already exists
                {
                    q->number++;
                    if (cmp == 1)               // keep the form that sorts first
                        strcpy(q->words, present);

                    if (previous != NULL)
                    {
                        if (q->prehead == NULL)
                        {
                            q->prehead = new phrase;
                            q->prehead->before = previous->words;
                            q->prehead->next = NULL;
                            q->prehead->number = 1;
                        }
                        else
                        {
                            p = q->prehead;
                            while (p != NULL && p->before != previous->words)
                            {
                                p = p->next;
                            }
                            if (p != NULL)
                                p->number++;
                            else                // new phrase: insert at the head of the chain
                            {
                                preword = new phrase;
                                preword->before = previous->words;
                                preword->number = 1;
                                preword->next = q->prehead;
                                q->prehead = preword;
                            }
                        }
                    }
                    previous = q;
                }
                else                            // new word: insert it in sorted position
                {
                    newword = new wordsdata;
                    newword->number = 1;
                    newword->prehead = NULL;
                    strcpy(newword->words, present);
                    pre->next = newword;
                    newword->next = q;

                    if (previous != NULL)
                    {
                        newword->prehead = new phrase;
                        newword->prehead->before = previous->words;
                        newword->prehead->next = NULL;
                        newword->prehead->number = 1;
                    }
                    previous = newword;
                }
            }
        }
        length = 0;
    }

    fclose(fp);
    return 1;
}

The function decides whether each token is a word while reading, and looks up, inserts, or updates the word. Each word struct also carries a chain that stores the words with which it forms phrases.

3.2.3 Reading Words and Phrases (optimized version)

int getwords(char *path, struct wordsdata **word)
{
    FILE *fp;
    int cmp = 0;
    int num = 0;                // slot computed from the first four letters
    int temp = 0;               // one character from fgetc; an int so that EOF stays distinguishable
    int length = 0;

    char present[1024] = "";    // the current word

    char address[4] = "";
    struct wordsdata *q = NULL;
    struct wordsdata *pre = NULL;
    struct wordsdata *neword = NULL;
    struct wordsdata *now = NULL;
    struct wordsdata *previous = NULL;

    if ((fp = fopen(path, "r")) == NULL)
    {
        //printf("error!!! \n", path);
        return 0;
    }
    linenum++;
    while (temp != EOF)
    {
        // read the next character
        temp = fgetc(fp);
        if (temp > 31 && temp < 127)
            charnum++;
        if (temp == '\n' || temp == '\r')
            linenum++;

        while ((temp >= '0' && temp <= '9') || (temp >= 'a' && temp <= 'z') || (temp >= 'A' && temp <= 'Z'))
        {
            if (length != -1 && length < 4)
            {
                if (temp >= 'A')  // a letter
                {
                    present[length] = (char)temp;
                    address[length] = (char)(temp >= 'a' ? (temp - 'a') : (temp - 'A'));
                    length++;
                }
                else            // a digit among the first four characters: not a word
                    length = -1;
            }
            else if (length >= 4)
            {
                if (length < 1023)
                {
                    present[length] = (char)temp;
                    length++;
                }
                else            // longer than the buffer: not counted
                    length = -1;
            }
            temp = fgetc(fp);
            if (temp > 31 && temp < 127)
                charnum++;
            if (temp == '\n' || temp == '\r')
                linenum++;
        } // end while

        // decide whether the token is a word
        if (length >= 4)
        {
            present[length] = '\0';     // terminate the current word
            wordnum++;

            // slot given by the first four letters
            num = address[0] * 17576 + address[1] * 676 + address[2] * 26 + address[3];

            // insert the current word
            if (word[num] == NULL)
            {
                word[num] = new wordsdata;      // dummy head node of this slot's list
                neword = new wordsdata;
                neword->number = 1;
                neword->next = NULL;
                strcpy(neword->words, present);
                word[num]->next = neword;
                now = neword;
            }
            else
            {
                pre = word[num];
                q = pre->next;
                cmp = wordcmp(q->words, present);

                while (cmp == small)
                {
                    pre = q;
                    q = q->next;
                    if (q != NULL)
                        cmp = wordcmp(q->words, present);
                    else
                        break;
                }
                if (q != NULL && cmp <= 1)      // an equivalent word already exists
                {
                    now = q;
                    q->number++;
                    if (cmp == 1)               // keep the form that sorts first
                        strcpy(q->words, present);
                }
                else                            // new word: insert it in sorted position
                {
                    neword = new wordsdata;
                    neword->number = 1;
                    strcpy(neword->words, present);
                    pre->next = neword;
                    neword->next = q;
                    now = neword;
                }
            }

            if (previous != NULL)
            {
                // count the phrase (previous word, current word); operator[]
                // inserts a zero count the first time a phrase is seen
                struct phrases key;
                key.one = previous->words;
                key.tw = now->words;
                key.num = 0;
                phrasemap[key]++;
            }
            previous = now;
        }
        length = 0;
    }

    fclose(fp);
    return 1;
}

The function decides whether each token is a word while reading, and looks up, inserts, or updates the word. Phrases are handled with unordered_map, which builds the hash table and looks up, inserts, or updates each phrase.

3.2.4 Comparing Word Order

int wordcmp(char *str1, char *str2)
{
    char *p1 = str1;
    char *p2 = str2;
    char q1 = *p1;
    char q2 = *p2;

    // compare case-insensitively by folding to upper case
    if (q1 >= 'a' && q1 <= 'z')
        q1 -= 32;

    if (q2 >= 'a' && q2 <= 'z')
        q2 -= 32;

    while (q1 && q2 && q1 == q2)
    {
        p1++;
        p2++;

        q1 = *p1;
        q2 = *p2;

        if (q1 >= 'a' && q1 <= 'z')
            q1 -= 32;

        if (q2 >= 'a' && q2 <= 'z')
            q2 -= 32;
    }

    // skip the digits that follow the common prefix
    while (*p1 >= '0' && *p1 <= '9')
        p1++;
    while (*p2 >= '0' && *p2 <= '9')
        p2++;

    if (*p1 == 0 && *p2 == 0)           // the two words are equivalent
    {
        // -1 if str1 sorts first, 1 if str2 sorts first, 0 if identical;
        // strcmp only guarantees the sign, so normalize it
        int order = strcmp(str1, str2);
        return (order > 0) - (order < 0);
    }

    if (q1 < q2)                   // str1 is smaller
        return 2;

    if (q1 > q2)                   // str2 is smaller
        return 3;

    return 4;
}

Compares two words: reports whether they are equivalent and which one comes first in dictionary order.

3.2.5 Getting the Top 10 (original version)

int gettop(struct wordsdata **word)
{
    int i = 0, j = 0;
    // phrasetop (definition not shown in this post) holds first, last, and number.
    // Slots 1..10 hold the current top 10, slot 11 receives each candidate,
    // and slot 0 is scratch space for swapping.
    struct phrasetop *ph[12] = {};
    struct wordsdata *top[12] = {};
    struct phrase *p = NULL;
    struct wordsdata *w = NULL;

    for (j = 0; j < 12; j++)
    {
        ph[j] = new struct phrasetop;
        ph[j]->number = 0;
        ph[j]->first = NULL;
        ph[j]->last = NULL;
        top[j] = new struct wordsdata;
        top[j]->number = 0;
        top[j]->next = NULL;
    }
    for (i = 0; i < 456976; i++)
    {
        if (word[i] != NULL)
        {
            w = word[i]->next;
            while (w != NULL)
            {
                // place the word in slot 11 and bubble it up
                top[11]->number = w->number;
                top[11]->next = w;
                j = 11;
                while (j > 1 && top[j]->number > top[j - 1]->number)
                {
                    top[0] = top[j];
                    top[j] = top[j - 1];
                    top[j - 1] = top[0];
                    j--;
                }
                // do the same for every phrase that ends in this word
                p = w->prehead;
                while (p != NULL)
                {
                    ph[11]->number = p->number;
                    ph[11]->last = w->words;
                    ph[11]->first = p->before;
                    j = 11;
                    while (j > 1 && (ph[j]->number > ph[j - 1]->number))
                    {
                        ph[0] = ph[j];
                        ph[j] = ph[j - 1];
                        ph[j - 1] = ph[0];
                        j--;
                    }
                    p = p->next;
                }
                w = w->next;
            }
        }
    }

    for (j = 1; j < 11; j++)
    {
        if (top[j]->number)         // skip empty slots when there are fewer than 10 words
            printf("\n%s :%d", top[j]->next->words, top[j]->number);
    }
    for (j = 1; j < 11; j++)
    {
        if (ph[j]->number)          // skip empty slots when there are fewer than 10 phrases
            printf("\n%s %s:%d ", ph[j]->first, ph[j]->last, ph[j]->number);
    }
    return 1;
}

A single pass over the linked lists finds the top 10 words and the top 10 phrases.

3.2.6 Getting the Top 10 (optimized version)

int gettop(struct wordsdata **word)
{
    int i = 0, j = 0;
    // Slots 1..10 hold the current top 10, slot 11 receives each candidate,
    // and slot 0 is scratch space for swapping.
    struct wordsdata *topw[12] = {};
    struct phrases *toph[12] = {};
    struct wordsdata *w = NULL;
    FILE *fp;
    fp = fopen("result.txt", "w");
    if (fp == NULL)
        return 0;
    fprintf(fp, "characters:%d \nwords:%d \nlines:%d\n", charnum, wordnum, linenum);

    for (j = 0; j < 12; j++)
    {
        toph[j] = new struct phrases;
        toph[j]->num = 0;
        toph[j]->one = NULL;
        toph[j]->tw = NULL;
        topw[j] = new struct wordsdata;
        topw[j]->number = 0;
        topw[j]->next = NULL;
    }
    for (i = 0; i < 456976; i++)
    {
        if (word[i] != NULL)
        {
            w = word[i]->next;
            while (w != NULL)
            {
                // place the word in slot 11 and bubble it up
                topw[11]->number = w->number;
                topw[11]->next = w;
                j = 11;
                while (j > 1 && topw[j]->number > topw[j - 1]->number)
                {
                    topw[0] = topw[j];
                    topw[j] = topw[j - 1];
                    topw[j - 1] = topw[0];
                    j--;
                }
                w = w->next;
            }
        }
    }
    for (j = 1; j < 11; j++)
    {
        if (topw[j]->number)
            fprintf(fp, "\n%s :%d", topw[j]->next->words, topw[j]->number);
    }
    for (Char_Phrase::iterator it = phrasemap.begin(); it != phrasemap.end(); it++)
    {
        // place the phrase in slot 11 and bubble it up
        toph[11]->one = it->first.one;
        toph[11]->tw = it->first.tw;
        toph[11]->num = it->second;
        j = 11;
        while (j > 1 && toph[j]->num > toph[j - 1]->num)
        {
            toph[0] = toph[j];
            toph[j] = toph[j - 1];
            toph[j - 1] = toph[0];
            j--;
        }
    }
    fprintf(fp, "\n");
    for (j = 1; j < 11; j++)
    {
        if (toph[j]->num)
            fprintf(fp, "\n%s %s :%d", toph[j]->one, toph[j]->tw, toph[j]->num);
    }
    fclose(fp);
    return 1;
}

A single pass over the linked lists finds the top 10 words, and a single pass over the hash table finds the top 10 phrases.

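4· Test Results

4.1 Original Version

Every test case I designed myself passes. On the teaching assistant's samples, the top 10 word and phrase frequencies are all correct, but the line, character, and word counts are each off by roughly 100.

4.2 Optimized Version

On the teaching assistant's samples, the top 10 word frequencies are all correct and the line, character, and word counts are unchanged, but the top 10 phrase frequencies are far off. While debugging with breakpoints and the debugger windows, I found that the problem lies in the hash-table lookup: it does not seem to query according to the hash comparison function I defined. By then it was already Thursday afternoon and I ran out of time to fix it; I will keep improving it.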

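5· Performance Analysis

5.1 Original Version

Profiling with the VS2017 performance tools showed that the recursive folder traversal and the word-reading function account for most of the calls. That is to be expected, since each level of recursion has to process all the words and phrases of a whole file, and there is little room for a major change here.

Phrase lookup takes far too long. In the original version, phrases hang off each word as a chain built by head insertion and searched by traversal on every lookup. I knew before writing it that its time complexity was high; I only wanted to get it working and had no hopes for its performance, so the result was as expected. That is why, close to the deadline and barely able to use the STL, I still resolutely chose to rewrite a large part of the program.

In addition, the word-comparison routine is called a great many times, which is almost unavoidable: the assignment puts many conditions on words, so comparing them is cumbersome, and the best one can do is reach the result in at most one pass over the two strings.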

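5.2 Optimized Version

Right after the large-scale switch to hashing, with no fine-grained optimization yet, the program finishes in 30 s. The run time has improved, although it is still slow. Apart from the folder-traversal and word-reading functions, which are still called the most, the hash table simplified the phrase operations and lowered their time complexity. I will keep looking for the cause of the lookup bug in the table.

Also, fgetc() may be slow, and switching to fread() may work better; I can try that later. Other detail optimizations that I did not get to will follow as well.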
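
As a rough idea of what the fread() variant could look like, the sketch below counts newline characters by reading the file in chunks. The 64 KiB chunk size and the function name are assumptions, and no timing has been measured for it.

```cpp
#include <cstddef>
#include <cstdio>

// Counts '\n' bytes by reading `fp` in chunks instead of calling fgetc once per byte.
long countNewlinesBuffered(std::FILE *fp)
{
    static const std::size_t kChunk = 64 * 1024;   // chunk size chosen arbitrarily
    char buffer[kChunk];
    long newlines = 0;
    std::size_t got;
    while ((got = std::fread(buffer, 1, kChunk, fp)) > 0) {
        for (std::size_t i = 0; i < got; ++i)
            if (buffer[i] == '\n')
                ++newlines;
    }
    return newlines;
}
```

The word scanner would need the same treatment: it would walk the buffer and carry a partially read word across chunk boundaries.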

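6· Summary

6.1 Reflections

1) Time complexity

Word lookup is still fairly fast: there are 26*26*26*26 = 456976 pointers, each four-letter prefix maps to exactly one of them, and the sequential search within a list takes an acceptable amount of time. Once phrase lookup was added, however, the speed dropped sharply because phrase lookup took far too long, which is why I later decided to rewrite it. Before coding, one should think the problem through and come up with as many candidate solutions as possible. If limited knowledge restricts the ideas available, one should actively learn more and apply it. Each idea should then be weighed for its time complexity and considered as thoroughly as possible, so that the program first runs in one's head.

2) Programming efficiency

I clung to my first idea, ended up in an awkward spot after implementing it, and then made large-scale changes, which cost a fair amount of time. After I learned to use unordered_map, the data structure became simpler, the program faster, and the coding time far shorter than for the first version. In the original version, a chain hangs off the slot computed from a word's first four letters, and each word has a further chain pointing to its phrases; just thinking about the implementation details is tedious. Still, having finished word lookup before adding phrases, I reacted to many problems more quickly. Coding and debugging piece by piece, I found that writing the phrase lookup went more smoothly, and on the teaching assistant's samples it produced the correct results on the first run, albeit slowly. The structure is ugly, but it gave me a lot of practice in basic operations, improved my coding habits, and sharpened my ability to spot problems. In the future I should choose more suitable data structures, form good programming habits, and work more efficiently.

3) Object orientation

Clinging to a first idea is, to some extent, an excuse for intellectual laziness. After writing an ugly structure, I resolutely decided to abandon it, spent the little time left learning and trying the standard library, and in the end submitted the optimized version even though it has a bug. When facing a complex problem, one should abstract it as far as possible, give the program a structured design, use data structures sensibly, use the standard library sensibly, and, if that is not enough, implement the needed libraries oneself.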

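6.2 Suggestions for the Course

My suggestion is that code and optimization be submitted separately in the future. For example: the coding task is assigned on a Thursday, a first version must be handed in the following Thursday, and then some extra time is given for optimization, say with the final optimized version due three days later on Sunday.

IF (the first submission already works well): these students can show their strengths, and it also raises their motivation to optimize;

ELSE IF (the student keeps working on optimization): much of the value cannot be measured purely by results; encouraging students to learn new things, patiently find their own problems, or optimize and improve may be worthwhile too;

ELSE: it also gives some students a buffer, because software engineering is not our only course; amid this Great Leap Forward with software engineering as the guiding principle, we must also look after our school's foundations in mathematics and physics.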

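6.3 On the Self-Cultivation of a Salted Fish (Idle Ramblings)

Whether a salted fish is really salty may not matter. If a task leaves you no chance to idle anyway, you might as well have the courage to learn something new, or at least not give up trying to improve your mindset and basic skills. The problems ahead will be more complex, and knowledge is always limited. In the current situation the result is not that important and the score even less so; refusing intellectual laziness and striving to make progress may have meaning of its own. Take our big project, a campus second-hand trading and information-sharing website: the problems are more complex and what we have learned is even more limited, yet I look forward all the more to learning more, keeping the logical structure clear at all times, and overcoming the difficulties step by step. When a product finally sits in front of us, there will surely be pride and joy. When we become deeply aware of the contradiction between ever-growing task demands and individual productivity, and realize that some things are not as easy to control as imagined, yet can still at least get to work without inner resistance, that too is the self-cultivation of a salted fish.

Everything up to Thursday went into the coding task, and I was busy on Friday morning, afternoon, and evening, so this post is a little late. Out of perfectionism I did not want to submit a sloppy blog post. Given my considerable talent for hot air, I could have: last Wednesday's reading notes were pure rambling and Teacher Deng still gave them a like. I could also have submitted first and revised later. But I am, at heart, a restrained and decent person, so I have written down this salted fish's journey properly.

Also, I hereby plant a flag: from now on I will seriously teach myself C++, Python, and Java and improve my programming mindset. The website back end uses Java, and I hope to gain a new understanding of it. Python is widely used and will see heavy use in artificial intelligence. R.O.S is written in C++ and Python, and when I have time I must learn to use R.O.S in the lab.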

7· Appendix

7.1 PSP

PSP	Stages	Estimated time (min)	Actual time (min)
· Planning	· Plan	20	30
· Estimate	· Estimate task time	10	10
· Analysis	· Requirements analysis	1 * 60	2 * 60
· Design Spec	· Produce design document	20	10
· Design Review	· Design review	20	30
· Coding Standard	· Coding standard	10	20
· Design	· Detailed design	30	1 * 60
· Coding	· Coding	8 * 60	10 * 60
· Code Review	· Code review	30	40
· Test	· Testing (test, fix, commit)	5 * 60	7 * 60
· Reporting	· Report	3 * 60	4 * 60
· Size Measurement	· Measure workload	10	10
All	Total	1170	1580

7·2 提交版源碼

  1 #include "io.h"
  2 #include "math.h"
  3 #include "stdio.h"
  4 #include "string.h"
  5 #include "stdlib.h"
  6 #include "unordered_map"
  7 
  8 using namespace std;
  9    
 10 #define small 2
 11 
 12 int wordnum = 0;
 13 int charnum = 0;
 14 int linenum = 0;
 15 
 16 struct wordsdata                //存放單詞信息
 17 {
 18     char words[1024];           //單詞字符串
 19     int number;                 //出現次數
 20     wordsdata *next;
 21 };
 22 struct phrases
 23 {
 24     char *one;
 25     char *tw;
 26     int num;
 27 };
 28 
 29 int wordcmp(char *str1, char *str2);
 30 int gettop(struct wordsdata **word);
 31 int getwords(char *path, struct wordsdata **word);
 32 int getfiles(char *path, struct _finddata_t *fileinfo, long handle);
 33 
 34 struct phrase_cmp
 35 {
 36     bool operator()(const phrases &p1, const phrases &p2) const
 37     {
 38         return ((wordcmp(p1.one, p2.one) < 2) && (wordcmp(p1.tw, p2.tw) < 2));
 39     }
 40 
 41 };
 42 struct phrase_hash
 43 {
 44     size_t operator()(const phrases &ph) const
 45     {
 46         unsigned long __h = 0;
 47         int temp;
 48         size_t i;
 49         for (i = 0; ph.one[i]; i++)
 50         {
 51             temp = ph.one[i];
 52             if (temp > 64)
 53             {
 54                 (temp > 96) ? (temp - 96) : (temp - 64);
 55                 __h += (29 * __h + temp);
 56                 __h %= 2147483647;
 57             }
 58 
 59         }
 60         for (i = 0; ph.tw[i]; i++)
 61         {
 62             temp = ph.tw[i];
 63             if (temp > 64)
 64             {
 65                 (temp > 96) ? (temp - 96) : (temp - 64);
 66                 __h += (29 * __h + temp);
 67                 __h %= 2147483647;
 68             }
 69         }
 70 
 71         return size_t(__h);
 72     }
 73 
 74 };
 75 
 76 typedef unordered_map<phrases, int, phrase_hash, phrase_cmp> Char_Phrase;
 77 Char_Phrase phrasemap;
 78 struct wordsdata *fourletter[26 * 26 * 26 * 26] = {}; //按首四字母排序
 79 
 80 int main()
 81 {
 82     int j = 0;                            
 83     long handle = 0;                           // 用於查找的句柄 
 84     struct _finddata_t fileinfo;               // 文件信息的結構體 
 85     char *path = __argv[1];
 86     
 87     getfiles(path, &fileinfo, handle);
 88 
 89     gettop(fourletter);
 90 
 91     system("pause");
 92     return 1;
 93 }
 94 
 95 int getfiles(char *path, struct _finddata_t *fileinfo, long handle)
 96 {                                    
 97     handle = _findfirst(path, fileinfo);            //第一次打開父目錄
 98     if (handle == -1)
 99         return -1;
100 
101 
102     do
103     {
104         //printf("> %s\n", path);           //顯示目錄名
105 
106         if (fileinfo->attrib & _A_SUBDIR)           //如果讀取到子目錄
107         {
108             if (strcmp(fileinfo->name, ".") != 0 && strcmp(fileinfo->name, "..") != 0)
109             {
110                 char temppath[1024] = "";              //記錄子目錄路徑
111                 long temphandle = 0;
112                 struct _finddata_t tempfileinfo;
113                 strcpy(temppath, path);
114                 strcat(temppath, "/*");
115 
116                 temphandle = _findfirst(temppath, &tempfileinfo);  //第一次打開子目錄
117                 if (temphandle == -1)
118                     return -1;
119 
120                 do                              //對子目錄所有文件遞歸
121                 {
122                     if (strcmp(tempfileinfo.name, ".") != 0 && strcmp(tempfileinfo.name, "..") != 0)
123                     {
124                         strcpy(temppath, path);
125                         strcat(temppath, "/");
126                         strcat(temppath, tempfileinfo.name);
127                         getfiles(temppath, &tempfileinfo, temphandle);
128                     }
129                 } while (_findnext(temphandle, &tempfileinfo) != -1);
130 
131                 _findclose(temphandle);
132             }//遞歸完畢
133 
134         } //子目錄讀取完畢
135         else
136             getwords(path, fourletter);
137 
138 
139     } while (_findnext(handle, fileinfo) != -1);
140 
141     _findclose(handle);       //關閉句柄
142 
143     return 1;
144 
145 }
146 
147 int getwords(char *path, struct wordsdata **word)
148 {
149     FILE *fp;
150     int j = 0;
151     int cmp = 0;
152     int num = 0;               //計算首四位地址
153     char temp = 0;             //讀取一個字符 ACSII 碼值
154     int length = 0;
155 
156     char present[1024] = "";  //存儲當前單詞
157 
158     char address[4] = "";
159     struct wordsdata *q = NULL;
160     struct wordsdata *pre = NULL;
161     struct wordsdata *neword = NULL;
162     struct wordsdata *now = NULL;
163     struct wordsdata *previous = NULL;
164     struct phrases *newphrase = NULL;
165 
166     if ((fp = fopen(path, "r")) == NULL)
167     {
168         //printf("error!!! \n", path);
169         return 0;
170     }
171     linenum++;
172     while (temp != -1)
173     {
174         //讀取字符串
175         temp = fgetc(fp);
176         if (temp > 31 && temp < 127)
177             charnum++;
178         if (temp == '\n' || temp == '\r')
179             linenum++;
180 
181         while ((temp >= '0' && temp <= '9') || (temp >= 'a' && temp <= 'z') || (temp >= 'A' && temp <= 'Z'))
182         {
183             if (length != -1 && length < 4)
184             {
185                 if (temp >= 'A')  //是字母
186                 {
187                     present[length] = temp;
188                     address[length] = (temp >= 'a' ? (temp - 'a') : (temp - 'A'));
189                     length++;
190                 }
191                 else            //不是字母
192                     length = -1;
193             }
194             else if (length >= 4)
195             {
196                 present[length] = temp;
197                 length++;
198             }
199             temp = fgetc(fp);
200             if (temp > 31 && temp < 127)
201                 charnum++;
202             if (temp == '\n' || temp == '\r')
203                 linenum++;
204         } // end while
205 
206           //判斷是否為單詞
207         if (length >= 4)
208         {
209             wordnum++;
210 
211             //計算首四位代表地址
212             num = address[0] * 17576 + address[1] * 676 + address[2] * 26 + address[3];
213 
214             //插入當前單詞
215             if (word[num] == NULL)
216             {
217                 word[num] = new wordsdata;
218                 neword = new wordsdata;
219                 neword->number = 1;
220                 neword->next = NULL;
221                 strcpy(neword->words, present);
222                 word[num]->next = neword;
223                 now = neword;
224             }
            else
            {
                // Walk the sorted list while wordcmp reports `small` for the stored word
                pre = word[num];
                q = pre->next;
                cmp = wordcmp(q->words, present);

                while (cmp == small)
                {
                    pre = q;
                    q = q->next;
                    if (q != NULL)
                        cmp = wordcmp(q->words, present);
                    else
                        break;
                }
                if (q != NULL && cmp <= 1)  // an equivalent word is already stored
                {
                    now = q;
                    q->number++;
                    if (cmp == 1)           // keep the spelling that sorts first
                        strcpy(q->words, present);
                }
                else                        // no equivalent word yet: link a new node in before q
                {
                    neword = new wordsdata;
                    neword->number = 1;
                    strcpy(neword->words, present);
                    pre->next = neword;
                    neword->next = q;
                    now = neword;
                }
            }

            // Count the phrase formed by the previous word and the current one
            if (previous != NULL)
            {
                newphrase = new phrases;

                newphrase->tw = now->words;
                newphrase->one = previous->words;
                newphrase->num = 0;

                // operator[] creates a missing entry with count 0, so one
                // statement covers both the first and the repeated occurrence
                phrasemap[*newphrase]++;
                delete newphrase;           // the map keeps its own copy of the key
            }
            previous = now;

            // Clear the current word buffer
            for (int j = 0; j < 1024 && present[j]; j++)
                present[j] = 0;
        }
        length = 0;
    }

    fclose(fp);
    return 1;
}

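// Compare two words under the task's equivalence rule (case-insensitive,
// trailing digits ignored).
// Returns -1, 0 or 1 when the words are equivalent (the sign of strcmp on the
// raw strings), 2 when str1 sorts first and 3 when str2 sorts first.
int wordcmp(char *str1, char *str2)
{
    char *p1 = str1;
    char *p2 = str2;
    char q1 = *p1;
    char q2 = *p2;

    // Fold lower case to upper case before comparing
    if (q1 >= 'a' && q1 <= 'z')
        q1 -= 32;

    if (q2 >= 'a' && q2 <= 'z')
        q2 -= 32;

    // Advance past the common prefix
    while (q1 && q2 && q1 == q2)
    {
        p1++;
        p2++;

        q1 = *p1;
        q2 = *p2;

        if (q1 >= 'a' && q1 <= 'z')
            q1 -= 32;

        if (q2 >= 'a' && q2 <= 'z')
            q2 -= 32;
    }

    // Skip the digits that follow the point of difference
    while (*p1 >= '0' && *p1 <= '9')
        p1++;
    while (*p2 >= '0' && *p2 <= '9')
        p2++;

    if (*p1 == 0 && *p2 == 0)   // the two words are equivalent
    {
        // strcmp only guarantees the sign of its result, so normalise it to
        // -1, 0 or 1 to keep it distinct from the codes 2, 3 and 4 below
        int order = strcmp(str1, str2);
        return (order > 0) - (order < 0);
    }

    // A word whose remainder is only digits ends here once they are stripped,
    // so it must sort before any word that continues
    if (*p1 == 0)
        q1 = 0;
    if (*p2 == 0)
        q2 = 0;

    if (q1 < q2)                   // str1 sorts first
        return 2;

    if (q1 > q2)                   // str2 sorts first
        return 3;

    return 4;
}
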
// Write the totals and the ten most frequent words and phrases to result.txt.
// Returns 1 on success and 0 if the output file cannot be opened.
int gettop(struct wordsdata **word)
{
    int i = 0, j = 0;
    // Slots 1..10 hold the current top ten in descending order, slot 11
    // receives each candidate, and slot 0 is only a scratch pointer for swaps
    struct wordsdata *topw[12] = {};
    struct phrases *toph[12] = {};
    struct wordsdata *w = NULL;
    FILE *fp;
    fp = fopen("result.txt", "w");
    if (fp == NULL)
        return 0;
    fprintf(fp,"characters:%d \nwords:%d \nlines:%d\n",  charnum,wordnum, linenum);

    for (j = 1; j < 12; j++)
    {
        toph[j] = new struct phrases;
        toph[j]->num = 0;
        topw[j] = new struct wordsdata;
        topw[j]->number = 0;
    }
    // Scan every bucket of the word table (26^4 = 456976 buckets)
    for (i = 0; i < 456976; i++)
    {
        if (word[i] != NULL)
        {
            w = word[i]->next;          // skip the dummy head node
            while (w != NULL)
            {
                // Put the candidate in slot 11, then bubble it towards slot 1
                topw[11]->number = w->number;
                topw[11]->next = w;     // next is reused to point at the real node
                j = 11;
                while (j > 1 && topw[j]->number > topw[j - 1]->number)
                {
                    topw[0] = topw[j];
                    topw[j] = topw[j - 1];
                    topw[j - 1] = topw[0];
                    j--;
                }
                w = w->next;
            }
        }
    }
    for (j = 1; j < 11; j++)
    {
        if (topw[j]->number)
            fprintf(fp,"\n%s :%d", topw[j]->next->words, topw[j]->number);
    }
    // Same selection for the phrases stored in the hash map
    for (Char_Phrase::iterator it = phrasemap.begin(); it != phrasemap.end(); it++)
    {
        toph[11]->one = it->first.one;
        toph[11]->tw = it->first.tw;
        toph[11]->num = it->second;
        j = 11;
        while (j > 1 && toph[j]->num > toph[j - 1]->num)
        {
            toph[0] = toph[j];
            toph[j] = toph[j - 1];
            toph[j - 1] = toph[0];
            j--;
        }
    }
    fprintf(fp, "\n");
    for (j = 1; j < 11; j++)
    {
        if (toph[j]->num)
            fprintf(fp,"\n%s %s :%d", toph[j]->one, toph[j]->tw, toph[j]->num);
    }
    fclose(fp);

    // Release the working slots (slot 0 only ever aliases one of slots 1..11)
    for (j = 1; j < 12; j++)
    {
        delete toph[j];
        delete topw[j];
    }
    return 1;
}
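
The hand-written insertion in gettop can also be expressed with the standard library. The following is a minimal sketch rather than the code that was submitted: it assumes the counts live in a std::unordered_map<std::string, int> instead of the wordsdata lists, and the names Entry and top_n are hypothetical. Ties are broken in dictionary order, which is one reasonable reading of the task's output rule.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types and helper, not part of the original program.
using Entry = std::pair<std::string, int>;

// Return the n most frequent entries, highest count first;
// entries with equal counts are ordered by dictionary order.
std::vector<Entry> top_n(const std::unordered_map<std::string, int> &counts,
                         std::size_t n)
{
    // Copy the map into a vector so it can be reordered
    std::vector<Entry> entries(counts.begin(), counts.end());
    n = std::min(n, entries.size());

    // Only the first n positions are put in order; the rest stay unsorted
    std::partial_sort(entries.begin(),
                      entries.begin() + static_cast<std::ptrdiff_t>(n),
                      entries.end(),
                      [](const Entry &a, const Entry &b) {
                          if (a.second != b.second)
                              return a.second > b.second;
                          return a.first < b.first;
                      });
    entries.resize(n);
    return entries;
}
```

Calling top_n(counts, 10) after all files have been read gives the ten entries to print. Because std::partial_sort orders only the first n positions, the selection still runs once over the finished table, in line with the conclusion in section 2.4 that ranking after reading is cheaper than updating the top ten on every word.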


