Word Frequency Count

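Reference:
I made some additions and changes to the referenced code; the overall approach is unchanged.
The point is not that this problem is hard. Many STL structures help here, so the code can be kept very simple.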

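Problem

7-1 Word Frequency Count (30 points)

Write a program that, for a piece of English text, counts the number of distinct words and reports the top 10% of words by frequency.
A "word" is a contiguous string of at most 80 word characters, but a word longer than 15 characters is truncated to its first 15 word characters. The legal "word characters" are uppercase and lowercase letters, digits, and the underscore; every other character is treated as a word separator.
Input format:
The input is a non-empty text terminated by the symbol #. The input is guaranteed to contain at least 10 distinct words.
Output format:
On the first line, print the number of distinct words in the text. Note that words are case-insensitive; for example, "PAT" and "pat" are the same word.
Then print the top 10% most frequent words in decreasing order of frequency, in the format frequency:word. Ties are printed in increasing lexicographic order.
Sample input: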
This is a test.

The word "this" is the word with the highest frequency.

Longlonglonglongword should be cut off, so is considered as the same as longlonglonglonee. But this_8 is different than this, and this, and this...#
this line should be ignored.
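Sample output: (Note: although the word "the" also appears 4 times, only the top 10% of words are printed, that is, the first 2 of 23 words. In alphabetical order "the" ranks third, so it is not printed.)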
23
5:this
4:is

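Approach

The problem asks for word frequencies, so record how many times each word appears.
Because '#' marks the end of the input, the text has to be read and checked one character at a time. How does that turn into a word? Just append each word character that is read to the end of a string.
Then build a table that holds every word together with its number of occurrences.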
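The character-by-character idea above can be sketched on its own, separately from the counting. This is a minimal sketch, not the code of the final program: split_words is a hypothetical helper name, and it works on an in-memory string instead of standard input. It lowercases each word character, keeps at most 15 characters per word, and stops at the first '#'.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits text into lowercase words of at most 15
// characters, stopping at the first '#'.
std::vector<std::string> split_words(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char raw : text) {
        unsigned char c = static_cast<unsigned char>(raw);
        if (std::isalnum(c) || c == '_') {
            // Word character: append it unless the word already has 15 characters.
            if (current.size() < 15) {
                current += static_cast<char>(std::tolower(c));
            }
        } else {
            // Separator: the current word (if any) is complete.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
            if (c == '#') {
                break;
            }
        }
    }
    // Only reached with a pending word when the text has no '#' terminator.
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

For example, split_words("This is_a Test9#ignored") yields this, is_a and test9; everything after '#' is skipped.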

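Structures and functions used

The begin() and end() functions

Almost every standard container (vector, map, and so on) provides these. begin() returns an iterator to the first element, and end() returns an iterator one position past the last element, so the range [begin(), end()) covers the whole container. Note that vector iterators support + and - directly, while map iterators can only be moved one step at a time with ++ and -- (or with std::next and std::advance).
They are typically used when initializing a container or as arguments to a function.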

vector<pair<string, int>> v(ma.begin(), ma.end());    // define a vector of pair<string, int> initialized with copies of every element of ma, from the first to the last
sort(v.begin(), v.end(), cmp);    // sort v from its first to its last element using the custom comparison function cmp (custom comparison functions are covered later)
// example from C++ Reference
int myints[] = {32,71,12,45,26,80,53,33};
  std::vector<int> myvector (myints, myints+8);               // 32 71 12 45 26 80 53 33 (copy the eight array elements into the vector)

  std::sort (myvector.begin(), myvector.begin()+4);           //(12 32 45 71)26 80 53 33 (sort the first four elements)
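To make the difference between vector and map iterators concrete, here is a small sketch. The names sort_prefix and nth_key are hypothetical helpers introduced only for illustration. The first one uses iterator arithmetic on a vector, the second uses std::next because map iterators do not support +. nth_key assumes n is smaller than the number of elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns a copy of values whose first k elements are
// sorted in ascending order (k is clamped to the size of the vector).
std::vector<int> sort_prefix(std::vector<int> values, std::size_t k) {
    k = std::min(k, values.size());
    // vector iterators are random-access, so begin() + k is allowed.
    std::sort(values.begin(), values.begin() + static_cast<std::ptrdiff_t>(k));
    return values;
}

// Hypothetical helper: returns the n-th key of a map in sorted key order.
// Assumes n < m.size().
std::string nth_key(const std::map<std::string, int>& m, std::size_t n) {
    // map iterators are bidirectional; std::next steps forward n times.
    return std::next(m.begin(), static_cast<std::ptrdiff_t>(n))->first;
}
```

With the array from the C++ Reference example, sort_prefix(v, 4) leaves the last four elements where they were.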

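pair

pair is a class template defined by the C++ standard library that stores two values, possibly of different types, together. The same thing could also be done with a struct.

pair<string, int> a;      // definition
a.first = "Hello World";  // operate on the first value (a string)
a.second = 3;             // operate on the second value (an int)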
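As a rough illustration of the remark that a struct could do the same job, the sketch below puts a hand-written struct next to the equivalent pair. WordCount and to_pair are hypothetical names used only here. One practical difference is that pair already comes with comparison operators (first is compared before second), which a plain struct does not have.

```cpp
#include <string>
#include <utility>

// Hypothetical struct holding the same two fields as pair<string, int>.
struct WordCount {
    std::string word;
    int count;
};

// Hypothetical helper: converts the struct into the equivalent pair.
std::pair<std::string, int> to_pair(const WordCount& wc) {
    return std::make_pair(wc.word, wc.count);
}
```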

map

First, look at the definition on C++ Reference:

/*
Maps are associative containers that store elements formed by a combination of a key value and a mapped value, following a specific order.
In a map, the key values are generally used to sort and uniquely identify the elements, while the mapped values store the content associated to this key. 
The types of key and mapped value may differ, and are grouped together in member type value_type, which is a pair type combining both:
*/
 
typedef pair<const Key, T> value_type;

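In other words, a map can be understood as a container that stores a set of key-value pairs, where the key and the value may have different types.
Think of them as a "key" and "the value of that key" (yes, this is a lot like pair).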
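For this problem the key is the word and the mapped value is its count. The sketch below shows the counting idiom; count_words is a hypothetical helper name and the input is assumed to be already split into lowercase words. operator[] creates a missing entry with the value 0, so ++ works the first time a word is seen.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: counts how often each word occurs.
std::map<std::string, int> count_words(const std::vector<std::string>& words) {
    std::map<std::string, int> freq;
    for (const std::string& w : words) {
        // freq[w] inserts w with count 0 if it is not present yet.
        ++freq[w];
    }
    return freq;
}
```

Iterating over the result visits the words in increasing key order, which is why the map alone is not enough for the required output and a sort by count is still needed.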

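vector (a container)

vector can be seen as an enhanced array that stores multiple values of the same type. What matters here is sorting it with sort().
The following example can be found on C++ Reference:

  int myints[] = {32,71,12,45,26,80,53,33};
  std::vector<int> myvector (myints, myints+8);               // initialized to 32 71 12 45 26 80 53 33 (the eight elements of myints are copied into the vector)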

  // using default comparison (operator <):
  std::sort (myvector.begin(), myvector.begin()+4);           //(12 32 45 71)26 80 53 33

  // using function as comp
  std::sort (myvector.begin()+4, myvector.end(), myfunction); // 12 32 45 71(26 33 53 80)

  // using object as comp
  std::sort (myvector.begin(), myvector.end(), myobject);     //(12 26 32 33 45 53 71 80)

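As the example shows, sort() accepts a custom comparison function; if none is supplied, it sorts in non-decreasing order by default.
So how do you write your own comparison function?
Reference: [C++] From the simplest use of sort on a vector to sort with a custom comparison function comp on structs
First, a custom comparison function returns bool. Here is an example: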

    bool comp(int a,int b){
        return a>b;
    }

    sort(v.begin(), v.end(), comp);

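While sorting, sort calls comp to decide the order of two values. The default comparison returns true when a<b, which sorts from smallest to largest. My comp returns true when a>b, so the result is sorted from largest to smallest instead. One way to think about it: comp(a, b) returns true when a must be placed before b, so after sorting a is the earlier element and b the later one, and the custom function defines the relationship between them. The function must return false for equal elements, so use > or <, not >= or <=.
Now look at the comparison function used in the solution: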

bool cmp(pair<string, int> a, pair <string, int> b) {
	bool result = false;
	if (a.second == b.second&&a.first < b.first) {
		result = true;
	}
	else if (a.second > b.second) {
		result = true;
	}
	return result;
}

sort(v.begin(), v.end(), cmp);

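The call above sorts the vector from its first element to its last.
Because the elements are pairs, the custom comparison function puts the pair with the larger second (the word count) first; when the counts are equal, the pair with the smaller first (the word) comes first, which matches "ties are printed in increasing lexicographic order".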
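The same ordering can also be written more compactly. The sketch below is an equivalent alternative, not the code used in the solution: Entry, by_count_then_word and rank_words are hypothetical names, and the parameters are taken by const reference to avoid copying the strings on every comparison.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (word, count)

// Returns true when a must come before b:
// higher count first, ties broken by increasing lexicographic order.
bool by_count_then_word(const Entry& a, const Entry& b) {
    if (a.second != b.second) {
        return a.second > b.second;
    }
    return a.first < b.first;
}

// Hypothetical helper: copies the map into a vector and sorts it.
std::vector<Entry> rank_words(const std::map<std::string, int>& freq) {
    std::vector<Entry> ranked(freq.begin(), freq.end());
    std::sort(ranked.begin(), ranked.end(), by_count_then_word);
    return ranked;
}
```

For counts like the sample's (this: 5, is: 4, the: 4), the ranking starts with this, is, the, so "the" is the first word cut off when only two are printed.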

Code

#include <algorithm>
#include <cstdio>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using namespace std;

bool cmp(pair<string, int> a, pair<string, int> b);

int main() {
	int ch;		// int rather than char, so that EOF can be told apart from real characters
	string s;	// accumulates the characters of the current word
	map<string, int> ma;	// the map records word frequency: the word (string) has appeared int times
	do {
		ch = getchar();
		// the character is a legal word character (letter, digit or underscore)
		if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_') {
			if (s.size() < 15) {	// keep only the first 15 characters of a word
				if (ch >= 'A' && ch <= 'Z') {	// convert uppercase to lowercase
					ch += 'a' - 'A';
				}
				s += (char)ch;	// append ch to s; string overloads += to mean "append"
			}
		}
		else {		// any other character ends the current word: increment its count
			if (s.size() > 0) {
				ma[s]++;
			}
			s.clear();		// clear the string for the next word
		}
	} while (ch != '#' && ch != EOF);	// stop at '#' (or at end of input, as a safeguard)
	vector<pair<string, int>> v(ma.begin(), ma.end());	// a vector of pairs (think of vector as an enhanced array)
	sort(v.begin(), v.end(), cmp);
	cout << v.size() << endl;
	int cnt = (int)(v.size() / 10);	// top 10%, rounded down
	for (int i = 0; i < cnt; i++) {
		cout << v[i].second << ":" << v[i].first << endl;
	}
	return 0;
}

// Each pair holds a string (the word) and an int (its count).
// Returns true when a must come before b: higher count first, ties in increasing lexicographic order.
bool cmp(pair<string, int> a, pair<string, int> b) {
	bool result = false;
	if (a.second == b.second && a.first < b.first) {
		result = true;
	}
	else if (a.second > b.second) {
		result = true;
	}
	return result;
}
