1. Keyword Extraction
GitHub repository: https://github.com/yanyiwu/cppjieba
1. Segment the sentence into words.
2. Filter out single-character words and stop words.
3. Compute TF-IDF weights: TF is the raw term frequency (count * 1.0); IDF is looked up from an external dictionary file, falling back to the average IDF when a word is not found there.
The code is as follows:

```cpp
void Extract(const string& sentence, vector<Word>& keywords, size_t topN) const {
  vector<string> words;
  segment_.Cut(sentence, words);
  map<string, Word> wordmap;
  size_t offset = 0;
  for (size_t i = 0; i < words.size(); ++i) {
    size_t t = offset;
    offset += words[i].size();
    // Skip single-character words and stop words
    if (IsSingleWord(words[i]) || stopWords_.find(words[i]) != stopWords_.end()) {
      continue;
    }
    wordmap[words[i]].offsets.push_back(t);
    wordmap[words[i]].weight += 1.0;  // TF: accumulate raw term frequency
  }
  if (offset != sentence.size()) {
    XLOG(ERROR) << "words illegal";
    return;
  }
  keywords.clear();
  keywords.reserve(wordmap.size());
  for (map<string, Word>::iterator itr = wordmap.begin(); itr != wordmap.end(); ++itr) {
    unordered_map<string, double>::const_iterator cit = idfMap_.find(itr->first);
    if (cit != idfMap_.end()) {
      itr->second.weight *= cit->second;   // weight = TF * IDF
    } else {
      itr->second.weight *= idfAverage_;   // fall back to the average IDF
    }
    itr->second.word = itr->first;
    keywords.push_back(itr->second);
  }
  // Keep only the topN highest-weighted keywords
  topN = min(topN, keywords.size());
  partial_sort(keywords.begin(), keywords.begin() + topN, keywords.end(), Compare);
  keywords.resize(topN);
}
```
2. Dictionary Files
## Segmentation
### jieba.dict.utf8/gbk
Dictionary used by the max-probability segmenter (MPSegment: Max Probability).
### hmm_model.utf8/gbk
Model file used by the hidden Markov model segmenter (HMMSegment: Hidden Markov Model).
__MixSegment (which combines MPSegment and HMMSegment) uses both of the above dictionaries.__
## Keyword Extraction
### idf.utf8
IDF (Inverse Document Frequency)
KeywordExtractor uses the classic TF-IDF algorithm, so this dictionary is needed to supply the IDF value for each word.
### stop_words.utf8
Stop-word dictionary.