1. Keyword Extraction
GitHub: https://github.com/yanyiwu/cppjieba
1. Segment the sentence into words.
2. Filter out single-character words and stop words.
3. Score with TF-IDF: TF is the raw term count (each occurrence contributes 1.0); IDF is looked up in an external dictionary, and words missing from it fall back to the average IDF.
The code is as follows:
```cpp
void Extract(const string& sentence, vector<Word>& keywords, size_t topN) const {
  // 1. Segment the sentence into words.
  vector<string> words;
  segment_.Cut(sentence, words);

  // 2. Count term frequency, skipping single-character words and stop words.
  map<string, Word> wordmap;
  size_t offset = 0;
  for (size_t i = 0; i < words.size(); ++i) {
    size_t t = offset;
    offset += words[i].size();
    if (IsSingleWord(words[i]) || stopWords_.find(words[i]) != stopWords_.end()) {
      continue;
    }
    wordmap[words[i]].offsets.push_back(t);
    wordmap[words[i]].weight += 1.0;  // TF: each occurrence adds 1.0
  }
  if (offset != sentence.size()) {
    XLOG(ERROR) << "words illegal";
    return;
  }

  // 3. Multiply TF by IDF; unknown words fall back to the average IDF.
  keywords.clear();
  keywords.reserve(wordmap.size());
  for (map<string, Word>::iterator itr = wordmap.begin(); itr != wordmap.end(); ++itr) {
    unordered_map<string, double>::const_iterator cit = idfMap_.find(itr->first);
    if (cit != idfMap_.end()) {
      itr->second.weight *= cit->second;
    } else {
      itr->second.weight *= idfAverage_;
    }
    itr->second.word = itr->first;
    keywords.push_back(itr->second);
  }

  // Keep only the topN highest-weighted keywords.
  topN = min(topN, keywords.size());
  partial_sort(keywords.begin(), keywords.begin() + topN, keywords.end(), Compare);
  keywords.resize(topN);
}
```
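For context, here is a minimal usage sketch modeled on the demo in the cppjieba repository; the `../dict/...` paths are assumptions about where the shipped dictionary files live relative to the build directory:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include "cppjieba/KeywordExtractor.hpp"

int main() {
  // Paths are assumptions: they point at the dict/ directory shipped with cppjieba.
  cppjieba::KeywordExtractor extractor("../dict/jieba.dict.utf8",
                                       "../dict/hmm_model.utf8",
                                       "../dict/idf.utf8",
                                       "../dict/stop_words.utf8");

  std::string sentence = "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。";
  std::vector<cppjieba::KeywordExtractor::Word> keywords;
  extractor.Extract(sentence, keywords, 5);  // keep the top 5 keywords

  // Each Word carries the term, its byte offsets, and its TF-IDF weight.
  for (size_t i = 0; i < keywords.size(); ++i) {
    std::cout << keywords[i].word << "\t" << keywords[i].weight << std::endl;
  }
  return 0;
}
```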
2. Dictionary Notes:
## Segmentation
### jieba.dict.utf8/gbk
The dictionary used by maximum-probability segmentation (MPSegment: Max Probability).
### hmm_model.utf8/gbk
The dictionary used by Hidden Markov Model segmentation (HMMSegment: Hidden Markov Model).
__MixSegment (a mix of MPSegment and HMMSegment) uses both of the above dictionaries, as in the sketch below.__
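To make the dictionary-to-segmenter pairing concrete, here is a minimal sketch; the class names match cppjieba's headers, but the exact constructor argument lists and the `../dict/...` paths are assumptions based on the repository layout:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include "cppjieba/MPSegment.hpp"
#include "cppjieba/HMMSegment.hpp"
#include "cppjieba/MixSegment.hpp"

int main() {
  const std::string dict = "../dict/jieba.dict.utf8";  // used by MPSegment
  const std::string hmm  = "../dict/hmm_model.utf8";   // used by HMMSegment

  cppjieba::MPSegment mp(dict);         // max probability: jieba.dict only
  cppjieba::HMMSegment hmmSeg(hmm);     // HMM: hmm_model only
  cppjieba::MixSegment mix(dict, hmm);  // mix: both dictionaries

  std::vector<std::string> words;
  mix.Cut("他来到了网易杭研大厦", words);
  for (size_t i = 0; i < words.size(); ++i) {
    std::cout << words[i] << (i + 1 < words.size() ? "/" : "\n");
  }
  return 0;
}
```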
## Keyword Extraction
### idf.utf8
IDF (Inverse Document Frequency)
KeywordExtractor uses the classic TF-IDF algorithm, so this dictionary is needed to supply the IDF values.
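As an illustration of how this dictionary could feed the `idfMap_` and `idfAverage_` used in `Extract` above, here is a hypothetical loader, not cppjieba's actual loading code; it assumes each line of idf.utf8 holds a word and its IDF value separated by whitespace:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical sketch: build an IDF map from a file of "word idf_value" lines
// and compute the average IDF used as the fallback for out-of-dictionary words.
bool LoadIdfDict(const std::string& path,
                 std::unordered_map<std::string, double>& idfMap,
                 double& idfAverage) {
  std::ifstream in(path.c_str());
  if (!in) {
    return false;
  }
  double sum = 0.0;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream iss(line);
    std::string word;
    double idf;
    if (iss >> word >> idf) {
      idfMap[word] = idf;
      sum += idf;
    }
  }
  // Words missing from the dictionary fall back to this average.
  idfAverage = idfMap.empty() ? 0.0 : sum / idfMap.size();
  return true;
}
```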
### stop_words.utf8
The stop-word dictionary.