Using IKAnalyzer Standalone: Configuring Extension Dictionaries

This article is a repost. 2014-07-22 13:55

Three points need attention (otherwise the extension dictionaries never take effect):

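The dictionary files (with the .dic suffix) must be saved as UTF-8 without a BOM, as the usage documentation says. If you are not sure what UTF-8 without BOM means, the simplest approach is to open the file in Notepad++, choose Encoding -> Encode in UTF-8 without BOM, and save.
In the project preferences, set the encoding to UTF-8.
Mind the paths of the dictionaries and of the IKAnalyzer.cfg.xml configuration file. IKAnalyzer.cfg.xml must sit in the root of src. The dictionaries can go anywhere, as long as IKAnalyzer.cfg.xml points to them correctly. In the configuration that follows, my two dictionary files my.dic and mine.dic are in the com.org.config package under src. Note that there must be no / in front of com; otherwise the path is treated as absolute and the file is not found.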
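
If Notepad++ is not at hand, the BOM check from the first point can also be done in code. The following is a minimal sketch using only the JDK; the class name DicBomCheck and its methods are made up for this illustration and are not part of IK Analyzer. It looks for the three UTF-8 BOM bytes (EF BB BF) at the start of a file and rewrites the file without them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of IK Analyzer.
final class DicBomCheck {

    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns the data without a leading UTF-8 BOM; unchanged if there is none. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DicBomCheck <path-to-dic-file>");
            return;
        }
        Path dic = Paths.get(args[0]);
        byte[] data = Files.readAllBytes(dic);
        if (hasUtf8Bom(data)) {
            // Overwrite the file in place with the BOM removed.
            Files.write(dic, stripUtf8Bom(data));
            System.out.println("BOM removed from " + dic);
        } else {
            System.out.println("No BOM found in " + dic);
        }
    }
}
```

Run it against each .dic file before starting the analyzer, for example with the path of my.dic as the only argument.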

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>

<entry key="ext_dict">com/org/config/my.dic;com/org/config/mine.dic;</entry>

<entry key="ext_stopwords">/com/org/config/stopword.dic</entry>
</properties>
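
The configuration above uses the standard Java XML properties format, so it can be read with java.util.Properties to double-check the paths before running the analyzer. The sketch below is only an illustration: the class ExtDictConfigCheck and its methods are hypothetical, and it assumes, as the note about the leading / suggests, that IK splits the entry on ";" and resolves each item relative to the classpath root. It lists the configured paths and reports any that start with "/". Note that the ext_stopwords entry in the configuration above does start with "/", so this check would flag it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of IK Analyzer.
final class ExtDictConfigCheck {

    /** Parses XML in the java.util.Properties format. */
    static Properties parse(String xml) {
        Properties cfg = new Properties();
        try {
            cfg.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cfg;
    }

    /** Splits a semicolon-separated path list, dropping empty items. */
    static List<String> splitPaths(String value) {
        List<String> paths = new ArrayList<>();
        if (value == null) {
            return paths;
        }
        for (String item : value.split(";")) {
            String trimmed = item.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    /** Returns the paths configured under key that start with "/". */
    static List<String> leadingSlashPaths(Properties cfg, String key) {
        List<String> flagged = new ArrayList<>();
        for (String path : splitPaths(cfg.getProperty(key))) {
            if (path.startsWith("/")) {
                flagged.add(path);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "<entry key=\"ext_dict\">com/org/config/my.dic;com/org/config/mine.dic;</entry>\n"
                + "<entry key=\"ext_stopwords\">/com/org/config/stopword.dic</entry>\n"
                + "</properties>\n";
        Properties cfg = parse(xml);
        System.out.println("ext_dict paths: " + splitPaths(cfg.getProperty("ext_dict")));
        System.out.println("ext_dict with leading /: " + leadingSlashPaths(cfg, "ext_dict"));
        System.out.println("ext_stopwords with leading /: " + leadingSlashPaths(cfg, "ext_stopwords"));
    }
}
```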

Code for using IKAnalyzer standalone:

package com.org;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKAnalyzerTest {

    public static void main(String[] args) {
        String str = "最希望從企業得到的是獨家的內容或銷售信息，獲得打折或促銷信息等；最不希望企業進行消息或廣告轟炸及訪問用戶的個人信息等。這值得使用社會化媒體的企業研究";

        IKAnalysis(str);
    }

    /** Segments str with IK in smart mode, prints the tokens joined by " , " and returns them. */
    public static String IKAnalysis(String str) {
        StringBuilder sb = new StringBuilder();
        try {
            // Read the text directly; this avoids a byte round trip through the platform default charset.
            Reader read = new StringReader(str);
            IKSegmenter iks = new IKSegmenter(read, true); // true = smart segmentation
            Lexeme t;
            while ((t = iks.next()) != null) {
                sb.append(t.getLexemeText()).append(" , ");
            }
            // Drop the final space, so the result ends with " ,"; skipped when no token was produced.
            if (sb.length() > 0) {
                sb.delete(sb.length() - 1, sb.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(sb.toString());
        return sb.toString();
    }
}

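Output (the first three lines are IK's log messages: "加載擴展詞典" means "loading extension dictionary" and "加載擴展停止詞典" means "loading extension stop-word dictionary"):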

加載擴展詞典：com/org/config/my.dic

加載擴展詞典：com/org/config/mine.dic

加載擴展停止詞典：/com/org/config/stopword.dic

最希望 , 從 , 企業 , 得到 , 的 , 是 , 獨家 , 的 , 內容或銷售信息 , 獲得 , 打折 , 或 , 促銷信息 , 等 , 最不 , 希望 , 企業 , 進行 , 消息 , 或 , 廣告 , 轟炸 , 及 , 訪問 , 用戶 , 的 , 個人信息 , 等 , 這 , 值得 , 使用 , 社會化媒體 , 的 , 企業研究 ,

The words set in bold in the original post are the ones that come from the extension dictionaries.
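
The output above ends with a dangling " ," because the code removes only the last character of the buffer. If that trailing separator is unwanted, one option is to collect the token texts in a list and join them afterwards. This is a small sketch of that idea with plain JDK calls; the class TokenJoin is invented for the illustration and the tokens are hard-coded instead of coming from IKSegmenter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of IK Analyzer.
final class TokenJoin {

    /** Joins tokens with " , " between them and nothing after the last one. */
    static String joinTokens(List<String> tokens) {
        return String.join(" , ", tokens);
    }

    public static void main(String[] args) {
        // In real use these would be the getLexemeText() values returned by IKSegmenter.
        List<String> tokens = Arrays.asList("最希望", "從", "企業");
        System.out.println(joinTokens(tokens)); // prints: 最希望 , 從 , 企業
    }
}
```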

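Project directory structure: IKAnalyzer.cfg.xml sits directly under src, IKAnalyzerTest.java is in src/com/org, and the dictionary files my.dic, mine.dic and stopword.dic are in src/com/org/config.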

Addendum: adding words to the dictionary manually

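    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.wltea.analyzer.cfg.Configuration;
    import org.wltea.analyzer.cfg.DefualtConfig; // may be spelled DefaultConfig in other IK releases; match your jar
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;
    import org.wltea.analyzer.dic.Dictionary;
    import org.wltea.analyzer.dic.Hit;

    // The class name is illustrative; the original snippet showed only the two methods.
    public class IKDictionaryTest {

        public static void main(String[] args) throws IOException {
            String s = "中文分詞工具包";
            Configuration cfg = DefualtConfig.getInstance(); // load the configuration (dictionary paths)
            cfg.setUseSmart(true); // enable smart segmentation
            Dictionary.initial(cfg);

            Dictionary dictionary = Dictionary.getSingleton();
            // Uncomment to add custom words at runtime:
            // List<String> words = new ArrayList<String>();
            // words.add("基礎班");
            // words.add("高級會計實務");
            // dictionary.addWords(words);

            System.out.println(cfg.getMainDictionary()); // path of the built-in main dictionary
            System.out.println(cfg.getQuantifierDicionary()); // path of the quantifier dictionary

            // Check whether the word is present in the main dictionary.
            Hit hit = dictionary.matchInMainDict("基礎班".toCharArray());
            System.out.println(hit.isMatch());

            System.out.println(queryWords(s));
        }

        /**
         * Segments text with IK.
         *
         * @param query the text to segment
         * @return the tokens in order of appearance
         * @throws IOException if reading the input fails
         */
        public static List<String> queryWords(String query) throws IOException {
            List<String> list = new ArrayList<String>();
            StringReader input = new StringReader(query.trim());

            IKSegmenter ikSeg = new IKSegmenter(input, true); // true = smart segmentation, false = finest granularity
            for (Lexeme lexeme = ikSeg.next(); lexeme != null; lexeme = ikSeg.next()) {
                list.add(lexeme.getLexemeText());
            }

            return list;
        }
    }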
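
Words added at runtime are gone once the process exits, so it can be convenient to write them to a .dic file as well. The sketch below assumes the usual IK dictionary layout of one word per line and relies on the fact that Files.write with StandardCharsets.UTF_8 does not emit a BOM, which matches the encoding requirement stated at the top. DicWriter and the default file name are hypothetical and not part of IK Analyzer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of IK Analyzer.
final class DicWriter {

    /** Writes one word per line as UTF-8 without a BOM, replacing any existing file. */
    static void writeDic(Path target, List<String> words) {
        try {
            Files.write(target, words, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Target path: first argument if given, otherwise a file in the working directory.
        Path target = Paths.get(args.length > 0 ? args[0] : "my.dic");
        writeDic(target, Arrays.asList("基礎班", "高級會計實務"));
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```

The file still has to be placed on the classpath and listed under ext_dict in IKAnalyzer.cfg.xml before IK picks it up.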
