Lucene Usage Example

Reposted article, 2018-04-03 10:54

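Lucene began as a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete search engine, but an architecture for building one, providing a complete query engine and indexing engine plus a partial text analysis engine (for two Western languages, English and German). Its aim is to give developers an easy-to-use toolkit for adding full-text search to a target system, or for building a complete search engine on top of it. Lucene offers a simple yet powerful API for full-text indexing and searching. In the Java world it is a mature, free, open-source tool, and in recent years it has been the most popular free Java information retrieval library. An information retrieval library is related to a search engine, but the two should not be confused. (From Baidu Baike)

Baidu Baike entry: https://baike.baidu.com/item/Lucene

An introductory article on CSDN: https://blog.csdn.net/regan_hoo/article/details/78802897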

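I won't go into more detail about Lucene itself. Instead, here is a concrete scenario for learning how to use it:

I have a folder containing a number of txt files of varying sizes, with different names and different contents, and I want to implement the following feature:

For example, if I enter the word java, every file whose name, path, or content contains java should be returned.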
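To make the goal concrete before bringing in Lucene, here is a minimal pure-Java sketch of the underlying idea, an inverted index that maps each term to the files containing it. The class name, the splitting rule and the sample file names are all assumptions made for illustration; this is not how Lucene is implemented, only the shape of the problem it solves.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of an inverted index; this is not Lucene code.
class InvertedIndexSketch {
    // term -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split name, path and content on anything that is not a letter or digit,
    // lowercase each piece, and record which document it came from.
    public void add(String name, String path, String content) {
        String all = name + " " + path + " " + content;
        for (String term : all.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
            }
        }
    }

    // Look the term up directly instead of scanning every file again.
    public List<String> search(String word) {
        Set<String> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("java-notes.txt", "D:\\666\\java-notes.txt", "collections and streams");
        index.add("spring.txt", "D:\\666\\spring.txt", "Spring is a Java framework");
        index.add("todo.txt", "D:\\666\\todo.txt", "buy milk");
        System.out.println(index.search("java")); // [java-notes.txt, spring.txt]
    }
}
```

The work is done once, when a file is added; a search is then a single map lookup. Lucene follows the same division of labour between indexing and searching, with far more sophisticated analysis, storage and ranking.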

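1. Create a Java project and add the required jars (real projects often use an older version; the code below uses Lucene 4.10.3):

2. Step by step:

I put a number of text files in the folder D:\666.

The index is then created under D:\temp\index.

Creating the index: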

package lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class MyFirstLucene {
    // Build the index
    @Test
    public void testIndex() throws Exception {
        // Open (or create) the index directory; here it lives in a folder on drive D
        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));
        // Create the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Create the writer configuration
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        // Create the IndexWriter from the index directory and the configuration
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // List the source files; listFiles() returns null if the folder does not exist
        File f = new File("D:\\666");
        File[] list = f.listFiles();
        if (list == null) {
            list = new File[0];
        }
        for (File file : list) {
            // Skip subdirectories; only regular files are indexed
            if (!file.isFile()) {
                continue;
            }
            // Create one Document per file
            Document document = new Document();
            // File name (analyzed, indexed and stored)
            String file_name = file.getName();
            Field fileNameField = new TextField("fileName", file_name, Store.YES);
            // File size (indexed as a number and stored)
            long file_size = FileUtils.sizeOf(file);
            Field fileSizeField = new LongField("fileSize", file_size, Store.YES);
            // File path (stored only, so it is returned with hits but is not searchable)
            String file_path = file.getPath();
            Field filePathField = new StoredField("filePath", file_path);
            // File content (analyzed, indexed and stored)
            String file_content = FileUtils.readFileToString(file);
            Field fileContentField = new TextField("fileContent", file_content, Store.YES);

            // Add the fields to the document
            document.add(fileNameField);
            document.add(fileSizeField);
            document.add(filePathField);
            document.add(fileContentField);

            // Write the document to the index
            indexWriter.addDocument(document);
        }

        // Release resources
        indexWriter.close();

    }
}

It runs successfully. Opening D:\temp\index shows a set of files that are not human-readable, which means the index was created.

3. Searching the index:

    // Search the index. Add this method to MyFirstLucene; it also needs imports for
    // DirectoryReader, IndexReader and Term (org.apache.lucene.index) and for
    // IndexSearcher, Query, TermQuery, TopDocs and ScoreDoc (org.apache.lucene.search).
    @Test
    public void testSearch() throws Exception {
        // Step 1: create a Directory pointing at the location of the index.
        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));
        // Step 2: create an IndexReader on that Directory.
        IndexReader indexReader = DirectoryReader.open(directory);
        // Step 3: create an IndexSearcher on the IndexReader.
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Step 4: create a TermQuery, naming the field to search and the term to look for.
        Query query = new TermQuery(new Term("fileName", "spring"));
        // Step 5: run the query, returning at most 10 hits.
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Step 6: iterate over the hits and print the stored fields.
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            int doc = scoreDoc.doc;
            Document document = indexSearcher.doc(doc);
            // File name
            String fileName = document.get("fileName");
            System.out.println(fileName);
            // File content
            String fileContent = document.get("fileContent");
            System.out.println(fileContent);
            // File size
            String fileSize = document.get("fileSize");
            System.out.println(fileSize);
            // File path
            String filePath = document.get("filePath");
            System.out.println(filePath);
            System.out.println("------------");
        }
        // Step 7: close the IndexReader.
        indexReader.close();

    }

The basic functionality now works.
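One detail of the search above is easy to trip over: a TermQuery compares the term exactly as written and does not run it through the analyzer, whereas StandardAnalyzer lowercases text at index time. The following pure-Java sketch models only that asymmetry. The class and method names are hypothetical, and the "analysis" is reduced to whitespace splitting plus lowercasing.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the index-time / query-time asymmetry; not Lucene code.
class TermLookupSketch {
    private final Set<String> indexedTerms = new HashSet<>();

    // Index time: text goes through "analysis" (here just whitespace split + lowercase).
    public void index(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                indexedTerms.add(token.toLowerCase(Locale.ROOT));
            }
        }
    }

    // Query time, TermQuery style: the term is compared as given, with no analysis.
    public boolean rawTermMatches(String term) {
        return indexedTerms.contains(term);
    }

    // Query time, query-parser style: the query text is analyzed first, then compared.
    public boolean analyzedTermMatches(String term) {
        return indexedTerms.contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        TermLookupSketch sketch = new TermLookupSketch();
        sketch.index("Spring in Action");
        System.out.println(sketch.rawTermMatches("Spring"));      // false
        System.out.println(sketch.rawTermMatches("spring"));      // true
        System.out.println(sketch.analyzedTermMatches("Spring")); // true
    }
}
```

In practice this means the keyword handed to a TermQuery should already be in the form the analyzer produced at index time, which for StandardAnalyzer is lowercase.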

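However, there is one big problem:

Chinese text is not handled properly.

4. So the next step is to deal with Chinese by using a Chinese analyzer (the examples above use the standard analyzer, which is designed for English).

First, let's see what the standard analyzer does with Chinese text: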

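    // Inspect the tokens an analyzer produces. This needs imports for TokenStream
    // (org.apache.lucene.analysis) and for CharTermAttribute and OffsetAttribute
    // (org.apache.lucene.analysis.tokenattributes).
    @Test
    public void testTokenStream() throws Exception {
        // Create the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Obtain a TokenStream
        // First argument: the field name, which can be anything here
        // Second argument: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("test",
                "高富帥可以用二維表結構來邏輯表達實現的數據");
        // Attribute that exposes the text of each token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that exposes the start and end offsets of each token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // Reset the stream; this is required before the first incrementToken() call
        tokenStream.reset();
        // Walk through the tokens; incrementToken() returns false at the end of the stream
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            System.out.println("start->" + offsetAttribute.startOffset());
            // The token text
            System.out.println(charTermAttribute);
            // End offset of the token
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        // Signal the end of the stream, then release it
        tokenStream.end();
        tokenStream.close();
    }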

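Result:

Only part of the output is shown here. With the standard analyzer, every Chinese character becomes a separate token, which is clearly a problem.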
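To picture what "one token per character" looks like in the start/term/end output, here is a small pure-Java sketch that mimics it for a run of Han characters. It is not StandardAnalyzer: the class name and the "start:term:end" format are made up for illustration, and the sample string is written with Unicode escapes (二維表結構, "two-dimensional table structure") only so that the source compiles under any file encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch that mimics one-token-per-character output; not StandardAnalyzer itself.
class PerCharacterTokens {
    // Returns entries of the form "start:term:end", one per letter or digit code point.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int end = i + Character.charCount(codePoint);
            // Punctuation and whitespace are skipped; everything else is its own token.
            if (Character.isLetterOrDigit(codePoint)) {
                tokens.add(i + ":" + text.substring(i, end) + ":" + end);
            }
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "er wei biao jie gou": five characters, so five tokens with offsets 0-1, 1-2, ...
        for (String token : tokenize("\u4e8c\u7dad\u8868\u7d50\u69cb")) {
            System.out.println(token);
        }
    }
}
```

With tokens like these, a search for a whole word can only succeed by matching its characters one at a time, which is why a word-aware analyzer is the better fit for Chinese text.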

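So the standard analyzer is not suitable here.

Use the SmartChinese analyzer instead:

    // Inspect the tokens produced by SmartChineseAnalyzer (from the lucene-analyzers-smartcn module)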
    @Test
    public void testTokenStream() throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("test",
                "高富帥可以用二維表結構來邏輯表達實現的數據");
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println("start->" + offsetAttribute.startOffset());
            System.out.println(charTermAttribute);
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        tokenStream.close();
    }

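The result is better.

There is still one problem, though: it does not recognize newer words such as 高富帥 (internet slang, literally "tall, rich, handsome").

For better results, a third-party analyzer can be used.

For example, here I use the IK analyzer, which lets you add words such as “高富帥” yourself: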

    @Test
    public void testTokenStream() throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("test",
                "高富帥可以用二維表結構來邏輯表達實現的數據");
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println("start->" + offsetAttribute.startOffset());
            System.out.println(charTermAttribute);
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        tokenStream.close();
    }

Add the configuration files.

IKAnalyzer.cfg.xml：

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">ext.dic;</entry> 
    
    <!-- Configure your own extension stop-word dictionary here -->
    <entry key="ext_stopwords">stopword.dic;</entry> 
    
</properties>

ext.dic:

高富帥
二維表

stopword.dic:

我
是
用
的
二
維
表
來
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with


The result now:

The Chinese segmentation problem is solved, and the vocabulary can be extended.
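To see why an extension dictionary and a stop-word list change the output, here is a rough pure-Java sketch of dictionary-driven segmentation using forward maximum matching. This is a swapped-in textbook technique chosen for brevity; IK Analyzer's real algorithm differs, and the class name is hypothetical. The sample words are written as Unicode escapes so the source compiles under any file encoding: 高富帥 and 二維表 as dictionary entries, 用 as a stop word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical forward-maximum-matching segmenter; IK Analyzer's real algorithm differs.
class DictionarySegmenterSketch {
    private final Set<String> dictionary;
    private final Set<String> stopWords;
    private final int maxWordLength;

    public DictionarySegmenterSketch(Set<String> dictionary, Set<String> stopWords) {
        this.dictionary = dictionary;
        this.stopWords = stopWords;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Try the longest candidate first and shrink until the dictionary knows it;
            // if nothing matches, fall back to a single character.
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            String token = text.substring(pos, end);
            // Stop words are recognized but not emitted.
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Dictionary: "gao fu shuai" and "er wei biao"; stop word: "yong"
        Set<String> dict = new HashSet<>(Arrays.asList("\u9ad8\u5bcc\u5e25", "\u4e8c\u7dad\u8868"));
        Set<String> stop = new HashSet<>(Arrays.asList("\u7528"));
        String text = "\u9ad8\u5bcc\u5e25\u7528\u4e8c\u7dad\u8868";
        // With the dictionary: two tokens. Without it: six single-character tokens.
        System.out.println(new DictionarySegmenterSketch(dict, stop).segment(text).size());                  // 2
        System.out.println(new DictionarySegmenterSketch(new HashSet<String>(), stop).segment(text).size()); // 6
    }
}
```

Adding a word to the dictionary is what lets it survive as a single token, and adding a word to the stop list is what removes it from the output. The entries in ext.dic and stopword.dic play those two roles for IK.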

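So far the only index operation shown is adding documents; other operations are possible as well.

Queries can also be written in several ways. The code below shows examples: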

package lucene;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Index maintenance: add (done above), delete, update, query
 */
public class LuceneManager {
    public IndexWriter getIndexWriter() throws Exception {
        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
        return new IndexWriter(directory, config);
    }

    // Delete every document in the index
    @Test
    public void testAllDelete() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        indexWriter.deleteAll();
        indexWriter.close();
    }

    // Delete the documents that match a query
    @Test
    public void testDelete() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        Query query = new TermQuery(new Term("fileName", "apache"));
        indexWriter.deleteDocuments(query);
        indexWriter.close();
    }

    // Update: deletes every document containing the given term, then adds the new document
    @Test
    public void testUpdate() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        Document doc = new Document();
        doc.add(new TextField("fileN", "測試文件名", Store.YES));
        doc.add(new TextField("fileC", "測試文件內容", Store.YES));
        indexWriter.updateDocument(new Term("fileName", "lucene"), doc, new IKAnalyzer());
        indexWriter.close();
    }

    public IndexSearcher getIndexSearcher() throws Exception {
        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));
        IndexReader indexReader = DirectoryReader.open(directory);
        return new IndexSearcher(indexReader);
    }

    // Run a query and print the stored fields of the top 10 hits
    public void printResult(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            int doc = scoreDoc.doc;
            Document document = indexSearcher.doc(doc);
            // File name
            String fileName = document.get("fileName");
            System.out.println(fileName);
            // File content
            String fileContent = document.get("fileContent");
            System.out.println(fileContent);
            // File size
            String fileSize = document.get("fileSize");
            System.out.println(fileSize);
            // File path
            String filePath = document.get("filePath");
            System.out.println(filePath);
            System.out.println("------------");
        }
    }

    // Match all documents
    @Test
    public void testMatchAllDocsQuery() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        Query query = new MatchAllDocsQuery();
        System.out.println(query);
        printResult(indexSearcher, query);
        // Release resources
        indexSearcher.getIndexReader().close();
    }

    // Query by numeric range
    @Test
    public void testNumericRangeQuery() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        // File size between 100 and 200 bytes: 100 is excluded (false), 200 is included (true)
        Query query = NumericRangeQuery.newLongRange("fileSize", 100L, 200L, false, true);
        System.out.println(query);
        printResult(indexSearcher, query);
        // Release resources
        indexSearcher.getIndexReader().close();
    }

    // Combine several query conditions
    @Test
    public void testBooleanQuery() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        BooleanQuery booleanQuery = new BooleanQuery();
        Query query1 = new TermQuery(new Term("fileName", "apache"));
        Query query2 = new TermQuery(new Term("fileName", "lucene"));
        // Roughly like SQL: select * from user where id = ? or/and name = ?
        booleanQuery.add(query1, Occur.MUST);// required, like AND
        booleanQuery.add(query2, Occur.SHOULD);// optional, like OR; next to a MUST clause it only affects the score
        System.out.println(booleanQuery);
        printResult(indexSearcher, booleanQuery);
        // Release resources
        indexSearcher.getIndexReader().close();
    }

    // Build the query by parsing a query string (knowing either this approach or the query objects above is enough)
    @Test
    public void testQueryParser() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        QueryParser queryParser = new QueryParser("fileName", new IKAnalyzer());
        // The syntax is field:value; *:* matches every document
        Query query = queryParser.parse("fileName:lucene OR fileContent:apache");
        printResult(indexSearcher, query);
        // Release resources
        indexSearcher.getIndexReader().close();
    }

    // Build the query by parsing a query string, with several default fields
    @Test
    public void testMultiFieldQueryParser() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        String[] fields = { "fileName", "fileContent" };
        MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields, new IKAnalyzer());
        Query query = queryParser.parse("lucene");
        printResult(indexSearcher, query);
        // Release resources
        indexSearcher.getIndexReader().close();
    }

}
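The and/or comments in testBooleanQuery are a useful first approximation, but the way MUST and SHOULD combine is worth spelling out. Below is a minimal pure-Java model of the matching rule only. The class and its types are hypothetical, it ignores scoring entirely, and it assumes the default setting in which no minimum number of SHOULD clauses is configured.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of MUST / SHOULD / MUST_NOT matching; it ignores scoring and is not Lucene code.
class BooleanClauseSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // Decide whether a document, modelled as a set of terms, matches the clause list.
    public static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean anyShouldHit = false;
        for (Clause clause : clauses) {
            boolean hit = docTerms.contains(clause.term);
            if (clause.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return false; // a required clause is missing
                }
            } else if (clause.occur == Occur.MUST_NOT) {
                if (hit) {
                    return false; // a prohibited clause is present
                }
            } else {
                anyShouldHit |= hit;
            }
        }
        // With at least one MUST clause, SHOULD clauses are optional.
        // Without any MUST clause, at least one SHOULD clause has to hit.
        return hasMust || anyShouldHit;
    }

    public static void main(String[] args) {
        List<Clause> query = Arrays.asList(
                new Clause("apache", Occur.MUST),
                new Clause("lucene", Occur.SHOULD));
        System.out.println(matches(new HashSet<>(Arrays.asList("apache", "lucene")), query)); // true
        System.out.println(matches(new HashSet<>(Arrays.asList("apache")), query));           // true
        System.out.println(matches(new HashSet<>(Arrays.asList("lucene")), query));           // false
    }
}
```

So the query in testBooleanQuery returns every document whose file name contains apache, whether or not it also contains lucene; in Lucene the SHOULD clause then serves to rank documents that contain both terms higher.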
