Search Engine Series: Introduction to Lucene, Creating an Index, and Basic Searching

Reposted article, 2014-09-23 02:34. Tags: Lucene introduction / creating an index / learning Java / Lucene index / Lucene / search engine / basic searching

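I. What is Lucene?

Lucene was originally written by Doug Cutting, and its first version was released in March 2000. It is a full-text search engine framework that provides a complete query engine and a complete indexing engine. Lucene takes its name from the middle name of Doug's wife, which was also her maternal grandmother's first name. It is now a top-level project of the Apache Software Foundation and required knowledge for anyone getting started with search engines.

Lucene is a Java search library. It is not a complete solution on its own and needs additional development work.

Advantages: it is a mature solution with many success stories. As an Apache top-level project it keeps advancing quickly, backed by a large, active community and many developers. Because it is only a library, it leaves plenty of room for customization and optimization: with simple customization it covers the vast majority of common needs, and with optimization it can support search at the scale of 1 billion+.

Disadvantages: it requires additional development work. Extensions, distribution, reliability and so on all have to be implemented yourself. It is not real-time: there is a delay between building the index and being able to search it, and the scalability of the current "near real time" approach (Lucene Near Real Time search) still needs further improvement.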
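The delay between indexing and searching is easier to picture with a small model. The sketch below is plain Java and says nothing about Lucene's internals; the class `ToyCommitLog` and its methods are invented for illustration. It only shows the general idea that added documents become visible at a commit, and that a reader opened earlier keeps its point-in-time view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of "index now, searchable later"; not Lucene's implementation. */
class ToyCommitLog {
    // Documents that were added but are not yet visible to readers.
    private final List<String> pending = new ArrayList<String>();
    // The snapshot that newly opened readers will see.
    private List<String> committed = Collections.emptyList();

    void add(String doc) {
        pending.add(doc);
    }

    /** Publishes everything added so far as a new immutable snapshot. */
    void commit() {
        List<String> next = new ArrayList<String>(committed);
        next.addAll(pending);
        pending.clear();
        committed = Collections.unmodifiableList(next);
    }

    /** A reader is a point-in-time snapshot; later commits do not change it. */
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        ToyCommitLog log = new ToyCommitLog();
        log.add("doc-1");
        System.out.println(log.openReader().size()); // 0: added but not committed
        log.commit();
        List<String> reader = log.openReader();
        log.add("doc-2");
        log.commit();
        System.out.println(reader.size());           // 1: the old reader is unchanged
        System.out.println(log.openReader().size()); // 2: a new reader sees both
    }
}
```

In this model the "delay" is simply the time between `add` and the next `commit` plus reopening a reader, which is why shortening it is a design trade-off rather than a bug.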

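A full-text search system generally consists of the following three parts:

Indexing
Tokenization (analysis)
Searching

The following articles in this series cover these three parts in detail. This article briefly covers setting up a Lucene environment and the basics of Lucene indexing and searching.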
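To make the three parts concrete before turning to Lucene itself, here is a toy inverted index in plain Java. It is a sketch for illustration only, not how Lucene is implemented; the class `ToyInvertedIndex` and its method names are made up, and the tokenizer is a deliberately naive lowercase-and-split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy illustration of tokenizing, indexing and searching; not Lucene's implementation. */
class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    private final List<String> docs = new ArrayList<String>();

    /** Tokenization: lowercase, then split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexing: record, for every token, which document it occurs in. Returns the doc id. */
    int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String token : tokenize(text)) {
            TreeSet<Integer> ids = postings.get(token);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(token, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    /** Searching: tokenize the query the same way and look up its first term. */
    List<Integer> search(String query) {
        List<String> tokens = tokenize(query);
        if (tokens.isEmpty()) {
            return new ArrayList<Integer>();
        }
        TreeSet<Integer> ids = postings.get(tokens.get(0));
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Java is a programming language");
        index.add("Lucene is a search library written in Java");
        index.add("Full-text search needs an index");
        System.out.println(index.search("java"));   // [0, 1]
        System.out.println(index.search("SEARCH")); // [1, 2]
    }
}
```

Note that the query goes through the same `tokenize` step as the documents; using the same analysis on both sides is what lets "SEARCH" match "search".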

Products and projects currently built on Lucene include:

Solr, Nutch, HBase, Katta, Constellio, Summa, Compass, Bobo Search, IndexTank, Elasticsearch, Hadoop contrib/index, LinkedIn, Eclipse, Cocoon

II. Setting up the Lucene environment

At the time of writing (2014-09-22) the latest version of Lucene is 4.10.0. The official homepage is http://lucene.apache.org/

If you use Maven, setup is very simple: add the following to pom.xml:

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>4.10.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>4.10.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>4.10.0</version>
    </dependency>

If you do not use Maven, download the corresponding jar files by hand and add them to your project.

III. Creating the index

1. Create a Directory

2. Create an IndexWriter

3. Create a Document object

4. Add Fields to the Document

5. Add the Document to the index through the IndexWriter

package com.amos.lucene;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

/**
 * Created by amosli on 14-9-17.
 */
public class HelloLucene {
    static String indexDir = "/home/amosli/developtest/lucene";

    public void index() {
        IndexWriter indexWriter = null;
        FSDirectory directory = null;
        try {
            // 1. Create the Directory (the index is written to the local disk)
            directory = FSDirectory.open(new File(indexDir));
            // In-memory alternative: RAMDirectory directory = new RAMDirectory();

            // 2. Create the IndexWriter
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_0, new StandardAnalyzer(Version.LUCENE_4_10_0));
            indexWriter = new IndexWriter(directory, indexWriterConfig);

            // Use one FieldType per field. A Field keeps a reference to its FieldType,
            // so changing a shared instance afterwards would change both fields.
            // "name" is indexed and stored.
            FieldType nameType = new FieldType();
            nameType.setIndexed(true);
            nameType.setStored(true);
            // "path" is stored only; it can be read back but not searched.
            FieldType pathType = new FieldType();
            pathType.setIndexed(false);
            pathType.setStored(true);

            File file = new File("/home/amosli/developtest/testfile");
            for (File f : file.listFiles()) {
                // 3. Create the Document object
                Document document = new Document();

                // 4. Add Fields to the Document
                document.add(new TextField("content", new FileReader(f)));
                document.add(new Field("name", f.getName(), nameType));
                document.add(new Field("path", f.getAbsolutePath(), pathType));

                // 5. Add the Document to the index through the IndexWriter
                indexWriter.addDocument(document);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Always close the IndexWriter so that the index is written out
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

}

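Notes:

1. FSDirectory is used here so that the generated files are written to the local disk, which makes testing easier;

2. A Document corresponds to one record (row) in a database, and a Field corresponds to one column of a table;

3. The IndexWriter is what adds records (Documents) to the index;

4. A FieldType controls whether a field is indexed and whether it is stored;

5. Remember to close the IndexWriter.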
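The difference between "indexed" and "stored" in note 4 can be shown with a small model in plain Java. This is only a sketch of the concept, not Lucene's implementation: the class `ToyFieldStore`, its methods and the sample path `/tmp/notes.txt` are all hypothetical, and each value is treated as a single term to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "indexed" versus "stored"; not Lucene's implementation. */
class ToyFieldStore {
    // Searchable side: "field=value" -> ids of the documents that contain it.
    private final Map<String, List<Integer>> indexed = new HashMap<String, List<Integer>>();
    // Retrievable side: for each document, the field values kept verbatim.
    private final List<Map<String, String>> stored = new ArrayList<Map<String, String>>();

    /** Starts a new, empty document and returns its id. */
    int newDocument() {
        stored.add(new HashMap<String, String>());
        return stored.size() - 1;
    }

    /** Adds a field; the whole value is treated as one term to keep the sketch short. */
    void addField(int docId, String name, String value, boolean index, boolean store) {
        if (index) {
            String key = name + "=" + value;
            List<Integer> ids = indexed.get(key);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                indexed.put(key, ids);
            }
            ids.add(docId);
        }
        if (store) {
            stored.get(docId).put(name, value);
        }
    }

    /** Finds documents by an indexed field; stored-only fields never match. */
    List<Integer> search(String name, String value) {
        List<Integer> ids = indexed.get(name + "=" + value);
        return ids == null ? Collections.<Integer>emptyList() : ids;
    }

    /** Reads a stored value back; returns null for fields that were only indexed. */
    String get(int docId, String name) {
        return stored.get(docId).get(name);
    }

    public static void main(String[] args) {
        ToyFieldStore store = new ToyFieldStore();
        int doc = store.newDocument();
        store.addField(doc, "name", "notes.txt", true, true);        // indexed and stored
        store.addField(doc, "path", "/tmp/notes.txt", false, true);  // stored only
        System.out.println(store.search("name", "notes.txt"));       // [0]
        System.out.println(store.search("path", "/tmp/notes.txt"));  // []
        System.out.println(store.get(doc, "path"));                  // /tmp/notes.txt
    }
}
```

This mirrors the choices in the indexing code: the file name is both searchable and retrievable, while the path is only kept so it can be shown in the results.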

Figure: the generated index files.

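IV. Searching the index

1. Create a Directory

2. Create an IndexReader

3. Create an IndexSearcher from the IndexReader

4. Create the Query to search with

5. Run the search with the IndexSearcher, which returns TopDocs

6. Get the ScoreDoc objects from the TopDocs

7. Use the IndexSearcher and each ScoreDoc to get the actual Document object

8. Read the values you need from the Document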

    public void search() {
        IndexReader indexReader = null;
        try {
            // 1. Create the Directory
            FSDirectory directory = FSDirectory.open(new File(indexDir));

            // 2. Create the IndexReader
            indexReader = DirectoryReader.open(directory);

            // 3. Create the IndexSearcher from the IndexReader
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            // 4. Create the Query to search with
            // The QueryParser decides what to search for; its first argument is the default field to search
            QueryParser queryParser = new QueryParser("content", new StandardAnalyzer());
            // Build a query for documents whose "content" field contains "java"
            Query query = queryParser.parse("java");
            // 5. Run the search with the IndexSearcher, returning at most 100 hits as TopDocs
            TopDocs topDocs = indexSearcher.search(query, 100);
            // 6. Get the ScoreDoc objects from the TopDocs
            ScoreDoc[] sds = topDocs.scoreDocs;
            // 7. Use the IndexSearcher and each ScoreDoc to get the actual Document
            for (ScoreDoc sdc : sds) {
                Document doc = indexSearcher.doc(sdc.doc);
                // 8. Read the values you need from the Document
                System.out.println("name:" + doc.get("name") + "----->  path:" + doc.get("path"));
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } finally {
            if (indexReader != null) {
                try {
                    indexReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

Figure: the output of the search.
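Steps 5 to 8 follow a common pattern: the search returns only document ids with scores, best first and capped at N, and the documents themselves are fetched afterwards by id. The plain-Java sketch below illustrates that pattern under simplifying assumptions; it is not Lucene's scoring or API. The class `ToyTopDocs`, the nested `Hit` type and the "count the occurrences" score are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy top-N search; the scoring here is a simple term count, not Lucene's. */
class ToyTopDocs {
    /** One hit: a document id plus its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Scores each document by how often the term occurs and returns the best n hits. */
    static List<Hit> search(List<String> docs, String term, int n) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<Hit>();
        for (int id = 0; id < docs.size(); id++) {
            int count = 0;
            for (String token : docs.get(id).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.equals(wanted)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(id, count));
            }
        }
        // Highest score first; ties are broken by the lower document id.
        Collections.sort(hits, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
            }
        });
        return new ArrayList<Hit>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<String>();
        docs.add("java java java");
        docs.add("python only");
        docs.add("Java and more");
        docs.add("java java");
        for (Hit hit : search(docs, "java", 2)) {
            // The hit carries only an id; the document text is looked up separately.
            System.out.println(hit.doc + " (" + hit.score + "): " + docs.get(hit.doc));
        }
    }
}
```

Returning ids and scores first keeps the search cheap, since only the few documents that are actually displayed need to be loaded.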

Full source: HelloLucene.java

package com.amos.lucene;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

/**
 * Created by amosli on 14-9-17.
 */
public class HelloLucene {
    static String indexDir = "/home/amosli/developtest/lucene";

    public void index() {
        IndexWriter indexWriter = null;
        FSDirectory directory = null;
        try {
            // 1. Create the Directory (the index is written to the local disk)
            directory = FSDirectory.open(new File(indexDir));
            // In-memory alternative: RAMDirectory directory = new RAMDirectory();

            // 2. Create the IndexWriter
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_0, new StandardAnalyzer(Version.LUCENE_4_10_0));
            indexWriter = new IndexWriter(directory, indexWriterConfig);

            // Use one FieldType per field. A Field keeps a reference to its FieldType,
            // so changing a shared instance afterwards would change both fields.
            // "name" is indexed and stored.
            FieldType nameType = new FieldType();
            nameType.setIndexed(true);
            nameType.setStored(true);
            // "path" is stored only; it can be read back but not searched.
            FieldType pathType = new FieldType();
            pathType.setIndexed(false);
            pathType.setStored(true);

            File file = new File("/home/amosli/developtest/testfile");
            for (File f : file.listFiles()) {
                // 3. Create the Document object
                Document document = new Document();

                // 4. Add Fields to the Document
                document.add(new TextField("content", new FileReader(f)));
                document.add(new Field("name", f.getName(), nameType));
                document.add(new Field("path", f.getAbsolutePath(), pathType));

                // 5. Add the Document to the index through the IndexWriter
                indexWriter.addDocument(document);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Always close the IndexWriter so that the index is written out
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public void search() {
        IndexReader indexReader = null;
        try {
            // 1. Create the Directory
            FSDirectory directory = FSDirectory.open(new File(indexDir));

            // 2. Create the IndexReader
            indexReader = DirectoryReader.open(directory);

            // 3. Create the IndexSearcher from the IndexReader
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            // 4. Create the Query to search with
            // The QueryParser decides what to search for; its first argument is the default field to search
            QueryParser queryParser = new QueryParser("content", new StandardAnalyzer());
            // Build a query for documents whose "content" field contains "java"
            Query query = queryParser.parse("java");
            // 5. Run the search with the IndexSearcher, returning at most 100 hits as TopDocs
            TopDocs topDocs = indexSearcher.search(query, 100);
            // 6. Get the ScoreDoc objects from the TopDocs
            ScoreDoc[] sds = topDocs.scoreDocs;
            // 7. Use the IndexSearcher and each ScoreDoc to get the actual Document
            for (ScoreDoc sdc : sds) {
                Document doc = indexSearcher.doc(sdc.doc);
                // 8. Read the values you need from the Document
                System.out.println("name:" + doc.get("name") + "----->  path:" + doc.get("path"));
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } finally {
            if (indexReader != null) {
                try {
                    indexReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

}

TestHelloLucene.java

package com.amos.lucene;

import org.junit.Test;

/**
 * Created by amosli on 14-9-17.
 */
public class TestHelloLucene {
    @Test
    public void testIndex(){
        HelloLucene helloLucene = new HelloLucene();
        helloLucene.index();
    }
    @Test
    public void testSearch(){
        HelloLucene helloLucene = new HelloLucene();
        helloLucene.search();
    }
}
