Lucene.net (4.8.0) Study Notes, Part 1: How the Analyzer Is Constructed and Its Internal Member ReuseStrategy

Reposted article. 2017-12-13 23:15. Tags: Analyzer / Lucene / PanGu

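Preface: I am currently working on full-text search built on Lucene.net and the PanGu tokenizer, although what I am really doing is migrating a project that someone else wrote. The whole project has to move to ASP.NET Core 2.0, but it uses Lucene 3.6.0, and its PanGu tokenizer targets Lucene 3.6.0 as well. Fortunately Lucene.net already has a release that runs on Core 2.0, the 4.8.0 beta. As for PanGu, someone is porting it and the port looks finished, just not yet tested. I will mark the changes introduced by the Lucene upgrade in bold.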

Lucene.net 4.8.0

https://github.com/apache/lucenenet

PanGu tokenizer (ready to use)

https://github.com/SilentCC/Lucene.Net.Analysis.PanGu

JIEba tokenizer (ready to use)

https://github.com/SilentCC/JIEba-netcore2.0

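Lucene.net 4.8.0 differs quite a lot from the earlier Lucene.net 3.6.0. Here I record the problems I ran into during development, in the hope of helping others who need to upgrade Lucene.net as I did. This is also my first contact with Lucene, so I hope these notes help other beginners too.

I. The Lucene tokenizer: Analyzer

Here is a brief description of Lucene's Analyzer; I will write more detailed notes on it later. An Analyzer in Lucene is a tokenizer: it takes text (both the documents being written to the index and the query conditions), runs tokenization on it, and produces a series of tokens. Third-party tokenizers such as PanGu all derive from Analyzer, inheriting the related classes and overriding the related methods. So how does an Analyzer take part in search?

1. When writing the index:

We need an IndexWriter. A side note on constructing it: the Lucene 3.6.0 constructors have been dropped, and the new constructor depends on an IndexWriterConfig, which holds the IndexWriter's properties and settings (not covered in detail here). The IndexWriterConfig constructor in turn requires an Analyzer.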

IndexWriterConfig(LuceneVersion matchVersion, Analyzer analyzer)

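So an Analyzer is used when writing the index. What gets written has the following structure: the index stores Document objects, and each Document holds many Fields (name, value). You can think of a Document as a record in a database table and a Field as one of its columns. Take an article that we want to store in the index so that people can search for it later.

An article has several attributes: Title: xxx; Author: xxxx; Content: xxxx.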

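// In 4.8.0 a Field needs a concrete type; TextField values are tokenized by the Analyzer
document.Add(new TextField("Title", "Lucene", Field.Store.YES));
document.Add(new TextField("Author", "dacc", Field.Store.YES));
document.Add(new TextField("Content", "xxxxxx", Field.Store.YES));
indexWriter.AddDocument(document);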

That is roughly the process. The Analyzer's job here is to tokenize each Field's value, breaking long content into individual tokens.
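The write path described above can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses StandardAnalyzer and an in-memory RAMDirectory in place of PanGu and a real index folder, it assumes the Lucene.Net and Lucene.Net.Analysis.Common 4.8.0 packages are referenced, and IndexingSketch and WriteArticle are made-up names.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical helper class for illustration; not part of Lucene.NET or PanGu.
public static class IndexingSketch
{
    // Writes one article into the directory and returns how many documents the index holds.
    public static int WriteArticle(Directory directory, Analyzer analyzer)
    {
        // In 4.8.0 the Analyzer goes into IndexWriterConfig rather than into IndexWriter directly.
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

        using (var writer = new IndexWriter(directory, config))
        {
            var document = new Document();
            // TextField values are run through the Analyzer and split into tokens.
            document.Add(new TextField("Title", "Lucene", Field.Store.YES));
            document.Add(new TextField("Author", "dacc", Field.Store.YES));
            document.Add(new TextField("Content", "full text search and tokenization", Field.Store.YES));
            writer.AddDocument(document);
            writer.Commit();
        }

        // Reopen the index read-only to confirm the document was stored.
        using (var reader = DirectoryReader.Open(directory))
        {
            return reader.NumDocs;
        }
    }

    public static void Main()
    {
        using (var directory = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            Console.WriteLine(WriteArticle(directory, analyzer));
        }
    }
}
```

Swapping in a PanGuAnalyzer instance for the StandardAnalyzer should be the only change needed for Chinese text, assuming the PanGu port builds against the same Lucene.net version.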

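2. When querying:

An Analyzer is needed here too, although it is not strictly required, unlike with IndexWriter. Its job is to tokenize the query text. For example, if the query is "全文檢索和分詞" ("full-text search and tokenization"), the Analyzer first splits it into "全文檢索" and "分詞". The index is then searched for Fields containing those tokens, and the Documents those Fields belong to are returned. I will not go into the details of searching here; they will get their own notes later.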
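To make the query side concrete, here is a small sketch of the two cases mentioned above: a query string that is tokenized by an Analyzer, and a query built by hand that needs no Analyzer at all. It assumes the Lucene.Net.QueryParser package in addition to the core and analysis packages, uses StandardAnalyzer as a stand-in for PanGu, and QueryParsingSketch and its methods are hypothetical names.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical helper class for illustration.
public static class QueryParsingSketch
{
    // The parser hands the user's text to the Analyzer and builds a Query from the tokens.
    public static Query ParseContentQuery(string userInput)
    {
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            var parser = new QueryParser(LuceneVersion.LUCENE_48, "Content", analyzer);
            return parser.Parse(userInput);
        }
    }

    // A TermQuery is matched against the index as-is, so no Analyzer is involved.
    public static Query ExactAuthorQuery(string author)
    {
        return new TermQuery(new Term("Author", author));
    }

    public static void Main()
    {
        // Printing a Query shows the field and the terms it will look up.
        Console.WriteLine(ParseContentQuery("full text search"));
        Console.WriteLine(ExactAuthorQuery("dacc"));
    }
}
```

Whichever Analyzer is used here should normally be the same one used at index time, otherwise the query tokens may not line up with the tokens stored in the index.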

II. The problem

With a rough understanding of Analyzer in place, here is the problem I ran into:

1. After calling the Analyzer's GetTokenStream, this was thrown:

Object reference not set to an instance of an object

The exception means that a null reference was dereferenced. So I went digging in the source and found:

  public TokenStream GetTokenStream(string fieldName, TextReader reader)
        {
            TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);
            TextReader r = InitReader(fieldName, reader);
            if (components == null)
            {
                components = CreateComponents(fieldName, r);
                reuseStrategy.SetReusableComponents(this, fieldName, components);
            }
            else
            {
                components.SetReader(r);
            }
            return components.TokenStream;
        }

The error was thrown on this statement:

    TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);

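reuseStrategy was null, so the statement failed. This is a good point to look inside Analyzer. GetTokenStream returns the Analyzer's TokenStream, which is a sequence of tokens. I will leave the details of TokenStream aside for now because they would take a lot of space. The key to obtaining the TokenStream is reuseStrategy. In the newer Lucene versions an Analyzer's TokenStream can be reused: calls made to an Analyzer instance from the same thread share one set of TokenStreamComponents instead of building a new one each time.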

 internal DisposableThreadLocal<object> storedValue = new DisposableThreadLocal<object>();

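The storedValue member of Analyzer is a thread-local slot, and it is where the TokenStream is kept. reuseStrategy, which also did not exist in Lucene 3.6.0, is what reads and writes storedValue so that the stored components can be shared across calls. The ReuseStrategy class has the member functions GetReusableComponents and SetReusableComponents, which get and set the TokenStream and Tokenizer.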
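The effect of a thread-local slot can be shown without Lucene at all. The sketch below is a simplified model, not the Lucene.NET implementation: MiniAnalyzer and Components are invented stand-ins for Analyzer and TokenStreamComponents, and System.Threading.ThreadLocal plays the role of DisposableThreadLocal.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for TokenStreamComponents.
public sealed class Components
{
    public string Source { get; set; }
}

// Hypothetical stand-in for Analyzer with a global-style reuse strategy.
public sealed class MiniAnalyzer
{
    // One slot per thread, owned by this analyzer instance.
    private readonly ThreadLocal<Components> stored = new ThreadLocal<Components>();
    private int created;

    public int Created => Volatile.Read(ref created);

    public Components GetComponents(string text)
    {
        Components components = stored.Value; // null on a thread's first call
        if (components == null)
        {
            components = new Components();
            Interlocked.Increment(ref created);
            stored.Value = components;        // remember them for later calls on this thread
        }
        components.Source = text;             // plays the role of SetReader
        return components;
    }
}

public static class ReuseSketch
{
    public static int CountCreations()
    {
        var analyzer = new MiniAnalyzer();
        analyzer.GetComponents("first");
        analyzer.GetComponents("second");     // same thread: the stored object is reused

        var worker = new Thread(() => analyzer.GetComponents("third"));
        worker.Start();                       // different thread: its slot starts out empty
        worker.Join();

        return analyzer.Created;
    }

    public static void Main()
    {
        Console.WriteLine(CountCreations());  // prints 2
    }
}
```

Three calls produce only two Components objects: the two calls on the main thread share one, and the worker thread gets its own.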

Below is the source of ReuseStrategy, an abstract class nested inside Analyzer:

 public abstract class ReuseStrategy
    {
        /// <summary>
        /// Gets the reusable <see cref="TokenStreamComponents"/> for the field with the given name.
        /// </summary>
        /// <param name="analyzer"> <see cref="Analyzer"/> from which to get the reused components. Use
        ///        <see cref="GetStoredValue(Analyzer)"/> and <see cref="SetStoredValue(Analyzer, object)"/>
        ///        to access the data on the <see cref="Analyzer"/>. </param>
        /// <param name="fieldName"> Name of the field whose reusable <see cref="TokenStreamComponents"/>
        ///        are to be retrieved </param>
        /// <returns> Reusable <see cref="TokenStreamComponents"/> for the field, or <c>null</c>
        ///         if there was no previous components for the field </returns>
        public abstract TokenStreamComponents GetReusableComponents(Analyzer analyzer, string fieldName);

        /// <summary>
        /// Stores the given <see cref="TokenStreamComponents"/> as the reusable components for the
        /// field with the give name.
        /// </summary>
        /// <param name="analyzer"> Analyzer </param>
        /// <param name="fieldName"> Name of the field whose <see cref="TokenStreamComponents"/> are being set </param>
        /// <param name="components"> <see cref="TokenStreamComponents"/> which are to be reused for the field </param>
        public abstract void SetReusableComponents(Analyzer analyzer, string fieldName, TokenStreamComponents components);

        /// <summary>
        /// Returns the currently stored value.
        /// </summary>
        /// <returns> Currently stored value or <c>null</c> if no value is stored </returns>
        /// <exception cref="ObjectDisposedException"> if the <see cref="Analyzer"/> is closed. </exception>
        protected internal object GetStoredValue(Analyzer analyzer)
        {
            if (analyzer.storedValue == null)
            {
                throw new ObjectDisposedException(this.GetType().GetTypeInfo().FullName, "this Analyzer is closed");
            }
            return analyzer.storedValue.Get();
        }

        /// <summary>
        /// Sets the stored value.
        /// </summary>
        /// <param name="analyzer"> Analyzer </param>
        /// <param name="storedValue"> Value to store </param>
        /// <exception cref="ObjectDisposedException"> if the <see cref="Analyzer"/> is closed. </exception>
        protected internal void SetStoredValue(Analyzer analyzer, object storedValue)
        {
            if (analyzer.storedValue == null)
            {
                throw new ObjectDisposedException("this Analyzer is closed");
            }
            analyzer.storedValue.Set(storedValue);
        }
    }

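Another inner class of Analyzer, GlobalReuseStrategy, derives from the abstract ReuseStrategy. Together the two classes implement getting and setting the TokenStreamComponents held by an Analyzer, which makes the flow of GetTokenStream clear.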

    public sealed class GlobalReuseStrategy : ReuseStrategy
        {
            /// <summary>
            /// Sole constructor. (For invocation by subclass constructors, typically implicit.) </summary>
            [Obsolete("Don't create instances of this class, use Analyzer.GLOBAL_REUSE_STRATEGY")]
            public GlobalReuseStrategy()
            { }


            public override TokenStreamComponents GetReusableComponents(Analyzer analyzer, string fieldName)
            {
                return (TokenStreamComponents)GetStoredValue(analyzer);
            }


            public override void SetReusableComponents(Analyzer analyzer, string fieldName, TokenStreamComponents components)
            {
                SetStoredValue(analyzer, components);
            }
        }

Here is Analyzer's GetTokenStream again, this time with comments on how it gets and sets the reusable components:

        public TokenStream GetTokenStream(string fieldName, TextReader reader)
        {
            // First try to get the TokenStreamComponents stored by a previous call
            TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);
            TextReader r = InitReader(fieldName, reader);
            // If there are none, create them
            if (components == null)
            {
                components = CreateComponents(fieldName, r);
                // and store them as the reusable components for the next call
                reuseStrategy.SetReusableComponents(this, fieldName, components);
            }
            else
            {
                // If components were created earlier, reuse them and just set the new reader
                components.SetReader(r);
            }
            // Return the TokenStream
            return components.TokenStream;
        }
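Because the components are reused, the caller has to finish with one TokenStream before asking the same Analyzer for the next one. The sketch below shows the usual consume cycle (Reset, IncrementToken, End, Dispose) as I understand it from the Lucene.NET 4.8.0 API. StandardAnalyzer stands in for PanGu, and TokenStreamSketch and Tokenize are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical helper class for illustration.
public static class TokenStreamSketch
{
    // Runs the analyzer over the text and collects the token strings.
    public static List<string> Tokenize(Analyzer analyzer, string fieldName, string text)
    {
        var tokens = new List<string>();

        // Disposing the stream releases it so the next GetTokenStream call can reuse the components.
        using (TokenStream stream = analyzer.GetTokenStream(fieldName, new StringReader(text)))
        {
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();                   // must be called before the first IncrementToken
            while (stream.IncrementToken())
            {
                tokens.Add(term.ToString());
            }
            stream.End();                     // signals that the end of the input was reached
        }

        return tokens;
    }

    public static void Main()
    {
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            // The second call goes through the else branch of GetTokenStream (SetReader).
            Console.WriteLine(string.Join("|", Tokenize(analyzer, "Content", "Hello Lucene")));
            Console.WriteLine(string.Join("|", Tokenize(analyzer, "Content", "Reused Components")));
        }
    }
}
```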

So when we create an Analyzer, there is a constructor that sets its ReuseStrategy:

  public Analyzer(ReuseStrategy reuseStrategy)
        {
            this.reuseStrategy = reuseStrategy;
        }

I then found that the constructor used in the PanGu tokenizer does not pass a ReuseStrategy, so we have to supply a ReuseStrategy instance ourselves.

The PanGu tokenizer's constructors:

 public PanGuAnalyzer(bool originalResult)
          : this(originalResult, null, null)
        {
        }

        public PanGuAnalyzer(MatchOptions options, MatchParameter parameters)
            : this(false, options, parameters)
        {
        }

      
        public PanGuAnalyzer(bool originalResult, MatchOptions options, MatchParameter parameters)
            : base()
        {
            this.Initialize(originalResult, options, parameters);
        }

       
       
        public PanGuAnalyzer(bool originalResult, MatchOptions options, MatchParameter parameters, ReuseStrategy reuseStrategy)
            : base(reuseStrategy)
        {
            this.Initialize(originalResult, options, parameters);
        }

        protected virtual void Initialize(bool originalResult, MatchOptions options, MatchParameter parameters)
        {
            _originalResult = originalResult;
            _options = options;
            _parameters = parameters;
        }

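I was calling the second constructor, and the ReuseStrategy that ended up being passed in was null, so an instance has to be supplied. In fact Analyzer already provides one: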

public static readonly ReuseStrategy GLOBAL_REUSE_STRATEGY = new GlobalReuseStrategy();

So a small change to the PanGu constructor is enough:

        public PanGuAnalyzer(MatchOptions options, MatchParameter parameters)
            : this(false, options, parameters, Lucene.Net.Analysis.Analyzer.GLOBAL_REUSE_STRATEGY)
        {
        }
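The same pattern applies to any Analyzer subclass, not only PanGu. As a rough sketch, the hypothetical analyzer below (ExplicitReuseAnalyzer is an invented name, built from the WhitespaceTokenizer and LowerCaseFilter in Lucene.Net.Analysis.Common) passes GLOBAL_REUSE_STRATEGY to the base constructor explicitly, so reuseStrategy can never be left null. It assumes CreateComponents is overridden as protected from outside the Lucene.Net assembly, which is how I understand the 4.8.0 API.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical analyzer for illustration; not part of PanGu or Lucene.NET.
public sealed class ExplicitReuseAnalyzer : Analyzer
{
    public ExplicitReuseAnalyzer()
        : base(Analyzer.GLOBAL_REUSE_STRATEGY) // never hand the base class a null strategy
    {
    }

    // Called by GetTokenStream only when no reusable components are stored yet.
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream filtered = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);
        return new TokenStreamComponents(tokenizer, filtered);
    }
}

public static class ExplicitReuseSketch
{
    // Counts the tokens produced for the text.
    public static int CountTokens(string text)
    {
        using (var analyzer = new ExplicitReuseAnalyzer())
        using (TokenStream stream = analyzer.GetTokenStream("Content", new StringReader(text)))
        {
            int count = 0;
            stream.Reset();
            while (stream.IncrementToken())
            {
                count++;
            }
            stream.End();
            return count;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountTokens("full text search"));
    }
}
```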
