Lucene Study Notes, Part 2: Indexing Numeric Types and Analyzing Range Queries

Reposted article, originally published 2014-12-10 18:20. Tags: algorithms / LongField / NumericRangeQuery / NumericField / Lucene trie / Lucene / Lucene source code study

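Lucene uses a character (term) based index structure. To index and store a numeric value, it must ultimately be converted to character form first.

Early versions of Lucene had no wrapper classes for numeric types. The number had to be converted directly to a string and then added to a Field.

Java code: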

Document doc = new Document();
long i = 123456L;

// Old-style Field API: index the number as a single, untokenized string term
doc.add(new Field("id", String.valueOf(i), Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

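Converting the number directly like this causes a problem for range queries.

Suppose the three numbers 123456, 123 and 222 have been stored this way. Lucene's index structure is a character-based skip list, so the terms end up ordered lexicographically in the index: 123, 123456, 222.

An early-style range search with TermRangeQuery, such as the one below, therefore returns all three of 123, 123456 and 222:

TermRangeQuery tQuery = new TermRangeQuery("id", "123", "222", true, true); // search [123,222]

The usual fix is to use a fixed width and exploit string ordering by left-padding the shorter numbers with zeros.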

The numbers are stored as 000000123, 000123456 and 000000222 respectively, so the index order becomes 000000123, 000000222, 000123456.

The query bounds must be converted in the same way.

TermRangeQuery tQuery = new TermRangeQuery("id", "000000123", "000000222", true, true);

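This approach has two performance problems:

1: If the range is expanded into individual terms, such as 000000123, 000000124, 000000125 ... 000000222, and each term is queried separately before the result sets are merged, far too many queries are issued.

2: If the index is scanned from the start position 000000123 until 000000222 is reached, far too many terms are visited.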
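
The ordering problem and the zero-padding fix can be reproduced with plain Java strings. This is only a sketch: the class name ZeroPadDemo, its helper methods and the fixed width of 9 digits are my own choices for illustration, and inRange merely mimics an inclusive string range check rather than calling Lucene.

```java
import java.util.Arrays;
import java.util.Locale;

class ZeroPadDemo {
    // Left-pad a non-negative number with zeros to a fixed width of 9 digits (assumed width).
    static String pad(long value) {
        return String.format(Locale.ROOT, "%09d", value);
    }

    // Convert the values to terms (padded or not) and sort them the way strings sort.
    static String[] sortedTerms(boolean padded, long... values) {
        String[] terms = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            terms[i] = padded ? pad(values[i]) : String.valueOf(values[i]);
        }
        Arrays.sort(terms); // String.compareTo gives lexicographic order
        return terms;
    }

    // Inclusive range check on strings, as a term-based range query would see it.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: 123456 sorts between 123 and 222, so it wrongly matches [123,222].
        System.out.println(Arrays.toString(sortedTerms(false, 123456L, 123L, 222L)));
        System.out.println(inRange("123456", "123", "222"));
        // Padded: string order now agrees with numeric order.
        System.out.println(Arrays.toString(sortedTerms(true, 123456L, 123L, 222L)));
        System.out.println(inRange(pad(123456L), pad(123L), pad(222L)));
    }
}
```

Padding fixes correctness only; the two performance problems listed above remain, which is what the numeric field types are meant to address.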

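Support for numeric types only arrived in later versions: a Field is instantiated with NumericField, and NumericRangeQuery provides an optimized interval query for numeric types.

The latest versions (4.0 and above) provide finer-grained numeric field types such as IntField, LongField, FloatField and DoubleField.

Indexing code: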

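Document doc = new Document();

// h is the application object being indexed (defined elsewhere)
LongField idField = new LongField("id", h.getId(), Field.Store.YES);

// add the numeric field to the document
doc.add(idField);

writer.addDocument(doc);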

Query code (the field name must match the one used at index time, "id"):

  Query q = NumericRangeQuery.newLongRange("id", 10L, 1000L, true, true);

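When a numeric value is indexed, it is converted into several lexicographically sortable strings, which are then indexed as a trie (prefix tree) structure.

For example, suppose num1 is split into a, ab, abc and num2 is split into a, ab, abd.

[Figure 1: trie in which the shared prefixes a and ab branch into abc (num1) and abd (num2)]

Searching for ab finds both num1 and num2, because both carry the prefix ab. In a range search, looking up a prefix shared by the values inside the range returns several docs in a single lookup, which reduces the number of lookups.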
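
The num1/num2 example can be mimicked with a tiny in-memory map from prefix term to documents. This is a toy sketch, not Lucene's data structure: the class PrefixIndexDemo and its methods add and lookup are hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixIndexDemo {
    // Sorted map from prefix term to the documents that contain it (a toy inverted index).
    private final Map<String, List<String>> postings = new TreeMap<>();

    // Index one document under every prefix term it was split into.
    void add(String doc, String... prefixTerms) {
        for (String term : prefixTerms) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
        }
    }

    // A single term lookup returns every document sharing that prefix.
    List<String> lookup(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    public static void main(String[] args) {
        PrefixIndexDemo index = new PrefixIndexDemo();
        index.add("num1", "a", "ab", "abc");
        index.add("num2", "a", "ab", "abd");
        System.out.println(index.lookup("ab"));  // both documents, found with one lookup
        System.out.println(index.lookup("abc")); // only num1
    }
}
```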

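The rest of this post explains how numeric indexing and range queries work.

1: Binary representation of numbers

Take long as an example: one sign bit plus 63 value bits. A sign bit of 0 means positive, 1 means negative.

For positive numbers, the larger the low 63 bits, the larger the number. Because of two's complement, the same holds for negative numbers: the larger the low 63 bits, the larger the number.

If the sign bit is flipped, Long.MIN_VALUE to Long.MAX_VALUE can be written as 0x0000,0000,0000,0000 to 0xFFFF,FFFF,FFFF,FFFF.

After this conversion, aren't the values already sorted from smallest to largest at the character level?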
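
A quick way to convince yourself is to flip the sign bit and compare the resulting hex strings. The sketch below assumes nothing about Lucene; SignFlipDemo and sortableHex are names made up for this example.

```java
import java.util.Locale;

class SignFlipDemo {
    // Flip the sign bit and print all 64 bits as a fixed-width, 16-digit hex string.
    static String sortableHex(long value) {
        long flipped = value ^ 0x8000000000000000L;
        return String.format(Locale.ROOT, "%016X", flipped);
    }

    public static void main(String[] args) {
        long[] ascending = {Long.MIN_VALUE, -1L, 0L, 1L, Long.MAX_VALUE};
        for (long v : ascending) {
            // The printed strings are in ascending lexicographic order as well.
            System.out.println(sortableHex(v) + "  <-  " + v);
        }
    }
}
```

Without the flip, negative values would start with hex digits 8 to F and sort after all the positive ones.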

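2: How prefixes are split

Take 0x0000,0000,0000,F234 as an example and shift it right by 4 bits at a time.

1: 0x0000,0000,0000,F23 is the common prefix of every value in the range 0x0000,0000,0000,F230 to 0x0000,0000,0000,F23F

2: 0x0000,0000,0000,F2 is the common prefix of every value in the range 0x0000,0000,0000,F200 to 0x0000,0000,0000,F2FF

3: 0x0000,0000,0000,F is the common prefix of every value in the range 0x0000,0000,0000,F000 to 0x0000,0000,0000,FFFF

....

0x0

If the value obtained after shifting right by some number of bits is used as a key, that key stands for the corresponding range. The key can be thought of as a prefix of the value.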
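
The range covered by a shifted key can be computed directly from the shift amount. This is a small sketch with made-up names (PrefixRangeDemo, rangeStart, rangeEnd), using the same 0xF234 example.

```java
class PrefixRangeDemo {
    // Smallest value that shares the prefix (value >>> shift): the shifted-out bits are all 0.
    static long rangeStart(long value, int shift) {
        return (value >>> shift) << shift;
    }

    // Largest value that shares the prefix: the shifted-out bits are all 1.
    static long rangeEnd(long value, int shift) {
        return rangeStart(value, shift) | ((1L << shift) - 1L);
    }

    public static void main(String[] args) {
        long value = 0xF234L;
        for (int shift = 4; shift <= 12; shift += 4) {
            System.out.println(String.format("key 0x%X covers 0x%X .. 0x%X",
                    value >>> shift, rangeStart(value, shift), rangeEnd(value, shift)));
        }
    }
}
```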

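3: Splitting a large range into small ranges

At query time, Lucene splits a large range into small ranges and looks up each small range by its prefix, which reduces the number of lookups.

4: How numeric indexing is implemented

First a precisionStep is chosen (default 4 in the releases described here). On the n-th pass the value is shifted right by (n-1) * precisionStep bits.

After each shift, the remaining bits are packed 7 bits per byte, most significant bits first, into a byte[],

and a special byte identifying the shift amount is placed at index 0 of the array.

Each byte[] can be converted into a lexicographically sortable string.

Sorting these strings in dictionary order gives the same order as sorting by shift amount and then by numeric value. This is the key to range search with NumericRangeQuery!

A long has 64 bits, so with precisionStep=4 there are 16 lexicographically sortable strings per value.

In other words, 16 prefixes correspond to one long value. They are then stored in Lucene's inverted index, which yields an index structure like the one in Figure 1.

The key code for the split:

The longToPrefixCodedBytes() method of the org.apache.lucene.util.NumericUtils class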

public static void longToPrefixCodedBytes(final long val, final int shift, final BytesRefBuilder bytes) {
  if ((shift & ~0x3f) != 0)  // ensure shift is 0..63
    throw new IllegalArgumentException("Illegal shift value, must be 0..63");
  // compute the size of the byte[]: 7 bits are stored per byte
  int nChars = (((63-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
  // index 0 additionally holds the shift, hence the +1
  bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
  bytes.grow(BUF_SIZE_LONG);
  // marker byte that identifies the shift
  bytes.setByteAt(0, (byte)(SHIFT_START_LONG + shift));
  // flip the sign bit
  long sortableBits = val ^ 0x8000000000000000L;
  // shift right by `shift` bits; shift is 0 on the first call and then grows by precisionStep
  sortableBits >>>= shift;
  while (nChars > 0) {
    // Store 7 bits per byte for compatibility
    // with UTF-8 encoding of terms
    // each byte holds 7 bits and its top bit is 0, which is an ASCII character in UTF-8
    bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
    sortableBits >>>= 7;
  }
}
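
To experiment with this encoding without Lucene on the classpath, the same steps can be rewritten against a plain byte[]. The following is a sketch under stated assumptions, not the library code: PrefixCodedTermsDemo, encode, termsFor and compare are hypothetical names, and the marker base 0x20 is assumed to mirror SHIFT_START_LONG.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixCodedTermsDemo {
    static final int SHIFT_START = 0x20; // assumption: mirrors NumericUtils.SHIFT_START_LONG

    // Encode one (value, shift) pair: marker byte first, then 7 bits per byte.
    static byte[] encode(long val, int shift) {
        if (shift < 0 || shift > 63) throw new IllegalArgumentException("shift must be 0..63");
        int nChars = (63 - shift) / 7 + 1;
        byte[] out = new byte[nChars + 1];
        out[0] = (byte) (SHIFT_START + shift);
        long sortableBits = (val ^ 0x8000000000000000L) >>> shift; // flip sign bit, drop low bits
        for (int i = nChars; i > 0; i--) {
            out[i] = (byte) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return out;
    }

    // All terms generated for one value: shift 0, precisionStep, 2 * precisionStep, ...
    static List<byte[]> termsFor(long val, int precisionStep) {
        List<byte[]> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(encode(val, shift));
            if (precisionStep >= 64) break; // a single full-precision term is enough
        }
        return terms;
    }

    // Unsigned lexicographic comparison of two encoded terms.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(termsFor(123456L, 4).size());            // terms per long at step 4
        System.out.println(compare(encode(-5L, 0), encode(3L, 0))); // negative: -5 sorts before 3
    }
}
```

Two values that differ only in the bits shifted away, such as 0xF234 and 0xF23F at shift 4, encode to the same term, which is exactly the prefix sharing described in section 2.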

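5: Range queries

The general idea is to split the range starting from both ends. The low-order part at each end is split off as its own interval, then the bounds move up by precisionStep to the next higher-order digits, where the middle is merged into another interval.

Finally, each value in the small intervals is converted, according to its number of shifts, into a lexicographically sortable string in the same way as at index time, and then looked up.

Code:

The splitRange() method of the org.apache.lucene.util.NumericUtils class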

private static void splitRange(
    final Object builder, final int valSize,
    final int precisionStep, long minBound, long maxBound
  ) {
    if (precisionStep < 1)
      throw new IllegalArgumentException("precisionStep must be >=1");
    if (minBound > maxBound) return;
    for (int shift=0; ; shift += precisionStep) {
      // calculate new bounds for inner precision
      final long diff = 1L << (shift+precisionStep),
        mask = ((1L<<precisionStep) - 1L) << shift;
      final boolean
        hasLower = (minBound & mask) != 0L,
        hasUpper = (maxBound & mask) != mask;
      final long
        nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask,
        nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
      final boolean
        lowerWrapped = nextMinBound < minBound,
        upperWrapped = nextMaxBound > maxBound;

      if (shift+precisionStep>=valSize || nextMinBound>nextMaxBound || lowerWrapped || upperWrapped) {
        // We are in the lowest precision or the next precision is not available.
        addRange(builder, valSize, minBound, maxBound, shift);
        // exit the split recursion loop
        break;
      }

      if (hasLower)
        addRange(builder, valSize, minBound, minBound | mask, shift);
      if (hasUpper)
        addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

      // recurse to next precision
      minBound = nextMinBound;
      maxBound = nextMaxBound;
    }
  }

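For example, with precisionStep=4 the range 1001,0001 - 1111,0010 (0x91 - 0xF2) is split step by step into

1: 1001,0001 - 1001,1111 (shift 0: 0x91 - 0x9F, 15 terms)

and 1111,0000 - 1111,0010 (shift 0: 0xF0 - 0xF2, 3 terms)

2: 1010,0000 - 1110,1111 (after one right shift by 4 bits: 0xA - 0xE, 5 terms)

Looking up 23 lexicographically sortable strings is enough to cover the whole interval.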
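
The worked example can be checked by running the same loop against a plain list instead of Lucene's range builder. This is a sketch: SplitRangeDemo, split, range and termCount are names introduced here, each range is recorded as {min, max, shift}, and the helper range assumes that addRange sets the shifted-away low bits of the upper bound.

```java
import java.util.ArrayList;
import java.util.List;

class SplitRangeDemo {
    // Same control flow as splitRange above, collecting {min, max, shift} triples.
    static List<long[]> split(int valSize, int precisionStep, long minBound, long maxBound) {
        if (precisionStep < 1) throw new IllegalArgumentException("precisionStep must be >=1");
        List<long[]> ranges = new ArrayList<>();
        if (minBound > maxBound) return ranges;
        for (int shift = 0; ; shift += precisionStep) {
            final long diff = 1L << (shift + precisionStep);
            final long mask = ((1L << precisionStep) - 1L) << shift;
            final boolean hasLower = (minBound & mask) != 0L;
            final boolean hasUpper = (maxBound & mask) != mask;
            final long nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask;
            final long nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
            final boolean lowerWrapped = nextMinBound < minBound;
            final boolean upperWrapped = nextMaxBound > maxBound;
            if (shift + precisionStep >= valSize || nextMinBound > nextMaxBound
                    || lowerWrapped || upperWrapped) {
                ranges.add(range(minBound, maxBound, shift)); // lowest usable precision
                break;
            }
            if (hasLower) ranges.add(range(minBound, minBound | mask, shift));
            if (hasUpper) ranges.add(range(maxBound & ~mask, maxBound, shift));
            minBound = nextMinBound;
            maxBound = nextMaxBound;
        }
        return ranges;
    }

    // Assumed addRange behaviour: fill the low bits of max that the shift removes.
    static long[] range(long min, long max, int shift) {
        return new long[] {min, max | ((1L << shift) - 1L), shift};
    }

    // Number of distinct prefix terms needed to cover all ranges.
    static long termCount(List<long[]> ranges) {
        long total = 0;
        for (long[] r : ranges) {
            int shift = (int) r[2];
            total += (r[1] >>> shift) - (r[0] >>> shift) + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        List<long[]> ranges = split(64, 4, 0x91L, 0xF2L);
        for (long[] r : ranges) {
            System.out.println(String.format("0x%X .. 0x%X at shift %d", r[0], r[1], r[2]));
        }
        System.out.println("terms: " + termCount(ranges));
    }
}
```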

Official documentation:

http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/NumericRangeQuery.html

On the other hand, if the precisionStep is smaller, the maximum number of terms to match reduces, which optimizes query speed. The formula to calculate the maximum number of terms that will be visited while executing the query is:

$\mathrm{maxQueryTerms} = \left[ \left( \mathrm{indexedTermsPerValue} - 1 \right) \cdot \left(2^\mathrm{precisionStep} - 1 \right) \cdot 2 \right] + \left( 2^\mathrm{precisionStep} - 1 \right)$

For longs stored using a precision step of 4, maxQueryTerms = 15*15*2 + 15 = 465, and for a precision step of 2, maxQueryTerms = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking in the term enum of the index. Because of this, the ideal precisionStep value can only be found out by testing. Important: You can index with a lower precision step value and test search speed using a multiple of the original step value.
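
The two figures quoted above are easy to reproduce. The sketch below uses a made-up helper, maxQueryTerms, and takes indexedTermsPerValue to be the number of value bits divided by precisionStep, rounded up; it only makes sense for precisionStep values well below 63.

```java
class MaxQueryTermsDemo {
    // maxQueryTerms = (indexedTermsPerValue - 1) * (2^precisionStep - 1) * 2 + (2^precisionStep - 1)
    static long maxQueryTerms(int valueBits, int precisionStep) {
        long indexedTermsPerValue = (valueBits + precisionStep - 1) / precisionStep; // ceiling division
        long termsPerLevel = (1L << precisionStep) - 1L;
        return (indexedTermsPerValue - 1L) * termsPerLevel * 2L + termsPerLevel;
    }

    public static void main(String[] args) {
        System.out.println(maxQueryTerms(64, 4)); // long, precisionStep 4
        System.out.println(maxQueryTerms(64, 2)); // long, precisionStep 2
    }
}
```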

http://lucene.apache.org/core/4_10_2/core/index.html

To sort according to a LongField, use the normal numeric sort types, eg SortField.Type.LONG.

If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of Integer.MAX_VALUE.

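If the value is only used as a sort field and never needs range queries, specify the sort type SortField.Type.LONG when sorting.

In that case precisionStep can be set to Integer.MAX_VALUE. Only the lexicographically sortable string for shift 0 is then generated, which reduces the size of the index.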
