Faiss Tutorial: Indexes (1)

Reposted article, 2018-07-16 09:49, tag: faiss

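Indexes are the central concept in Faiss, so this article covers them in detail.

Summary of index methods

Some index names are kept in their original English form, since the English name is the more precise one to look up.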

Method	Class name	index_factory	Main parameters	Bytes/vector	Exhaustive	Comments
Exact search for L2	IndexFlatL2	"Flat"	d	4*d	yes	brute-force
Exact search for inner product	IndexFlatIP	"Flat"	d	4*d	yes	gives cosine similarity when the vectors are normalized; pass METRIC_INNER_PRODUCT to index_factory
Hierarchical Navigable Small World graph exploration	IndexHNSWFlat	"HNSWx,Flat"	d, M	4*d + 8*M	no	-
Inverted file	IndexIVFFlat	"IVFx,Flat"	quantizer, d, nlists, metric	4*d	no	needs another index, the quantizer, to assign vectors to the inverted lists
Locality-Sensitive Hashing (binary flat index)	IndexLSH	-	d, nbits	nbits/8	yes	optimized by using random rotation instead of random projections
Scalar quantizer (SQ) in flat mode	IndexScalarQuantizer	"SQ8"	d	d	yes	4 bits per component is also possible, at some cost in accuracy
Product quantizer (PQ) in flat mode	IndexPQ	"PQx"	d, M, nbits	M (if nbits=8)	yes	-
IVF and scalar quantizer	IndexIVFScalarQuantizer	"IVFx,SQ4" "IVFx,SQ8"	quantizer, d, nlists, qtype	d or d/2	no	two encodings: 4 bits or 8 bits per component
IVFADC (coarse quantizer+PQ on residuals)	IndexIVFPQ	"IVFx,PQy"	quantizer, d, nlists, M, nbits	M+4 or M+8	no	memory depends on the id type (int or long); currently only nbits <= 8 is supported
IVFADC+R (same as IVFADC with re-ranking based on codes)	IndexIVFPQR	"IVFx,PQy+z"	quantizer, d, nlists, M, nbits, M_refine, nbits_refine	M+M_refine+4 or M+M_refine+8	no	-
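
To make the bytes-per-vector column more concrete, the following sketch evaluates a few of the formulas from the table for a hypothetical data set. The function name and the example sizes are illustrative assumptions, and the figures cover only the per-vector payload listed in the table, not fixed overheads such as centroids or codebooks.

```python
def bytes_per_vector(kind, d, m=0, id_bytes=8):
    """Per-vector storage according to the table above (payload only)."""
    if kind == "Flat":        # raw float32 components
        return 4 * d
    if kind == "SQ8":         # one byte per component
        return d
    if kind == "PQ":          # m sub-quantizer codes of 8 bits each
        return m
    if kind == "IVFPQ":       # PQ code plus the stored vector id
        return m + id_bytes
    raise ValueError(f"unknown index kind: {kind}")


n, d = 1_000_000, 128         # hypothetical data set
for kind in ("Flat", "SQ8", "PQ", "IVFPQ"):
    total = n * bytes_per_vector(kind, d, m=16)
    print(f"{kind:6s} {total / 2**20:8.1f} MiB")
```

Under these assumptions a flat index needs 512 bytes per vector, while a 16-byte PQ code with an 8-byte id needs 24.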

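Cell-probe methods

A typical way to speed up search is to partition the data set, at the cost of no longer being guaranteed to find the true nearest neighbor. Faiss uses a partition-based method built on multi-probing (a variant of best-bin KD-trees).

The feature space is partitioned into ncells cells.
Each database vector is assigned to one of these cells (with k-means, the cell whose centroid is nearest in Euclidean distance), and the assignments are stored in an inverted file made of ncells inverted lists.
At search time, the nprobe cells closest to the query are selected.
The query is then compared with every vector stored in the inverted lists of those nprobe cells.

This is IndexIVFFlat. It needs another index, the quantizer, to assign vectors to the inverted lists.

IndexIVFKmeans and IndexIVFSphericalKmeans are not objects but functions that return an IndexIVFFlat object.

Note: for high-dimensional data, a large nprobe may be needed to reach good recall.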
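
The four steps above can be sketched in plain Python. This is only a conceptual illustration with hand-picked centroids and hypothetical function names, not Faiss's implementation; a real index would learn the centroids with k-means during training.

```python
import math


def nearest(centroids, x, n=1):
    """Indices of the n centroids closest to x (Euclidean distance)."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], x))
    return order[:n]


def build_inverted_lists(centroids, data):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, x in enumerate(data):
        lists[nearest(centroids, x)[0]].append(vec_id)
    return lists


def search(centroids, lists, data, query, k, nprobe):
    """Scan only the nprobe cells closest to the query."""
    candidates = [i for c in nearest(centroids, query, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(data[i], query))
    return candidates[:k]


# Toy 2-D data; the centroids are chosen by hand for the sake of the example.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(1.0, 1.0), (9.0, 1.0), (4.9, 0.0), (0.0, 9.0), (5.2, 0.0)]
lists = build_inverted_lists(centroids, data)

query = (5.1, 0.0)
print(search(centroids, lists, data, query, k=2, nprobe=1))  # misses a true neighbor
print(search(centroids, lists, data, query, k=2, nprobe=2))  # finds both
```

The query sits near a cell boundary, so one of its two true neighbors lives in the adjacent cell. Probing a single cell misses it, which is the recall/speed trade-off that nprobe controls.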

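Relationship with LSH

The most popular cell-probe method is probably the original LSH method; see E2LSH. However, this method and its variants have two major drawbacks:

They need a large number of hash functions (= partitions) to reach acceptable results.
The hash functions are hard to adapt to the input data, which tends to give suboptimal results in practice.

LSH example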

import faiss

# d, k, x_train, x_base and x_query are assumed to be defined (float32 arrays)
n_bits = 2 * d
lsh = faiss.IndexLSH(d, n_bits)
lsh.train(x_train)
lsh.add(x_base)
D, I = lsh.search(x_query, k)

d is the dimensionality of the input vectors, and n_bits is the number of bits used to store each vector.

PQ example
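
To illustrate what n_bits means, the following pure-Python sketch builds a binary code from the signs of random projections and compares codes by Hamming distance. It is a conceptual illustration only, not Faiss's implementation (the table above notes that IndexLSH uses a random rotation rather than plain random projections), and all function names are hypothetical.

```python
import random


def make_projections(d, n_bits, seed=0):
    """Draw n_bits random Gaussian directions in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_bits)]


def encode(projections, x):
    """One bit per projection: 1 if the dot product is positive, else 0."""
    code = 0
    for bit, p in enumerate(projections):
        if sum(pi * xi for pi, xi in zip(p, x)) > 0:
            code |= 1 << bit
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


d = 4
n_bits = 2 * d                      # same choice as in the example above
proj = make_projections(d, n_bits)
x = [1.0, 0.5, -0.3, 2.0]
print(n_bits // 8, "byte(s) per vector")
print(hamming(encode(proj, x), encode(proj, [2 * v for v in x])))  # scaling keeps every sign
print(hamming(encode(proj, x), encode(proj, [-v for v in x])))     # negation flips every sign
```

Vectors pointing in similar directions share most bits, so the Hamming distance between codes acts as a cheap stand-in for the angle between the vectors.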

import faiss

# d, k, x_train, x_base and x_query are assumed to be defined (float32 arrays)
m = 16                                   # number of subquantizers
n_bits = 8                               # bits allocated per subquantizer
pq = faiss.IndexPQ(d, m, n_bits)         # create the index
pq.train(x_train)                        # training
pq.add(x_base)                           # populate the index
D, I = pq.search(x_query, k)             # perform a search
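
As a rough picture of what m and n_bits control, the sketch below encodes a vector by splitting it into m sub-vectors and replacing each one with the id of its nearest centroid. The tiny codebooks are written by hand and purely hypothetical (a real index learns them in train), and the helper names are not part of Faiss.

```python
import math


def pq_encode(codebooks, x):
    """Replace each sub-vector of x by the id of its nearest centroid."""
    m = len(codebooks)
    dsub = len(x) // m                      # dimensions per sub-vector
    code = []
    for j, centroids in enumerate(codebooks):
        sub = x[j * dsub:(j + 1) * dsub]
        code.append(min(range(len(centroids)),
                        key=lambda c: math.dist(centroids[c], sub)))
    return code


def pq_decode(codebooks, code):
    """Approximate reconstruction: concatenate the selected centroids."""
    return [v for centroids, c in zip(codebooks, code) for v in centroids[c]]


# Hypothetical codebooks: m = 2 sub-quantizers, n_bits = 1 (2 centroids each).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 5.0), (5.0, 0.0)],
]
x = (0.9, 1.2, 4.0, 1.0)
code = pq_encode(codebooks, x)
print(code)                        # one small integer per sub-quantizer
print(pq_decode(codebooks, code))  # lossy reconstruction of x
```

Each vector is stored as m integers of n_bits bits, which is where the M bytes per vector (for nbits=8) in the table come from.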

PQ with an inverted file: IndexIVFPQ

# ncentroids (the number of inverted lists) and m are assumed to be defined
coarse_quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(coarse_quantizer, d,
                         ncentroids, m, 8)
index.nprobe = 5

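Composite indexes

Cell-probe method with a PQ as the coarse quantizer

The corresponding paper is: The inverted multi-index, Babenko & Lempitsky, CVPR'12. In Faiss this is available as MultiIndexQuantizer. No vectors need to be added to it, so quantizer_trains_alone must be set when it is used in an IndexIVF.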

nbits_mi = 12  # c
M_mi = 2       # m
coarse_quantizer_mi = faiss.MultiIndexQuantizer(d, M_mi, nbits_mi)
ncentroids_mi = 2 ** (M_mi * nbits_mi)

index = faiss.IndexIVFFlat(coarse_quantizer_mi, d, ncentroids_mi)
index.nprobe = 2048
index.quantizer_trains_alone = True
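
The sketch below shows why two small sub-quantizers yield so many cells: every combination of one centroid per sub-quantizer is a separate cell. How the per-sub-quantizer ids are packed into one cell id is an assumption made for illustration (sub-quantizer 0 in the lowest bits), and the helper names are hypothetical.

```python
nbits_mi = 12                        # bits per sub-quantizer, as in the example above
M_mi = 2                             # number of sub-quantizers
ncentroids_mi = 2 ** (M_mi * nbits_mi)


def compose_cell_id(sub_ids, nbits):
    """Pack one centroid id per sub-quantizer into a single cell id.

    Assumption: sub-quantizer 0 occupies the lowest bits.
    """
    cell = 0
    for j, sub_id in enumerate(sub_ids):
        cell |= sub_id << (j * nbits)
    return cell


def split_cell_id(cell, m, nbits):
    """Inverse of compose_cell_id."""
    mask = (1 << nbits) - 1
    return [(cell >> (j * nbits)) & mask for j in range(m)]


cell = compose_cell_id([5, 3], nbits_mi)
print(ncentroids_mi)                          # 16777216 cells from only 2 * 4096 centroids
print(cell, split_cell_id(cell, M_mi, nbits_mi))
```

With this many cells, most inverted lists are short or empty, which is presumably why the example above probes as many as 2048 of them.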

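Pre-filtering PQ codes

Computing Hamming distances between codes is about 6 times faster than computing PQ distances. With a suitable reordering of the PQ centroids (polysemous training), the Hamming distance between PQ codes becomes a good proxy for the PQ distance. Setting a Hamming-distance threshold at search time therefore avoids most of the PQ comparisons.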

# IndexPQ
index = faiss.IndexPQ(d, 16, 8)
# before training
index.do_polysemous_training = True
index.train (...)

# before searching
index.search_type = faiss.IndexPQ.ST_polysemous
index.polysemous_ht = 54    # the Hamming threshold
index.search (...)

# IndexIVFPQ (ncentroids, the number of inverted lists, is assumed to be defined)
index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, 16, 8)
# before training
index.do_polysemous_training = True
index.train (...)

# before searching
index.polysemous_ht = 54    # the Hamming threshold
index.search (...)

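Two points to keep in mind when setting the threshold:

The threshold lies between 0 and the number of bits per code (16*8 = 128 here).
The lower the threshold, the fewer PQ codes remain for the PQ distance computation; a value below half the number of bits per code is recommended.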
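
The effect of the threshold can be sketched as a two-stage search: a cheap Hamming test first, then the costly distance on the survivors only. The 8-bit codes, the stand-in distance and the strict comparison against the threshold are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def filtered_search(query_code, codes, threshold, costly_distance, k):
    """Run the costly distance only on codes that pass the Hamming test."""
    survivors = [i for i, c in enumerate(codes)
                 if hamming(query_code, c) < threshold]
    survivors.sort(key=costly_distance)
    return survivors[:k], len(survivors)


# Hypothetical 8-bit codes; the stand-in "costly" distance is an integer difference.
codes = [0b00001111, 0b00001110, 0b11110000, 0b00111111, 0b10101010]
query = 0b00001111


def stand_in_distance(i):
    return abs(codes[i] - query)


for threshold in (8, 4, 2):
    ids, n_compared = filtered_search(query, codes, threshold, stand_in_distance, k=2)
    print(threshold, ids, n_compared)
```

Lowering the threshold shrinks the number of costly comparisons. In this toy case the result stays the same, but a threshold that is too low would start discarding true neighbors.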

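Composite indexes can also be built with multiple levels of PQ quantization.

Pre- and post-processing

To get a better index, you can remap vector ids, apply transformations to the data, re-rank the search results, and so on.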

Faiss id mapping

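By default, Faiss assigns sequential ids to the vectors added to an index. Some Index classes implement an add_with_ids method that attaches a 64-bit id to each vector; searches then return these ids instead of the default sequential ones.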

import numpy as np
import faiss

# xb is assumed to be a float32 array of shape (n, d)
index = faiss.IndexFlatL2(xb.shape[1])
ids = np.arange(xb.shape[0], dtype='int64')
# index.add_with_ids(xb, ids)  # would fail: IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids)   # works, the vectors are stored in the underlying index

IndexIVF provides add_with_ids natively, so it does not need an IndexIDMap.

Pre-transforms
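
The idea behind an id-mapping wrapper can be sketched in a few lines: the wrapped index keeps working with sequential positions, and the wrapper translates them into the caller's ids on the way out. Both classes below are hypothetical toys, not the Faiss classes.

```python
class ToyFlatIndex:
    """Stand-in for an index that only knows sequential positions."""

    def __init__(self):
        self.vectors = []

    def add(self, xs):
        self.vectors.extend(xs)

    def search(self, q, k):
        def sq_dist(i):
            return sum((a - b) ** 2 for a, b in zip(self.vectors[i], q))
        return sorted(range(len(self.vectors)), key=sq_dist)[:k]


class ToyIDMap:
    """Wraps an index and translates positions into caller-supplied ids."""

    def __init__(self, index):
        self.index = index
        self.id_map = []              # position -> external id

    def add_with_ids(self, xs, ids):
        if len(xs) != len(ids):
            raise ValueError("need exactly one id per vector")
        self.index.add(xs)            # vectors live in the wrapped index
        self.id_map.extend(ids)

    def search(self, q, k):
        return [self.id_map[pos] for pos in self.index.search(q, k)]


index2 = ToyIDMap(ToyFlatIndex())
index2.add_with_ids([(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)], [1001, 1002, 1003])
print(index2.search((0.9, 0.9), k=2))
```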

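Transform	Class name	Comments
random rotation	RandomRotationMatrix	useful to re-balance components of a vector before indexing in an IndexPQ or IndexLSH
remapping of dimensions	RemapDimensionsTransform	reduces or increases the vector dimension d by rearranging dimensions, to match the dimension an index prefers
PCA	PCAMatrix	dimensionality reduction
OPQ rotation	OPQMatrix	OPQ rotates the input vectors to make them more amenable to PQ coding; see Optimized product quantization, Ge et al., CVPR'13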

Transforms are trained with train and applied to data with apply. They can be attached to an index through the IndexPreTransform class.

# the IndexIVFPQ will be in 256D, not 2048 (ncoarse is assumed to be defined)
coarse_quantizer = faiss.IndexFlatL2(256)
sub_index = faiss.IndexIVFPQ(coarse_quantizer, 256, ncoarse, 16, 8)
# PCA 2048->256
# also does a random rotation after the reduction (the 4th argument)
pca_matrix = faiss.PCAMatrix(2048, 256, 0, True)

# the wrapping index
index = faiss.IndexPreTransform(pca_matrix, sub_index)

# will also train the PCA
index.train(...)
# PCA will be applied prior to addition
index.add(...)
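
As a small aside on why a rotation is a safe pre-transform, the sketch below rotates two 2-D vectors by a fixed 45 degrees (a fixed angle is used instead of a random one so the result is reproducible). The energy that sat in a single component is spread over both, while the L2 distance between the vectors is unchanged. The helper is hypothetical and unrelated to the Faiss classes.

```python
import math


def rotate(x, angle):
    """Apply a 2-D rotation matrix to the vector x."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


a, b = (10.0, 0.0), (6.0, 0.0)          # all the energy sits in the first component
ra, rb = rotate(a, math.pi / 4), rotate(b, math.pi / 4)

print(ra)                                  # energy now spread over both components
print(math.dist(a, b), math.dist(ra, rb))  # the L2 distance is unchanged (up to rounding)
```

Since nearest-neighbor rankings depend only on distances, a rotation changes how the information is distributed across components without changing the search results of an exact index.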

IndexRefineFlat

Exact re-ranking of search results

q = faiss.IndexPQ(d, M, nbits_per_index)
rq = faiss.IndexRefineFlat(q)
rq.train(xt)
rq.add(xb)
rq.k_factor = 4
D, I = rq.search(xq, 10)

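The search fetches the 4*10 nearest neighbors from the IndexPQ, computes the true distance for each of them, and returns the 10 best results. Note that IndexRefineFlat has to store the full vectors, so its memory usage is high.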
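
The role of k_factor can be illustrated with a toy 1-D example. The "approximate" distance below rounds values to integers first, a hypothetical stand-in for the information lost by a compressed code; the function names are not part of Faiss.

```python
def refine_search(approx_dist, exact_dist, n, k, k_factor):
    """Shortlist k * k_factor ids by approximate distance, then re-rank exactly."""
    shortlist = sorted(range(n), key=approx_dist)[:k * k_factor]
    return sorted(shortlist, key=exact_dist)[:k]


# Hypothetical 1-D database and query.
xb = [0.40, 1.49, 0.70, 3.00, 0.52]
xq = 0.45


def approx(i):
    return abs(round(xb[i]) - round(xq))   # coarse: compares rounded values


def exact(i):
    return abs(xb[i] - xq)                 # uses the full-precision values


print(refine_search(approx, exact, len(xb), k=2, k_factor=1))  # no refinement margin
print(refine_search(approx, exact, len(xb), k=2, k_factor=2))  # shortlist of 4, re-ranked
```

With k_factor=1 the coarse ranking lets a far-away vector into the result. A larger shortlist gives the exact re-ranking a chance to recover the true neighbors, at the price of more exact distance computations.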

IndexShards

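If the data is split across several indexes, the result sets have to be merged at query time. This is required for multi-GPU setups and for parallel queries.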
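
The merge step can be sketched as follows, assuming every shard returns its own top results as (distance, id) pairs and that ids are unique across shards. The function name and the sample results are hypothetical.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Combine per-shard (distance, id) lists into the global top k."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nsmallest(k, all_hits)


# Hypothetical results from three shards, each already limited to its own top 2.
shard_results = [
    [(0.10, 7), (0.45, 3)],
    [(0.05, 12), (0.80, 15)],
    [(0.30, 21), (0.31, 22)],
]
print(merge_shard_results(shard_results, k=3))
```

Each shard must return at least k candidates of its own for the merged top k to be correct, since all k global neighbors could sit in a single shard.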
