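Faiss Vector Similarity Search

Reposted article (see the original post), 2020-04-02 22:06. Tags: faiss, Faiss

See the official site; you can also visit my GitHub to run the code.

This article is compiled from the official documentation. To avoid misreadings or oversimplification, some passages are quoted directly from the official English text. I have also added a few notes of my own.

Data preparation

Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s. These collections are stored in matrices. We assume row-major storage, i.e. the j'th component of vector number i is stored in row i, column j of the matrix. Faiss uses only 32-bit floating point matrices.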
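Because Faiss expects 32-bit floats in row-major layout, data coming from elsewhere usually needs a conversion first. A minimal sketch using NumPy; the helper name to_faiss_matrix is made up for this example and is not part of Faiss:

```python
import numpy as np

def to_faiss_matrix(vectors):
    """Return `vectors` as a 2-D, C-contiguous (row-major) float32 array.

    Hypothetical helper for illustration; not part of the Faiss API.
    """
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        # Treat a single vector as a 1-by-d matrix.
        arr = arr.reshape(1, -1)
    if arr.ndim != 2:
        raise ValueError("expected an array of shape (n, d)")
    return arr

# float64, column-major (Fortran order) input, as another library might produce
raw = np.asfortranarray(np.random.random((5, 64)))
x = to_faiss_matrix(raw)
print(x.dtype, x.shape, x.flags['C_CONTIGUOUS'])
```

The conversion copies only when it has to: an array that is already float32 and C-contiguous is returned without a copy.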

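We need two matrices:

xb for the database, which contains all the vectors that must be indexed and that we are going to search in. Its size is nb-by-d.
xq for the query vectors, for which we need to find the nearest neighbors. Its size is nq-by-d. If we have a single query vector, nq=1.

In the following example we work with vectors in d=64 dimensions whose components are drawn from a uniform distribution on [0, 1). To make things a little more interesting, we add a small translation along the first dimension that depends on the vector index.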

import numpy as np

d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

# import matplotlib.pyplot as plt 

# plt.hist(xb[6])
# plt.show()

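Building an index and adding vectors to it

Faiss is built around the Index object. It encapsulates the set of database vectors and can preprocess them to make searching efficient. There are many types of indexes; we use the simplest one, which performs a brute-force L2 distance search: IndexFlatL2.

All indexes need to know the dimensionality d of the vectors when they are built. Most indexes also require a training phase to analyze the distribution of the vectors. For IndexFlatL2 this step can be skipped (a brute-force search has nothing to learn from the data).

Once the index is built and trained, two operations can be performed on it: add and search.

To add vectors to the index, call add on xb. The index also exposes two state variables:

is_trained, a boolean that is True when the index needs no (further) training
ntotal, the number of indexed vectors

Some indexes can also store an integer ID for each vector (IndexFlatL2 cannot). If no IDs are provided, add uses the vector's ordinal number as its ID: the first vector gets 0, the second 1, and so on.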

import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

Output:

True
100000

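Searching

The basic operation that can be performed on an index is the k-nearest-neighbor (kNN) search, i.e. for each query vector, find its k nearest neighbors in the database.

The result is stored in an integer matrix of size nq-by-k, where row i contains the IDs of the neighbors of query vector i, sorted by increasing distance. In addition to this matrix, search returns an nq-by-k floating-point matrix with the corresponding squared distances.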
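To make the shapes and ordering of the two result matrices concrete, here is a plain NumPy sketch of an exhaustive squared-L2 kNN search. It illustrates what a brute-force search computes and is not how Faiss implements it; knn_l2 and the small demo sizes are invented for this example:

```python
import numpy as np

def knn_l2(xb, xq, k):
    """Exhaustive kNN: return (D, I), both of shape (nq, k).

    D holds squared L2 distances, I the matching row numbers of xb;
    every row is sorted by increasing distance.
    """
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    sq = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(axis=1)
    sq = np.maximum(sq, 0)                    # clip tiny negative rounding errors
    I = np.argsort(sq, axis=1)[:, :k]         # k smallest distances per query
    D = np.take_along_axis(sq, I, axis=1)
    return D, I

rng = np.random.default_rng(1234)
xb = rng.random((2000, 64), dtype=np.float32)   # small database for the demo
xq = rng.random((10, 64), dtype=np.float32)

D, I = knn_l2(xb, xq, k=4)
print(I.shape, D.shape)   # both (nq, k)
```

A full sort per query is wasteful for large databases; it is used here only because it keeps the sketch short.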

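Common distance measures: https://zhuanlan.zhihu.com/p/101277851

As a sanity check, we first search with a few of the database vectors themselves, to make sure that the nearest neighbor of each one is indeed the vector itself.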

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(xb.shape)
print('I', I)
print('D', D)

Output:

(100000, 64)
I [[  0 393 363  78]
[  1 555 277 364]
[  2 304 101  13]
[  3 173  18 182]
[  4 288 370 531]]
D [[0.        7.1751733 7.207629  7.2511625]
[0.        6.3235645 6.684581  6.7999454]
[0.        5.7964087 6.391736  7.2815123]
[0.        7.2779055 7.5279865 7.6628466]
[0.        6.7638035 7.2951202 7.3688145]]

Now search with the query vectors:

D, I = index.search(xq, k)     # actual search
print('I[:5]', I[:5])          # neighbors of the 5 first queries
print('D[:5]', D[:5])
print('-----')
print('I[-5:]', I[-5:])        # neighbors of the 5 last queries
print('D[-5:]', D[-5:])

Output:

I[:5] [[ 381  207  210  477]
[ 526  911  142   72]
[ 838  527 1290  425]
[ 196  184  164  359]
[ 526  377  120  425]]
D[:5] [[6.8154984 6.8894653 7.3956795 7.4290257]
[6.6041107 6.679695  6.7209625 6.828682 ]
[6.4703865 6.8578606 7.0043793 7.036564 ]
[5.573681  6.407543  7.1395226 7.3555984]
[5.409401  6.232216  6.4173393 6.5743675]]
-----
I[-5:] [[ 9900 10500  9309  9831]
[11055 10895 10812 11321]
[11353 11103 10164  9787]
[10571 10664 10632  9638]
[ 9628  9554 10036  9582]]
D[-5:] [[6.5315704 6.97876   7.0039215 7.013794 ]
[4.335266  5.2369385 5.3194275 5.7032776]
[6.072693  6.5767517 6.6139526 6.7323   ]
[6.637512  6.6487427 6.8578796 7.0096436]
[6.2183685 6.4525146 6.548767  6.581299 ]]

Results

For the sanity check, where we search with vectors taken from the database, the output is:

[[  0 393 363  78]
[  1 555 277 364]
[  2 304 101  13]
[  3 173  18 182]
[  4 288 370 531]]

[[ 0.          7.17517328  7.2076292   7.25116253]
[ 0.          6.32356453  6.6845808   6.79994535]
[ 0.          5.79640865  6.39173603  7.28151226]
[ 0.          7.27790546  7.52798653  7.66284657]
[ 0.          6.76380348  7.29512024  7.36881447]]

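We can see that:

the upper matrix holds the kNN IDs, and the nearest neighbor of each query is indeed the vector itself
the lower matrix holds the distances: the first one in each row is 0, and each row is sorted in increasing order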
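These two observations can also be checked programmatically. A small sketch that uses the printed matrices as literal input; check_self_search is a hypothetical helper name, not a Faiss function:

```python
import numpy as np

# Results of index.search(xb[:5], 4), copied from the output above
I = np.array([[0, 393, 363, 78],
              [1, 555, 277, 364],
              [2, 304, 101, 13],
              [3, 173, 18, 182],
              [4, 288, 370, 531]])
D = np.array([[0., 7.1751733, 7.207629, 7.2511625],
              [0., 6.3235645, 6.684581, 6.7999454],
              [0., 5.7964087, 6.391736, 7.2815123],
              [0., 7.2779055, 7.5279865, 7.6628466],
              [0., 6.7638035, 7.2951202, 7.3688145]], dtype='float32')

def check_self_search(D, I):
    """True if query i found itself first at distance 0 and every
    row of D is sorted in increasing order."""
    n = I.shape[0]
    self_first = np.array_equal(I[:, 0], np.arange(n))
    zero_dist = np.all(D[:, 0] == 0)
    ascending = np.all(np.diff(D, axis=1) >= 0)
    return bool(self_first and zero_dist and ascending)

print(check_self_search(D, I))
```

The exact comparison with 0 is fine for these printed values; for other index types or larger batches a small tolerance may be more appropriate.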

Searching with the query vectors gives the following neighbor IDs (first five and last five queries):

[[ 381  207  210  477]
[ 526  911  142   72]
[ 838  527 1290  425]
[ 196  184  164  359]
[ 526  377  120  425]]

[[ 9900 10500  9309  9831]
[11055 10895 10812 11321]
[11353 11103 10164  9787]
[10571 10664 10632  9638]
[ 9628  9554 10036  9582]]

Because of the value added to the first component of the vectors, the dataset is smeared along the first axis in d-dim space. So the neighbors of the first few vectors are around the beginning of the dataset, and the ones of the vectors around ~10000 are also around index 10000 in the dataset.
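The effect can be checked against the printed neighbor IDs. A small sketch that uses only the numbers shown above; the offsets follow from the index / 1000 term added to the first component in the data preparation code:

```python
import numpy as np

nq = 10000
# Neighbor IDs of the first five and last five queries, copied from the output above
I_first = np.array([[381, 207, 210, 477],
                    [526, 911, 142, 72],
                    [838, 527, 1290, 425],
                    [196, 184, 164, 359],
                    [526, 377, 120, 425]])
I_last = np.array([[9900, 10500, 9309, 9831],
                   [11055, 10895, 10812, 11321],
                   [11353, 11103, 10164, 9787],
                   [10571, 10664, 10632, 9638],
                   [9628, 9554, 10036, 9582]])

first_q = np.arange(5)[:, None]            # query numbers 0..4
last_q = np.arange(nq - 5, nq)[:, None]    # query numbers 9995..9999

# Average neighbor ID for each group of queries
print(I_first.mean(), I_last.mean())

# Largest gap between a query number and one of its neighbor IDs, and the
# matching difference between their first-component offsets (index / 1000)
gap = max(np.abs(I_first - first_q).max(), np.abs(I_last - last_q).max())
print(gap, gap / 1000.)
```

The neighbor IDs stay within roughly a thousand positions of the query number, which corresponds to an offset difference on the order of 1 along the first axis.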
