Examples of three Faiss index types: IndexFlatL2, IndexIVFFlat, and IndexIVFPQ

Reposted article, 2019-03-21 08:31. Tags: Faiss, IndexFlatL2, IndexIVFFlat, IndexIVFPQ

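The previous post gave a brief overview of installing Faiss and of some of the principles behind it. This post tests the three index types named in the title with code.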

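First, generate the dataset. It contains 1 million vectors of 50 dimensions each, and the generated data is saved to a local file. The code is as follows: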

import numpy as np

# Build the dataset
import time
d = 50                           # dimension
nb = 1000000                     # database size
# nq = 1000000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
# xq = np.random.random((nq, d)).astype('float32')
# xq[:, 0] += np.arange(nq) / 1000.

print(xb[:1])

# Write the array to a text file, one vector per line
np.savetxt('data.txt', xb)

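Using the dataset above, we query the data against itself, so the same vectors serve as both the Faiss training set and the query set. All three index types are exercised in a single file. In the listing below only the IndexIVFPQ block is active and the other two are commented out: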

import numpy as np
import faiss

# Read the file and build a NumPy matrix
data = []
with open('data.txt', 'rb') as f:
    for line in f:
        temp = line.split()
        data.append(temp)
print(data[0])
# The same array is used for training and as the query set
dataArray = np.array(data).astype('float32')

# print(dataArray[0])
# print(dataArray.shape[1])
# Get the dimensionality of the data
d = dataArray.shape[1]

# IndexFlatL2 index
# # Build an IndexFlatL2 index over the vectors; it is the simplest index type and performs only brute-force L2 distance search
# index = faiss.IndexFlatL2(d)   # build the index
# index.add(dataArray)                  # add vectors to the index
#
# # we want to see the 11 nearest neighbors
# k = 11
# # search
# D, I = index.search(dataArray, k)
#
# # neighbors of the 5 first queries
# print(I[:5])

# IndexIVFFlat index
# nlist = 100 # number of cells (clusters)
# k = 11
# quantizer = faiss.IndexFlatL2(d)  # coarse quantizer; d is the vector dimension
# index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# # METRIC_L2 is passed explicitly here so that L2 distance is used
#
# assert not index.is_trained
# index.train(dataArray)
# assert index.is_trained
# index.add(dataArray)                  # add may be a bit slower as well
# index.nprobe = 10        # number of cells visited per query (out of nlist); default nprobe is 1, try a few more
# D, I = index.search(dataArray, k)     # actual search
#
# print(I[:5]) # neighbors of the 5 first queries

# IndexIVFPQ index
nlist = 100
m = 5
k = 11
quantizer = faiss.IndexFlatL2(d)  # this remains the same
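# To scale to very large datasets, Faiss provides index variants that compress the stored vectors with a lossy method based on product quantization.
# The price is some loss of accuracy: even the distance from a vector to itself is no longer 0, because of the lossy compression.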
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
# 8 specifies that each sub-vector is encoded as 8 bits
index.train(dataArray)
index.add(dataArray)
# D, I = index.search(dataArray[:5], k) # sanity check
# print(I)
# print(D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(dataArray, k)     # search
print(I[:5])
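
To make the roles of nlist and nprobe in the script above more concrete, here is a minimal pure-Python sketch of the inverted-file idea behind IndexIVFFlat. It is not how Faiss is implemented. The helper names (sq_l2, build_inverted_lists, ivf_search) and the toy data are invented for illustration, and the centroids are fixed by hand here, whereas Faiss learns them with k-means during train().

```python
import heapq

def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_inverted_lists(vectors, centroids):
    """Put each vector id into the cell of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: sq_l2(vec, centroids[c]))
        lists[cell].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: sq_l2(query, centroids[c]))
    candidates = [vid for c in cells for vid in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda vid: sq_l2(query, vectors[vid]))

# Toy 2-D data (hypothetical): two cells, i.e. nlist = 2
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (2.4, 2.4)]
centroids = [(0.0, 0.0), (5.0, 5.0)]   # assumed fixed; Faiss would learn these in train()
lists = build_inverted_lists(vectors, centroids)

query = (2.6, 2.6)
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=1))  # [3]
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=2))  # [5]
```

With nprobe=1 the query only scans the cell around (5.0, 5.0) and misses its true nearest neighbor, vector 5, which sits just across the cell boundary. With nprobe=2 every cell is scanned and the exact answer is recovered, at the cost of more distance computations.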

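The results and run times of the three indexes were summarized in a figure in the original post; the figure is not reproduced here.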
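
One way to put a number on how closely an approximate index matches the exhaustive search is recall@k: the fraction of the reference neighbors that the approximate index also returns. The sketch below is a hypothetical helper, not part of Faiss. It assumes the I arrays from two separate runs of the script (for example one with IndexFlatL2 and one with IndexIVFFlat) are available, and it skips the -1 ids that Faiss uses to pad missing results.

```python
def recall_at_k(reference_ids, candidate_ids):
    """Fraction of reference neighbor ids that also appear in the candidate rows.

    Both arguments are row-aligned sequences of id rows, such as the I arrays
    returned by index.search(). Ids equal to -1 (padding) are ignored.
    """
    hits = 0
    total = 0
    for ref_row, cand_row in zip(reference_ids, candidate_ids):
        ref_set = {int(i) for i in ref_row if int(i) != -1}
        hits += sum(1 for i in cand_row if int(i) in ref_set)
        total += len(ref_set)
    return hits / total if total else 0.0

# Tiny hand-made example (hypothetical ids, k = 3, two queries)
exact = [[0, 1, 2], [3, 4, 5]]
approx = [[0, 9, 2], [3, 4, -1]]
print(recall_at_k(exact, approx))
```

In practice one would save the I array of the IndexFlatL2 run, for instance with np.save, and pass it as reference_ids against the I array of each approximate index.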

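The results show that adding clustering (IndexIVFFlat) makes the search much faster than brute-force search (IndexFlatL2) while returning essentially the same neighbors. Adding clustering plus quantization (IndexIVFPQ) is faster still, but its results differ considerably from those of brute-force search. When the dataset is not very large and the dimensionality is not high, the clustering-based index is the recommended choice.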
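
The accuracy gap seen with IndexIVFPQ is consistent with how strongly it compresses each vector. The arithmetic below uses the parameters from this post (d = 50, m = 5, 8 bits per sub-vector). The function name pq_code_size is invented for illustration, and the figures cover only the vector payload, not the stored ids or the trained centroids.

```python
def pq_code_size(d, m, nbits):
    """Return (sub-vector dimension, raw float32 bytes, PQ code bytes) per vector."""
    if d % m != 0:
        raise ValueError("d must be divisible by m for product quantization")
    sub_dim = d // m                   # dimensions covered by each sub-quantizer
    raw_bytes = d * 4                  # float32 takes 4 bytes per component
    code_bytes = (m * nbits + 7) // 8  # m codes of nbits each, rounded up to bytes
    return sub_dim, raw_bytes, code_bytes

nb = 1000000                           # database size used in this post
sub_dim, raw_bytes, code_bytes = pq_code_size(d=50, m=5, nbits=8)
print(sub_dim, raw_bytes, code_bytes)        # 10 200 5
print(raw_bytes * nb, code_bytes * nb)       # 200000000 5000000
```

Each 10-dimensional sub-vector is replaced by one of 256 codebook entries, so a 200-byte vector shrinks to 5 bytes. Distances are then computed against these coarse reconstructions, which helps explain both the speed-up and the less accurate neighbor lists.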
