[ Advanced Lucene ] Near-real-time (NRT) search in Lucene


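Setting everything else aside, let's first look at the following code snippets. All of them achieve "real-time" search.

Notes:

1. The Lucene version used here is 3.5.

2. To check whether search is "real-time", a simple test is used: whether numDocs changes after each added document.

3. Pay attention to what "real-time" means here, and keep it distinct from "near-real-time".

Approach 1: indexWriter commits every time, indexReader calls open(dir) every time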

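import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;

// Assumes two fields set up elsewhere: IndexWriter w and Directory dir.
public void nrtOpenDir() {
    try {
        Document doc = new Document();
        doc.add(new Field("f", "test", Store.YES, Index.ANALYZED));
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);
            w.commit(); // commit so that a newly opened reader can see the document
            IndexReader r = IndexReader.open(dir); // brand-new reader on every pass
            try {
                System.out.println(r.numDocs());
            } finally {
                r.close(); // release the reader's resources
            }
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}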

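In a test with a small amount of data, the approach above does give "real-time" search. It has two problems:

1. With a large amount of data, indexWriter.commit() is very time-consuming! (This may be a costly operation, so you should test the cost in your application and do it only when really necessary.)

2. With a large amount of data, IndexReader.open() is very time-consuming!

Therefore, do not use this approach in a real project!

Approach 2: indexWriter commits every time, indexReader calls reopen() every time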

/**
 * reopen -> openIfChanged
 */
public void nrtReopen() {
    try {
        Document doc = new Document();
        doc.add(new Field("f", "test", Store.YES, Index.ANALYZED));
        IndexReader r = IndexReader.open(dir);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);
            w.commit();
            // r = r.reopen(); // pre-3.5 API
            IndexReader nr = IndexReader.openIfChanged(r); // null if the index is unchanged
            if (nr != null) {
                r.close(); // release the stale reader
                r = nr;
            }
            System.out.println(r.numDocs());
        }
        r.close();
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

In 3.5, openIfChanged(r) replaces the reopen() method.

open(dir) really is an expensive call. openIfChanged is cheaper than open because it only loads the part of the index that is new. (Opening an IndexReader is an expensive operation. This method can be used to refresh an existing IndexReader to reduce these costs. This method tries to only load segments that have changed or were created after the IndexReader was (re)opened.)

However, this approach still requires a commit, so it is not recommended either!

Approach 3: indexWriter does not commit every time, indexReader calls open(indexWriter) every time

public void nrtNRT() {
    try {
        Document doc = new Document();
        doc.add(new Field("f", "test", Store.YES, Index.ANALYZED));
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc); // no commit here
            // IndexReader r = w.getReader(); // older API
            // The second argument is applyAllDeletes.
            IndexReader r = IndexReader.open(w, false);
            try {
                System.out.println(r.numDocs());
            } finally {
                r.close(); // release the reader's resources
            }
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

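Approach 3 has no commit. So how do IndexReader.open(w, false) and IndexReader.openIfChanged(r) compare in efficiency?

Let's run a simple experiment!

The experiment only compares the efficiency of IndexReader.open(w, false) and IndexReader.openIfChanged(r). The main code is as follows:

openIfChanged(r) approach (approach a):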

long bt = System.currentTimeMillis();
IndexReader r = IndexReader.open(dir);
for (int i = 0; i < readDocCount; i++) {
    IndexReader nr = IndexReader.openIfChanged(r); // null if nothing changed
    if (nr != null) {
        r.close(); // release the stale reader
        r = nr;
    }
}
long et = System.currentTimeMillis();
System.out.println("reopen:" + (et - bt) + "ms");

open(w) approach (approach b):

long bt = System.currentTimeMillis();
IndexReader r = null;
for (int i = 0; i < readDocCount; i++) {
    if (r != null) {
        r.close(); // release the previous reader before opening a new one
    }
    r = IndexReader.open(w, false);
}
long et = System.currentTimeMillis();
System.out.println("nrt:" + (et - bt) + "ms");

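Build a simple index with a single field, then add 100,000 documents. The test showed that approach a is indeed faster than approach b. As more documents are added to the index, the time taken by both approaches also increases.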
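The two timing loops above share the same shape, so the measurement could be factored into a small helper. The following is a minimal, JDK-only sketch and not the author's test code: `ReopenTiming` and `timeNanos` are hypothetical names, Java 8+ is assumed for the lambda, and the counter task in `main` is only a placeholder where a call such as IndexReader.openIfChanged(r) or IndexReader.open(w, false) would go.

```java
/** Hypothetical helper, not part of Lucene: times repeated runs of a task. */
final class ReopenTiming {

    /**
     * Runs the task `warmup` times untimed, then `iterations` times timed,
     * and returns the elapsed time of the timed runs in nanoseconds.
     */
    public static long timeNanos(Runnable task, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Placeholder task: replace with the reader-reopen call under test.
        long elapsed = timeNanos(() -> calls[0]++, 1000, 100000);
        System.out.println("calls=" + calls[0] + " elapsedMs=" + elapsed / 1_000_000.0);
    }
}
```

The untimed warm-up runs are there so that one-off costs, such as JIT compilation, weigh less on the measured loop. System.nanoTime() is used because it is meant for measuring elapsed time, whereas System.currentTimeMillis() reports wall-clock time.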

 

 

 


1. How NRT works

When you ask for the IndexReader from the IndexWriter, the IndexWriter will be flushed (docs accumulated in RAM will be written to disk) but not committed (fsync files, write new segments file, etc). The returned IndexReader will search over previously committed segments, as well as the new, flushed but not committed segment. Because flushing will likely be processor rather than IO bound, this should be a process that can be attacked with more processor power if found to be too slow.

Also, deletes are carried in RAM, rather than flushed to disk, which may help in eking out a bit more speed. The result is that you can add and remove documents from a Lucene index in 'near' real time by continuously asking for a new Reader from the IndexWriter every second or couple of seconds. I haven't seen a non-synthetic test yet, but it looks like it's been tested at around 50 document updates per second without heavy slowdown (e.g. the results are visible every second).

The patch takes advantage of LUCENE-1483, which keys FieldCaches and Filters at the individual segment level rather than at the index level. This allows you to reload caches per segment rather than per index, which is essential for real-time search with filter/cache use.
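The difference between flushing and committing described above can be illustrated with a toy model. This is a JDK-only sketch under loud assumptions: `ToyNrtIndex` and all of its methods are hypothetical, documents are plain strings, and nothing here is Lucene code or reflects Lucene's internals. It only mimics the visibility rule: a reader opened from the directory sees committed segments, while a reader obtained from the writer also sees segments that were flushed but not yet committed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: the names are hypothetical and are not Lucene classes. */
final class ToyNrtIndex {
    private final List<String> ramBuffer = new ArrayList<>();        // added, not yet flushed
    private final List<List<String>> flushed = new ArrayList<>();    // flushed, not committed
    private final List<List<String>> committed = new ArrayList<>();  // committed segments

    public void addDocument(String doc) {
        ramBuffer.add(doc);
    }

    /** Turns the RAM buffer into a new segment without committing it. */
    private void flush() {
        if (!ramBuffer.isEmpty()) {
            flushed.add(new ArrayList<>(ramBuffer));
            ramBuffer.clear();
        }
    }

    /** Flushes, then promotes every flushed segment to committed. */
    public void commit() {
        flush();
        committed.addAll(flushed);
        flushed.clear();
    }

    /** What a reader opened from the directory would count. */
    public int numDocsCommitted() {
        return count(committed);
    }

    /** What a reader obtained from the writer would count: flush first, no commit. */
    public int numDocsNearRealTime() {
        flush();
        return count(committed) + count(flushed);
    }

    private static int count(List<List<String>> segments) {
        int n = 0;
        for (List<String> segment : segments) {
            n += segment.size();
        }
        return n;
    }

    public static void main(String[] args) {
        ToyNrtIndex index = new ToyNrtIndex();
        index.addDocument("a");
        index.addDocument("b");
        System.out.println(index.numDocsCommitted());    // prints 0
        System.out.println(index.numDocsNearRealTime()); // prints 2
        index.commit();
        System.out.println(index.numDocsCommitted());    // prints 2
    }
}
```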

 

2. NRT efficiency with large data volumes

 

 

3. NRT implementation in Solr

Near realtime search means that documents are available for search almost immediately after being indexed - additions and updates to documents are seen in 'near' realtime.

Near realtime search will be added to Solr in version 4.0 and is currently available on trunk.

You can now modify a commit command to be a 'soft' commit. A soft commit will avoid parts of the standard commit that can be costly. You still will want to do normal commits to ensure that documents are on stable storage, but soft commits allow users to see a very near realtime view of the index in the meantime. Be sure to pay special attention to cache and autowarm settings as they can have a significant impact on NRT performance.

A common configuration might be to 'hard' auto commit every 1-10 minutes and 'soft' auto commit every second. With this configuration, new documents will show up within about a second of being added, and if the power goes out, you will be certain to have a consistent index up to the last 'hard' commit.
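The arithmetic behind that configuration can be sketched as follows. This is a simplified, JDK-only model and not Solr code: `CommitSchedule` and `nextTick` are hypothetical names, and the model assumes that soft and hard commits fire on a fixed periodic schedule starting at time 0, which may differ from how Solr actually schedules auto commits. It computes when a document added at a given time becomes searchable (next soft commit) and when it is on stable storage (next hard commit).

```java
/** Hypothetical model of fixed-interval commits; not part of Solr. */
final class CommitSchedule {

    /**
     * Returns the first tick at or after timeMs, where ticks occur at
     * 0, intervalMs, 2 * intervalMs, and so on. Assumes timeMs >= 0 and intervalMs > 0.
     */
    public static long nextTick(long timeMs, long intervalMs) {
        return ((timeMs + intervalMs - 1) / intervalMs) * intervalMs;
    }

    public static void main(String[] args) {
        long softIntervalMs = 1000;   // assumed: soft commit every second
        long hardIntervalMs = 60000;  // assumed: hard commit every minute
        long addedAtMs = 61250;       // example document arrival time

        long searchableAt = nextTick(addedAtMs, softIntervalMs);
        long durableAt = nextTick(addedAtMs, hardIntervalMs);

        System.out.println("searchable at " + searchableAt + " ms"); // searchable at 62000 ms
        System.out.println("durable at " + durableAt + " ms");       // durable at 120000 ms
    }
}
```

In this model the document is visible 750 ms after it was added, but a power failure before the 120000 ms mark would lose it, which matches the trade-off described above.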

 

 

 

 

