Numeric fields (NumericField) in Lucene


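When a field holds numeric values, Lucene does not behave perfectly and often produces unexpected "problems".

Below we look at this from three angles: indexing, sorting, and range search (rangeQuery).

First, let's do the preparation work and build the index.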

RAMDirectory dir = new RAMDirectory();

	public void index() {
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
		try {
			IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
					Version.LUCENE_36, analyzer));
			Random random = new Random();
			Fieldable f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
			Fieldable f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
			Fieldable f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
			Fieldable f3 = new NumericField("f3", Store.YES, true);
			Fieldable f4 = new NumericField("f4", Store.YES, true);
			for (int i = 0; i < 20; i++) {
				int value = random.nextInt(100);
				((Field) f1).setValue(value + "");
				((Field) f2).setValue(value + random.nextFloat() + "");
				((NumericField) f3).setIntValue(value);
				((NumericField) f4).setFloatValue(value + random.nextFloat());
				Document doc = new Document();
				doc.add(f0);
				doc.add(f1);
				doc.add(f2);
				doc.add(f3);
				doc.add(f4);
				writer.addDocument(doc);
			}
			writer.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (LockObtainFailedException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

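There are five fields here:

f0: Field type, fixed value "c" (so that a single TermQuery matches every document);

f1: Field type, filled with the string value of an int;

f2: Field type, filled with the string value of a float;

f3: NumericField type, filled with an int;

f4: NumericField type, filled with a float.

There are 20 documents in total.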

 

Sorting

According to the Lucene API, the available sort types are:

Field Summary
static int BYTE 
          Sort using term values as encoded Bytes.
static int CUSTOM 
          Sort using a custom Comparator.
static int DOC 
          Sort by document number (index order).
static int DOUBLE 
          Sort using term values as encoded Doubles.
static SortField FIELD_DOC 
          Represents sorting by document number (index order).
static SortField FIELD_SCORE 
          Represents sorting by document score (relevance).
static int FLOAT 
          Sort using term values as encoded Floats.
static int INT 
          Sort using term values as encoded Integers.
static int LONG 
          Sort using term values as encoded Longs.
static int SCORE 
          Sort by document score (relevance).
static int SHORT 
          Sort using term values as encoded Shorts.
static int STRING 
          Sort using term values as Strings.
static int STRING_VAL 
          Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons.

Here we focus only on String, int, and float.
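As a plain-Java aside (no Lucene involved), the sketch below shows why the choice between a string sort and a numeric sort matters for digit strings like those stored in f1. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class StringVsNumericOrder {

    // Sorts the way a plain string comparison does: character by character.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    // Sorts by the integer that each string represents.
    static List<String> sortAsInts(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy, new Comparator<String>() {
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "30", "57", "4");
        System.out.println(sortAsStrings(values)); // prints [30, 4, 57, 9]
        System.out.println(sortAsInts(values));    // prints [4, 9, 30, 57]
    }
}
```

The string order puts "30" before "4" because '3' comes before '4', which is not what a reader expects from numeric data.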

public void sort() {
		IndexReader reader;
		try {
			reader = IndexReader.open(dir);
			IndexSearcher searcher = new IndexSearcher(reader);
			TermQuery query = new TermQuery(new Term("f0", "c"));
			// SortField field = new SortField("f1", SortField.STRING);// problematic
			// SortField field = new SortField("f1", SortField.INT);// OK
			// SortField field = new SortField("f1", SortField.FLOAT);// OK

			// SortField field = new SortField("f2", SortField.STRING);// problematic
			// SortField field = new SortField("f2", SortField.INT);// problematic
			// SortField field = new SortField("f2", SortField.FLOAT);// OK

			// SortField field = new SortField("f3", SortField.STRING);// problematic
			// SortField field = new SortField("f3", SortField.INT);// OK
			// SortField field = new SortField("f3", SortField.FLOAT);// OK

			// SortField field = new SortField("f4", SortField.STRING);// OK
			// SortField field = new SortField("f4", SortField.INT);// OK
			SortField field = new SortField("f4", SortField.FLOAT);// OK
			Sort sort = new Sort(field);
			TopFieldDocs docs = searcher.search(query, 20, sort);
			ScoreDoc[] sds = docs.scoreDocs;
			for (ScoreDoc sd : sds) {
				Document doc = reader.document(sd.doc);
				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
						+ doc.get("f3") + "\t" + doc.get("f4"));
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

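The tests above show the following.

If you index with the Field class, you can still sort correctly by specifying the "correct" data type at sort time. Sorting as String will certainly not work. And if what you indexed was the string value of a float, sorting with SortField.INT also fails, with the following exception: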

java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)

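Judging from the exception, when sorting, Lucene first converts each String term to the specified numeric type; if the wrong type is specified (for example, converting 1.2 to an int), it throws an exception.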
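The same kind of failure can be reproduced in plain Java. This is only an analogy and not Lucene's internal code (Lucene's message comes from its own decoding step), and the class and method names are invented for the example.

```java
class ParseAsDeclaredType {

    // True if the text can be read as an int.
    static boolean parsesAsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // True if the text can be read as a float.
    static boolean parsesAsFloat(String text) {
        try {
            Float.parseFloat(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "12" stands in for an f1 value, "12.34" for an f2 value.
        System.out.println(parsesAsInt("12"));      // prints true
        System.out.println(parsesAsFloat("12"));    // prints true
        System.out.println(parsesAsInt("12.34"));   // prints false
        System.out.println(parsesAsFloat("12.34")); // prints true
    }
}
```

This lines up with the test results: an int string can be read as either INT or FLOAT, while a float string can only be read as FLOAT.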

If you index with NumericField, sort with the same type you indexed. Considering other combinations is more trouble than it is worth.
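One way to see why the types should match: a NumericField does not store the decimal text but an encoded form of the number. The sketch below is a simplified, order-preserving float-to-int mapping modeled on my understanding of what NumericUtils.floatToSortableInt does. Treat the details as an assumption, not as the library's exact code. The class name is hypothetical.

```java
class SortableFloatSketch {

    // Maps a float to an int so that comparing the ints gives the same
    // order as comparing the floats (NaN aside).
    static int toSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        // Negative floats have the sign bit set; flipping the remaining
        // bits makes "more negative" map to a smaller int.
        if (bits < 0) {
            bits ^= 0x7fffffff;
        }
        return bits;
    }

    public static void main(String[] args) {
        float[] samples = { -2.5f, -1.5f, 0f, 0.5f, 30f, 99.9f };
        for (float f : samples) {
            System.out.println(f + " -> " + toSortableInt(f));
        }
    }
}
```

Under this mapping the float 30.0 and the int 30 end up as quite different integers, so reading a float-encoded field as an int (or the reverse) has no reason to give a meaningful order.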

 

Range search

public void rangeSearch() {
		IndexReader reader;
		try {
			reader = IndexReader.open(dir);
			IndexSearcher searcher = new IndexSearcher(reader);
			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
			// Query query = new TermRangeQuery("f1", "30", "60", true,
			// true);// problematic
			// Query query = NumericRangeQuery.newIntRange("f3", 30, 60,
			// true, true);// OK
			// Query query = new TermRangeQuery("f2", "30", "60", true,
			// true);// problematic
			Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true,
					true);// OK
			TopDocs docs = searcher.search(query, 20);
			ScoreDoc[] sds = docs.scoreDocs;
			for (ScoreDoc sd : sds) {
				Document doc = reader.document(sd.doc);
				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
						+ doc.get("f3") + "\t" + doc.get("f4"));
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
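The TermRangeQuery variants marked as problematic compare terms as strings, so the bounds "30" and "60" do not behave like the numbers 30 and 60. A plain-Java sketch of that effect for int values from 0 to 99, matching the range of f1 (the names are hypothetical and no Lucene classes are used):

```java
import java.util.ArrayList;
import java.util.List;

class StringRangeDemo {

    // Range test on the string form, compared character by character.
    static boolean inStringRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Range test on the numeric value.
    static boolean inIntRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    // Values in [0, limit) for which the two tests disagree.
    static List<Integer> disagreements(int lower, int upper, int limit) {
        List<Integer> result = new ArrayList<Integer>();
        String lo = String.valueOf(lower);
        String hi = String.valueOf(upper);
        for (int v = 0; v < limit; v++) {
            boolean asString = inStringRange(String.valueOf(v), lo, hi);
            if (asString != inIntRange(v, lower, upper)) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(disagreements(30, 60, 100)); // prints [4, 5, 6]
    }
}
```

The values 4, 5, and 6 fall inside the string range because "4", "5", and "6" sort between "30" and "60", even though none of them is a number between 30 and 60.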

 

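For searching we usually use QueryParser, but QueryParser's range queries do not support numeric fields: Lucene does not record which fields are numeric, so QueryParser gives them no special treatment when parsing.

In that case we can create a subclass of QueryParser, for example: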

public class NumericQueryParser extends QueryParser {

	protected NumericQueryParser(Version matchVersion, String field, Analyzer a) {
		super(matchVersion, field, a);
	}

	@Override
	protected org.apache.lucene.search.Query getRangeQuery(String field,
			String part1, String part2, boolean inclusive)
			throws ParseException {
		TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field,
				part1, part2, inclusive);
		if ("f3".equals(field)) {
			return NumericRangeQuery.newIntRange(field,
					Integer.parseInt(query.getLowerTerm()),
					Integer.parseInt(query.getUpperTerm()),
					query.includesLower(), query.includesUpper());
		} else {
			return query;
		}
	}

}
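The subclass above hard-codes the field name f3. One possible generalization, sketched here in plain Java without any Lucene types, is to keep a small table of field kinds and choose the range type from it. The names NumericFieldRegistry, Kind, and describeRange are hypothetical; in a real subclass each branch would return NumericRangeQuery.newIntRange, NumericRangeQuery.newFloatRange, or the result of super.getRangeQuery instead of a description string.

```java
import java.util.HashMap;
import java.util.Map;

class NumericFieldRegistry {

    enum Kind { INT, FLOAT, TEXT }

    private final Map<String, Kind> kinds = new HashMap<String, Kind>();

    // Records how a field was indexed.
    void register(String field, Kind kind) {
        kinds.put(field, kind);
    }

    // Fields that were never registered are treated as plain text.
    Kind kindOf(String field) {
        Kind kind = kinds.get(field);
        return kind == null ? Kind.TEXT : kind;
    }

    // Describes which kind of range a parser would build for this field.
    String describeRange(String field, String lower, String upper) {
        switch (kindOf(field)) {
            case INT:
                return "int range " + Integer.parseInt(lower) + ".." + Integer.parseInt(upper);
            case FLOAT:
                return "float range " + Float.parseFloat(lower) + ".." + Float.parseFloat(upper);
            default:
                return "term range " + lower + ".." + upper;
        }
    }

    public static void main(String[] args) {
        NumericFieldRegistry registry = new NumericFieldRegistry();
        registry.register("f3", Kind.INT);
        registry.register("f4", Kind.FLOAT);
        System.out.println(registry.describeRange("f3", "30", "60")); // prints int range 30..60
        System.out.println(registry.describeRange("f4", "30", "60"));
        System.out.println(registry.describeRange("f0", "a", "d"));   // prints term range a..d
    }
}
```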

  

Using it for a range search:

public void rangeSearch() {
		IndexReader reader;
		try {
			reader = IndexReader.open(dir);
			IndexSearcher searcher = new IndexSearcher(reader);
			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
			// QueryParser parser = new QueryParser(Version.LUCENE_36, "f0",
			// analyzer);// problematic
			NumericQueryParser parser = new NumericQueryParser(
					Version.LUCENE_36, "f0", analyzer);
			Query query = parser.parse("f3:[30 TO 60]");
			TopDocs docs = searcher.search(query, 20);
			ScoreDoc[] sds = docs.scoreDocs;
			for (ScoreDoc sd : sds) {
				Document doc = reader.document(sd.doc);
				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
						+ doc.get("f3") + "\t" + doc.get("f4"));
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} catch (ParseException e) {
			e.printStackTrace();
		}
	}

  

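Notes to self:

1. For some problems, don't overthink things at the surface level. Take the sorting above: if an int was indexed, sorting as int will certainly work, so there is no need to also try String or other numeric types. There is not much point!

2. To think these issues through properly, go to the root of the matter and start from the source code!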

 

 

