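When a field holds numeric values, Lucene does not behave quite perfectly and often produces unexpected "problems".
Below I look at this from three angles: indexing, sorting, and range queries (rangeQuery).
First, some preparation: building the index.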
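    import java.io.IOException;
    import java.util.Random;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // In-memory index shared by all the test methods in this post.
    RAMDirectory dir = new RAMDirectory();

    public void index() {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_36, analyzer));
            Random random = new Random();

            // f0 is a constant marker so that one TermQuery matches every document.
            Field f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
            // f1 and f2 hold numbers as plain strings.
            Field f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
            Field f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
            // f3 and f4 hold numbers as real numeric fields.
            NumericField f3 = new NumericField("f3", Store.YES, true);
            NumericField f4 = new NumericField("f4", Store.YES, true);

            for (int i = 0; i < 20; i++) {
                int value = random.nextInt(100);
                f1.setValue(value + "");
                // f2 and f4 each draw their own random fraction.
                f2.setValue(value + random.nextFloat() + "");
                f3.setIntValue(value);
                f4.setFloatValue(value + random.nextFloat());

                Document doc = new Document();
                doc.add(f0);
                doc.add(f1);
                doc.add(f2);
                doc.add(f3);
                doc.add(f4);
                writer.addDocument(doc);
            }
            writer.close();
        } catch (IOException e) {
            // CorruptIndexException and LockObtainFailedException are IOExceptions too.
            e.printStackTrace();
        }
    }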
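There are 5 fields in total:
f0: Field, always the constant "c", used only so that a single query matches every document;
f1: Field, filled with the string value of an int;
f2: Field, filled with the string value of a float;
f3: NumericField, filled with an int;
f4: NumericField, filled with a float.
There are 20 documents in total.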
Sorting
According to the Lucene API documentation, SortField defines the following sort types:
Field Summary | |
---|---|
static int | BYTE: Sort using term values as encoded Bytes. |
static int | CUSTOM: Sort using a custom Comparator. |
static int | DOC: Sort by document number (index order). |
static int | DOUBLE: Sort using term values as encoded Doubles. |
static SortField | FIELD_DOC: Represents sorting by document number (index order). |
static SortField | FIELD_SCORE: Represents sorting by document score (relevance). |
static int | FLOAT: Sort using term values as encoded Floats. |
static int | INT: Sort using term values as encoded Integers. |
static int | LONG: Sort using term values as encoded Longs. |
static int | SCORE: Sort by document score (relevance). |
static int | SHORT: Sort using term values as encoded Shorts. |
static int | STRING: Sort using term values as Strings. |
static int | STRING_VAL: Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons. |
Here we only care about STRING, INT and FLOAT.
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopFieldDocs;

    public void sort() {
        try {
            IndexReader reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            // Matches all 20 documents, since every one has f0 = "c".
            TermQuery query = new TermQuery(new Term("f0", "c"));

            // Enable exactly one SortField line to try that combination.
            // SortField field = new SortField("f1", SortField.STRING); // problematic
            // SortField field = new SortField("f1", SortField.INT);    // OK
            // SortField field = new SortField("f1", SortField.FLOAT);  // OK
            // SortField field = new SortField("f2", SortField.STRING); // problematic
            // SortField field = new SortField("f2", SortField.INT);    // problematic
            // SortField field = new SortField("f2", SortField.FLOAT);  // OK
            // SortField field = new SortField("f3", SortField.STRING); // problematic
            // SortField field = new SortField("f3", SortField.INT);    // OK
            // SortField field = new SortField("f3", SortField.FLOAT);  // OK
            // SortField field = new SortField("f4", SortField.STRING); // OK
            // SortField field = new SortField("f4", SortField.INT);    // OK
            SortField field = new SortField("f4", SortField.FLOAT);     // OK

            Sort sort = new Sort(field);
            TopFieldDocs docs = searcher.search(query, 20, sort);
            for (ScoreDoc sd : docs.scoreDocs) {
                Document doc = reader.document(sd.doc);
                System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                        + doc.get("f3") + "\t" + doc.get("f4"));
            }
            searcher.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
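The STRING results marked as problematic for the string-valued fields can be reproduced without Lucene at all. The sketch below is plain Java, not Lucene code; the class and method names are made up for illustration. It only shows that comparing numbers as text, character by character, gives a different order from comparing them as numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class StringVsNumericSort {

    // Sorts the values the way a plain string comparison does: character by character.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    // Sorts the values by the integer that each string represents.
    static List<String> sortAsInts(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy, new Comparator<String>() {
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "25", "3", "71");
        System.out.println(sortAsStrings(values)); // [10, 25, 3, 71, 9]
        System.out.println(sortAsInts(values));    // [3, 9, 10, 25, 71]
    }
}
```

In string order "10" comes before "3" because '1' is smaller than '3', which is the kind of result that makes a string sort on f1 look wrong.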
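The tests above show the following.
If you index with the Field class, sorting works as long as you specify the "right" data type. The STRING type certainly does not work, because the values are then compared as text rather than as numbers. And if the index holds the string value of a float, sorting with SortField.INT also causes trouble, with this exception: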
java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)
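The exception suggests that when sorting, Lucene first converts each String term to the numeric type you asked for; if that type is wrong (for example, converting 1.2 to an int), the conversion fails and you get this exception.
If you index with NumericField, sort with the same type that you indexed. Worrying about the other combinations is more trouble than it is worth.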
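The conversion argument can be checked with the JDK's own parsers. This is only a sketch of the idea, not the code path Lucene runs internally, and the helper names are hypothetical. It shows why integer text can be read as either INT or FLOAT, while float text can only be read as FLOAT.

```java
public class ParseAsDeclaredType {

    // Returns true if the text can be converted to an int.
    static boolean parsesAsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Returns true if the text can be converted to a float.
    static boolean parsesAsFloat(String text) {
        try {
            Float.parseFloat(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsInt("12"));     // true
        System.out.println(parsesAsFloat("12"));   // true
        System.out.println(parsesAsInt("12.7"));   // false
        System.out.println(parsesAsFloat("12.7")); // true
    }
}
```

This lines up with the test results for the string fields: f1 sorted fine as INT or FLOAT, while f2 only sorted fine as FLOAT.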
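One way to picture why a numeric field does not suffer from text ordering is that the number is stored in a form whose ordering matches its numeric ordering. The sketch below is a stand-alone illustration of such an encoding for floats. The helper name is hypothetical. Lucene's NumericUtils offers a floatToSortableInt method built on a similar bit trick, as far as I know, but treat this as an illustration rather than a description of Lucene's internals.

```java
public class SortableFloatBits {

    // Hypothetical helper: maps a float to an int whose signed ordering
    // matches the float's numeric ordering.
    static int toSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        // Negative floats have the sign bit set and compare "backwards" as raw
        // bits, so flip the other 31 bits to reverse their order.
        if (bits < 0) {
            bits ^= 0x7fffffff;
        }
        return bits;
    }

    public static void main(String[] args) {
        // Samples are listed in ascending numeric order.
        float[] samples = {-2.5f, -0.1f, 0.0f, 30.0f, 30.5f, 60.0f};
        for (int i = 1; i < samples.length; i++) {
            boolean ordered = toSortableInt(samples[i - 1]) < toSortableInt(samples[i]);
            System.out.println(samples[i - 1] + " < " + samples[i] + " : " + ordered);
        }
    }
}
```

With an encoding like this, comparing the stored ints is enough to compare the original floats, so no text parsing is involved at sort time.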
Range queries
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermRangeQuery;
    import org.apache.lucene.search.TopDocs;

    public void rangeSearch() {
        try {
            IndexReader reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);

            // Enable exactly one Query line to try that combination.
            // Query query = new TermRangeQuery("f1", "30", "60", true, true);          // problematic
            // Query query = NumericRangeQuery.newIntRange("f3", 30, 60, true, true);   // OK
            // Query query = new TermRangeQuery("f2", "30", "60", true, true);          // problematic
            Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true, true);  // OK

            TopDocs docs = searcher.search(query, 20);
            for (ScoreDoc sd : docs.scoreDocs) {
                Document doc = reader.document(sd.doc);
                System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                        + doc.get("f3") + "\t" + doc.get("f4"));
            }
            searcher.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
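Why the term-based range on f1 misbehaves can also be shown in plain Java. The sketch below does not use Lucene; the class and method names are invented for the example. It filters the same values once by string comparison and once by integer comparison, using the bounds 30 and 60 from the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TermRangeVsNumericRange {

    // Keeps values whose text lies between lower and upper in string order
    // (both bounds inclusive).
    static List<String> matchAsStrings(List<String> values, String lower, String upper) {
        List<String> hits = new ArrayList<String>();
        for (String v : values) {
            if (v.compareTo(lower) >= 0 && v.compareTo(upper) <= 0) {
                hits.add(v);
            }
        }
        return hits;
    }

    // Keeps values whose integer value lies between lower and upper
    // (both bounds inclusive).
    static List<String> matchAsInts(List<String> values, int lower, int upper) {
        List<String> hits = new ArrayList<String>();
        for (String v : values) {
            int n = Integer.parseInt(v);
            if (n >= lower && n <= upper) {
                hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("4", "35", "5", "60", "7", "59", "29");
        System.out.println(matchAsStrings(values, "30", "60")); // [4, 35, 5, 60, 59]
        System.out.println(matchAsInts(values, 30, 60));        // [35, 60, 59]
    }
}
```

The string comparison lets "4" and "5" through because their first character falls between '3' and '6', so a text range over numeric strings returns values that are numerically outside the bounds.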
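When searching we usually go through QueryParser, but QueryParser's range queries do not support numeric fields: Lucene does not record which fields are numeric, so QueryParser gives them no special treatment when parsing and builds an ordinary term range query.
In that case we can create a subclass of QueryParser, for example: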
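    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class NumericQueryParser extends QueryParser {

        // Public so that code outside this package can create the parser.
        public NumericQueryParser(Version matchVersion, String field, Analyzer a) {
            super(matchVersion, field, a);
        }

        @Override
        protected Query getRangeQuery(String field, String part1, String part2,
                boolean inclusive) throws ParseException {
            // f3 was indexed as an int NumericField, so build a numeric range for it.
            if ("f3".equals(field)) {
                return NumericRangeQuery.newIntRange(field,
                        Integer.parseInt(part1), Integer.parseInt(part2),
                        inclusive, inclusive);
            }
            // Every other field keeps the default term range behaviour.
            return super.getRangeQuery(field, part1, part2, inclusive);
        }
    }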
Using it for a range search:
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;

    public void rangeSearch() {
        try {
            IndexReader reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

            // QueryParser parser = new QueryParser(Version.LUCENE_36, "f0", analyzer); // problematic
            NumericQueryParser parser = new NumericQueryParser(
                    Version.LUCENE_36, "f0", analyzer);
            Query query = parser.parse("f3:[30 TO 60]");

            TopDocs docs = searcher.search(query, 20);
            for (ScoreDoc sd : docs.scoreDocs) {
                Document doc = reader.document(sd.doc);
                System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                        + doc.get("f3") + "\t" + doc.get("f4"));
            }
            searcher.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
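Notes to self:
1. Some problems are not worth overthinking at the surface level. Take the sorting above: if what you indexed is an int, sorting as int will certainly work, so there is no need to go on and try string or some other numeric type. There is little point!
2. If you do want to think these problems through properly, start from the root: start from the source code!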