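When a field holds numeric values, Lucene does not behave perfectly and often produces unexpected "problems".
Below we analyze this from three angles: indexing, sorting, and range search (rangeQuery).
First, we do the preparation and build the index.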
RAMDirectory dir = new RAMDirectory();
public void index() {
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
try {
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
Version.LUCENE_36, analyzer));
Random random = new Random();
Fieldable f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
Fieldable f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
Fieldable f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
Fieldable f3 = new NumericField("f3", Store.YES, true);
Fieldable f4 = new NumericField("f4", Store.YES, true);
for (int i = 0; i < 20; i++) {
int value = random.nextInt(100);
((Field) f1).setValue(value + "");
((Field) f2).setValue(value + random.nextFloat() + "");
((NumericField) f3).setIntValue(value);
((NumericField) f4).setFloatValue(value + random.nextFloat());
Document doc = new Document();
doc.add(f0);
doc.add(f1);
doc.add(f2);
doc.add(f3);
doc.add(f4);
writer.addDocument(doc);
}
writer.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (LockObtainFailedException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
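There are 5 fields in total:
f0: Field type, always holding the constant "c", so that a single TermQuery matches every document;
f1: Field type, holding the string form of an int;
f2: Field type, holding the string form of a float;
f3: NumericField type, holding an int;
f4: NumericField type, holding a float.
There are 20 documents in total.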
Sorting
According to the Lucene API, the available sort types are as follows:
| Field Summary | |
|---|---|
| static int | BYTE: Sort using term values as encoded Bytes. |
| static int | CUSTOM: Sort using a custom Comparator. |
| static int | DOC: Sort by document number (index order). |
| static int | DOUBLE: Sort using term values as encoded Doubles. |
| static SortField | FIELD_DOC: Represents sorting by document number (index order). |
| static SortField | FIELD_SCORE: Represents sorting by document score (relevance). |
| static int | FLOAT: Sort using term values as encoded Floats. |
| static int | INT: Sort using term values as encoded Integers. |
| static int | LONG: Sort using term values as encoded Longs. |
| static int | SCORE: Sort by document score (relevance). |
| static int | SHORT: Sort using term values as encoded Shorts. |
| static int | STRING: Sort using term values as Strings. |
| static int | STRING_VAL: Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons. |
Here we only look at STRING, INT, and FLOAT.
public void sort() {
IndexReader reader;
try {
reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
TermQuery query = new TermQuery(new Term("f0", "c"));
// SortField field = new SortField("f1", SortField.STRING);// problematic
// SortField field = new SortField("f1", SortField.INT);// works
// SortField field = new SortField("f1", SortField.FLOAT);// works
// SortField field = new SortField("f2", SortField.STRING);// problematic
// SortField field = new SortField("f2", SortField.INT);// problematic
// SortField field = new SortField("f2", SortField.FLOAT);// works
// SortField field = new SortField("f3", SortField.STRING);// problematic
// SortField field = new SortField("f3", SortField.INT);// works
// SortField field = new SortField("f3", SortField.FLOAT);// works
// SortField field = new SortField("f4", SortField.STRING);// works
// SortField field = new SortField("f4", SortField.INT);// works
SortField field = new SortField("f4", SortField.FLOAT);// works
Sort sort = new Sort(field);
TopFieldDocs docs = searcher.search(query, 20, sort);
ScoreDoc[] sds = docs.scoreDocs;
for (ScoreDoc sd : sds) {
Document doc = reader.document(sd.doc);
System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
+ doc.get("f3") + "\t" + doc.get("f4"));
}
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
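The tests above show the following.
If the index is built with the Field class, sorting works as long as you specify the "correct" data type. Sorting as STRING does not give numeric order, because the values are compared as text. And if the field holds the string form of a float, sorting with SortField.INT also fails, with the following exception: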
java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)
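From the exception we can infer that, when sorting, Lucene first converts the stored string to the specified numeric type; if the wrong type is specified (for example, converting 1.2 to an int), the conversion fails. The message mentions a prefix-coded string because, in Lucene 3.x, the field cache falls back to the decoder used for NumericField terms once plain parsing of the text has failed.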
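The two failure modes described above can be reproduced without Lucene. The following is a minimal sketch in plain Java, not Lucene's own code; the class and method names are made up for illustration. It shows that text comparison puts "100" before "25", and that a float written as text cannot be parsed as an int.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class StringVersusNumericSort {

    // Sorts the values as plain strings, i.e. character by character.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    // Sorts the same values after parsing each one as a float.
    static List<String> sortAsFloats(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy, new Comparator<String>() {
            public int compare(String a, String b) {
                return Float.compare(Float.parseFloat(a), Float.parseFloat(b));
            }
        });
        return copy;
    }

    // Returns true if the text can be parsed as an int.
    static boolean parsesAsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "100", "25");
        System.out.println(sortAsStrings(values)); // [10, 100, 25, 9]
        System.out.println(sortAsFloats(values));  // [9, 10, 25, 100]
        System.out.println(parsesAsInt("25"));     // true
        System.out.println(parsesAsInt("25.7"));   // false
    }
}
```

This corresponds to the f1 and f2 cases: STRING ordering is wrong for both, and INT parsing only succeeds when the stored text really is an integer.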
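If the index is built with NumericField, sort with the same type that was indexed. Working out what happens with the other combinations is not worth the trouble.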
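One reason the indexed type matters is that a NumericField does not store decimal text but a binary encoding of the value. The sketch below shows the general idea of turning a float into an int whose ordering matches the float ordering. As far as I understand it, this is the same trick that Lucene's NumericUtils.floatToSortableInt relies on, but the class and method names here are hypothetical and this is not Lucene's code.

```java
class SortableFloatBits {

    // Maps a float to an int whose signed ordering matches the float ordering.
    static int toSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        // Negative floats have the sign bit set; flipping the other 31 bits
        // reverses their order so that more negative values sort first.
        return bits < 0 ? bits ^ 0x7fffffff : bits;
    }

    // Inverse of toSortableInt: the sign bit is unchanged by the XOR,
    // so the same test recovers the original bits.
    static float fromSortableInt(int sortable) {
        int bits = sortable < 0 ? sortable ^ 0x7fffffff : sortable;
        return Float.intBitsToFloat(bits);
    }

    // Returns true if the encoded ints are strictly increasing.
    static boolean encodedOrderMatches(float[] ascending) {
        for (int i = 1; i < ascending.length; i++) {
            if (toSortableInt(ascending[i - 1]) >= toSortableInt(ascending[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] ascending = { -3.5f, -0.25f, 0f, 12.75f, 59.9f };
        System.out.println(encodedOrderMatches(ascending));          // true
        System.out.println(fromSortableInt(toSortableInt(12.75f)));  // 12.75
    }
}
```

Reading such an encoded value back as a different type means interpreting the same bits differently, which is why sticking to the indexed type is the safe choice.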
Range search
public void rangeSearch() {
IndexReader reader;
try {
reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
// Query query = new TermRangeQuery("f1", "30", "60", true,
// true);// problematic
// Query query = NumericRangeQuery.newIntRange("f3", 30, 60,
// true, true);// works
// Query query = new TermRangeQuery("f2", "30", "60", true,
// true);// problematic
Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true,
true);// works
TopDocs docs = searcher.search(query, 20);
ScoreDoc[] sds = docs.scoreDocs;
for (ScoreDoc sd : sds) {
Document doc = reader.document(sd.doc);
System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
+ doc.get("f3") + "\t" + doc.get("f4"));
}
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
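The TermRangeQuery cases marked as problematic above compare the bounds "30" and "60" with the indexed terms as text. The following plain-Java sketch (hypothetical names, no Lucene dependency) imitates that comparison with String.compareTo and contrasts it with a numeric comparison.

```java
class LexicographicRange {

    // True if value lies in [lower, upper] when all three are compared as
    // strings, which is how a term-based range treats its bounds.
    static boolean inStringRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // True if value lies in [lower, upper] when compared as an int.
    static boolean inIntRange(String value, int lower, int upper) {
        int n = Integer.parseInt(value);
        return n >= lower && n <= upper;
    }

    public static void main(String[] args) {
        String[] values = { "4", "45", "7", "100" };
        for (String value : values) {
            System.out.println(value + "\t"
                    + inStringRange(value, "30", "60") + "\t"
                    + inIntRange(value, 30, 60));
        }
        // 4    true   false
        // 45   true   true
        // 7    false  false
        // 100  false  false
    }
}
```

With values between 0 and 99, as in the index above, a text range from "30" to "60" therefore also matches single-digit values such as 4 and 5.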
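When searching we usually rely on QueryParser, but its range queries do not support numeric fields: Lucene does not record which fields are numeric, so QueryParser gives them no special handling when it parses a query.
In that case we can create a subclass of QueryParser, for example: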
public class NumericQueryParser extends QueryParser {
public NumericQueryParser(Version matchVersion, String field, Analyzer a) {
super(matchVersion, field, a);
}
@Override
protected org.apache.lucene.search.Query getRangeQuery(String field,
String part1, String part2, boolean inclusive)
throws ParseException {
TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field,
part1, part2, inclusive);
if ("f3".equals(field)) {
return NumericRangeQuery.newIntRange(field,
Integer.parseInt(query.getLowerTerm()),
Integer.parseInt(query.getUpperTerm()),
query.includesLower(), query.includesUpper());
} else {
return query;
}
}
}
Using it for a range search:
public void rangeSearch() {
IndexReader reader;
try {
reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
// QueryParser parser = new QueryParser(Version.LUCENE_36, "f0",
// analyzer);// problematic
NumericQueryParser parser = new NumericQueryParser(
Version.LUCENE_36, "f0", analyzer);
Query query = parser.parse("f3:[30 TO 60]");
TopDocs docs = searcher.search(query, 20);
ScoreDoc[] sds = docs.scoreDocs;
for (ScoreDoc sd : sds) {
Document doc = reader.document(sd.doc);
System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
+ doc.get("f3") + "\t" + doc.get("f4"));
}
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}
}
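The NumericQueryParser above hard-codes the field name "f3". Since Lucene does not record which fields are numeric, the application has to keep that knowledge itself. One possible way to generalize this, sketched here in plain Java, is a small registry that maps field names to numeric types. Everything in this block (NumericFieldRegistry, NumericType, parseBound) is a hypothetical helper, not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

class NumericFieldRegistry {

    enum NumericType { INT, FLOAT }

    private final Map<String, NumericType> types =
            new HashMap<String, NumericType>();

    // Records that the given field was indexed as the given numeric type.
    NumericFieldRegistry register(String field, NumericType type) {
        types.put(field, type);
        return this;
    }

    // Returns the registered type, or null for fields handled as text.
    NumericType typeOf(String field) {
        return types.get(field);
    }

    // Parses a range bound according to the registered type of the field.
    // Unregistered fields get the original text back unchanged.
    Object parseBound(String field, String text) {
        NumericType type = types.get(field);
        if (type == null) {
            return text;
        }
        switch (type) {
            case INT:
                return Integer.valueOf(text.trim());
            case FLOAT:
                return Float.valueOf(text.trim());
            default:
                throw new IllegalStateException("Unhandled type: " + type);
        }
    }

    public static void main(String[] args) {
        NumericFieldRegistry registry = new NumericFieldRegistry()
                .register("f3", NumericType.INT)
                .register("f4", NumericType.FLOAT);
        System.out.println(registry.parseBound("f3", "30")); // 30
        System.out.println(registry.parseBound("f4", "30")); // 30.0
        System.out.println(registry.parseBound("f1", "30")); // 30
    }
}
```

Inside getRangeQuery, a parser subclass could consult typeOf(field) and then call NumericRangeQuery.newIntRange or NumericRangeQuery.newFloatRange accordingly, falling back to the default behaviour for unregistered fields.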
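Notes to self:
1. Do not overthink some problems at the surface level. Take the sorting above: if an int was indexed, sorting as int certainly works, so there is no need to go on and try string or the other numeric types. It is not very meaningful!
2. To understand these problems thoroughly, start from the fundamentals: start from the source code!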
