A one-off requirement came up: I needed to index PDF (non-scanned) documents in Solr. The configuration I used is below.
schema.xml
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" required="true"/>
  <field name="size" type="slong" indexed="true" stored="true" required="true"/>
  <dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
In solrconfig.xml, the part that needs to be configured is:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">content</str>
    <str name="fmap.stream_size">size</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <!-- <str name="fmap.a">links</str> -->
    <!-- <str name="fmap.div">ignored_div</str> -->
  </lst>
</requestHandler>
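Parameter notes:
fmap.source=target : mapping rule; a field extracted from the PDF file (source) is mapped to the Solr field (target)
uprefix : if set, any extracted field that is not defined in the schema gets this value prepended to its field name
defaultField : if uprefix is not set and an extracted field cannot be found in the schema, the field named by defaultField is used instead
captureAttr : (true|false) capture attributes, i.e. index the attributes of the Tika XHTML elements
literal : custom metadata; literal.<fieldname>=<value> assigns a value to a field defined in the schema (for example literal.id=doc2)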
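To make the interplay of fmap, uprefix, defaultField and lowernames easier to follow, here is a small Python sketch of the field-name resolution described above. It is a simplified model, not Solr's actual code: the names `SCHEMA_FIELDS`, `in_schema` and `resolve_field` are made up for illustration, and the schema is hard-coded to match the schema.xml in this post.

```python
# Simplified model of how an extracted field name ends up as a Solr field name.
# Assumption: the schema contains only the fields from the schema.xml above.
SCHEMA_FIELDS = {"id", "content", "size"}
DYNAMIC_PREFIXES = ("ignored_",)  # stands in for the "ignored_*" dynamicField


def in_schema(name):
    """True if the name is an explicit field or matches the dynamic field."""
    return name in SCHEMA_FIELDS or name.startswith(DYNAMIC_PREFIXES)


def resolve_field(source, fmap=None, uprefix="", default_field="", lowernames=False):
    """Return the Solr field name an extracted field would be indexed under."""
    fmap = fmap or {}
    name = source
    if lowernames:
        # lowernames=true: lowercase, non-alphanumeric characters become "_"
        name = "".join(c.lower() if c.isalnum() else "_" for c in name)
    name = fmap.get(name, name)   # fmap.source=target
    if in_schema(name):
        return name
    if uprefix:                   # unknown field: prepend the prefix
        return uprefix + name
    if default_field:             # no uprefix: fall back to defaultField
        return default_field
    return name                   # nothing configured: name is left unchanged


if __name__ == "__main__":
    defaults = dict(
        fmap={"content": "content", "stream_size": "size"},
        uprefix="ignored_",
        lowernames=True,
    )
    print(resolve_field("stream_size", **defaults))   # size
    print(resolve_field("Content-Type", **defaults))  # ignored_content_type
```

With the defaults from solrconfig.xml, anything Tika extracts that is not id, content or size lands in an ignored_* field, which the schema neither indexes nor stores.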
Submit a document for indexing:
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=ignored_undefined" -F "commit=true" -F "file=@t2.pdf"
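The same request can be sketched in Python using only the standard library. Two things here are my assumptions rather than part of the curl command above: the helper names `build_extract_url` and `index_pdf` are hypothetical, and the PDF is sent as the raw request body with a Content-Type header instead of as a multipart form upload, with commit moved into the query string. Treat it as a starting point and check it against your Solr version.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id, commit=True, **params):
    """Build the /update/extract URL; literal.id supplies the uniqueKey value."""
    query = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    query.update(params)
    return base_url.rstrip("/") + "/update/extract?" + urlencode(query)


def index_pdf(path, doc_id, base_url="http://localhost:8983/solr"):
    """POST the PDF bytes to Solr and return the HTTP status code.

    Assumption: the handler accepts a raw body with Content-Type application/pdf.
    """
    url = build_extract_url(
        base_url, doc_id, captureAttr="true", defaultField="ignored_undefined"
    )
    with open(path, "rb") as fh:
        request = Request(url, data=fh.read(),
                          headers={"Content-Type": "application/pdf"})
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the URL; call index_pdf("t2.pdf", "doc2")
    # against a running Solr instance to actually index the file.
    print(build_extract_url("http://localhost:8983/solr", "doc2",
                            captureAttr="true", defaultField="ignored_undefined"))
```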
For details, see the reference documentation:
Note: Word documents are handled exactly the same way as PDFs.