Solr Quick Start (Part 1)

Reposted article, 2017-06-08 11:26. Tags: solr, administration

Overview

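This document covers getting Solr up and running, ingesting a variety of data sources into multiple collections, and getting a feel for the Solr administrative and search interfaces.
Begin by unzipping the Solr release and changing your working directory to the subdirectory where Solr was installed. Note that the base directory name may vary with the version of Solr downloaded. For example, with a shell in UNIX, Cygwin, or MacOS: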

/:$ ls solr*
solr-6.4.2.zip
/:$ unzip -q solr-6.4.2.zip
/:$ cd solr-6.4.2

To launch Solr, run bin/solr start -e cloud -noprompt (the same command can be run from a Windows cmd prompt):

/solr-6.4.2:$ bin/solr start -e cloud -noprompt

Welcome to the SolrCloud example!
Starting up 2 Solr nodes for your example SolrCloud cluster.
...
Started Solr server on port 8983 (pid=8404). Happy searching!
...
Started Solr server on port 7574 (pid=8549). Happy searching!
...

SolrCloud example running, please visit http://localhost:8983/solr

/solr-6.4.2:$ _

You can see that Solr is running by loading the Solr Admin UI in your web browser: http://localhost:8983/solr/. This is the main starting point for administering Solr.

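Solr will now be running two "nodes", one on port 7574 and one on port 8983. One collection, gettingstarted, is created automatically; it is a two-shard collection in which each shard has two replicas. The Cloud tab in the Admin UI diagrams the collection nicely:
[Screenshot: the Cloud tab of the Admin UI showing the gettingstarted collection]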

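Indexing Data

Your Solr server is up and running, but it doesn't contain any data. The Solr install includes the bin/post tool to make it easy to get various types of documents into Solr from the start. We'll be using this tool for the indexing examples below.
You'll need a command shell to run these examples, rooted in the Solr install directory; the shell from which you launched Solr works just fine.
Note: Currently the bin/post tool does not have a comparable Windows script, but the underlying Java program that it invokes is available. See the Windows section of the Post Tool documentation for details.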

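Indexing a directory of "rich" files

Let's first index local "rich" files, including HTML, PDF, Microsoft Office formats (such as MS Word), plain text, and many other formats. bin/post can crawl a directory of files, optionally recursively, sending the raw content of each file into Solr for extraction and indexing. A Solr install includes a docs/ subdirectory, which makes a convenient, built-in set of (mostly) HTML files to start with.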

bin/post -c gettingstarted docs/

Here's what it'll look like:

/solr-6.4.2:$ bin/post -c gettingstarted docs/
java -classpath /solr-6.4.2/dist/solr-core-6.4.2.jar -Dauto=yes -Dc=gettingstarted -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool docs/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, max depth=999, delay=0s
Indexing directory docs (3 files, depth=0)
POSTing file index.html (text/html) to [base]/extract
POSTing file quickstart.html (text/html) to [base]/extract
POSTing file SYSTEM_REQUIREMENTS.html (text/html) to [base]/extract
Indexing directory docs/changes (1 files, depth=1)
POSTing file Changes.html (text/html) to [base]/extract
...
4329 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:01:16.252

The command line breaks down as follows:

-c gettingstarted: the name of the collection to index into
docs/: a relative path to the Solr install's docs/ directory

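You have now indexed thousands of documents into the gettingstarted collection in Solr and committed these changes. You can search for "solr" by loading the Admin UI Query tab, entering "solr" in the q param (replacing *:*, which matches all documents), and clicking "Execute Query". See the Searching section below for more information.

To index your own data, re-run the directory indexing command pointed at your own directory of documents. For example, on a Mac, try ~/Documents/ or ~/Desktop/ instead of docs/. You may want to start from a clean, empty system again rather than have your content in addition to the Solr docs/ directory; see the Cleanup section below for how to get back to a clean starting point.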

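Indexing Solr XML

Solr supports indexing structured content in a variety of incoming formats. The historically predominant format for getting structured content into Solr has been Solr XML. Many Solr indexers have been coded to process domain content into Solr XML output, generally HTTP POSTed directly to Solr's /update endpoint.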
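
As a rough illustration of what such an indexer produces, the following Python sketch serializes records into a Solr XML <add> message using only the standard library. The helper name to_solr_xml and the sample record are made up for this example; they are not part of Solr or of the bundled example data.

```python
import xml.etree.ElementTree as ET


def to_solr_xml(docs):
    """Serialize a list of dicts into a Solr XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes one <field> element per entry (multi-valued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical product record, invented for illustration.
xml_message = to_solr_xml(
    [{"id": "DEMO-1", "name": "Demo drive", "cat": ["electronics", "hard drive"]}]
)
print(xml_message)
```

The resulting string could be saved to an .xml file and passed to bin/post in the same way as the bundled example files.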

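The Solr install includes a handful of Solr XML formatted files with example data (mostly mocked tech product data). Note: this tech product data has a more domain-specific configuration, including schema and browse UI. The bin/solr script has built-in support for it: running bin/solr start -e techproducts not only starts Solr but also indexes this data (be sure to run bin/solr stop -all before trying it out). However, the example below assumes Solr was started with bin/solr start -e cloud, to stay consistent with all the examples on this page, so the collection used is "gettingstarted", not "techproducts".

Using bin/post, index the example Solr XML files in example/exampledocs/: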

bin/post -c gettingstarted example/exampledocs/*.xml

Here's what you'll see:

/solr-6.4.2:$ bin/post -c gettingstarted example/exampledocs/*.xml
java -classpath /solr-6.4.2/dist/solr-core-6.4.2.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/gb18030-example.xml ...
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:02.077

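...and now you can search for all sorts of things using the default Solr query syntax (a superset of the Lucene query syntax)...

Note: You can browse the indexed documents at http://localhost:8983/solr/gettingstarted/browse. The /browse UI lets you see how Solr's technical capabilities work in a familiar, though somewhat rough and prototypical, interactive HTML view. (The /browse view defaults to assuming that the gettingstarted schema and data are a catch-all mix of structured XML, JSON, and CSV example data plus unstructured rich documents. Your own data may not look ideal at first, though the /browse templates are customizable.)

Indexing JSON

Solr supports indexing JSON, either arbitrary structured JSON or "Solr JSON" (which is similar to Solr XML).

Solr includes a small sample Solr JSON file to illustrate this capability. Again using bin/post, index the sample JSON file: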

bin/post -c gettingstarted example/exampledocs/books.json

You'll see the following:

/solr-6.4.2:$ bin/post -c gettingstarted example/exampledocs/books.json
java -classpath /solr-6.4.2/dist/solr-core-6.4.2.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.493

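For more information on indexing Solr JSON, see the Solr Reference Guide section "Solr-Style JSON".
To flatten (and/or split) and index arbitrary structured JSON, a topic beyond this quick start guide, see "Transforming and Indexing Custom JSON data".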
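
For orientation, a plain batch of additions in Solr-style JSON can be written as a JSON array of documents. Below is a minimal Python sketch, assuming made-up field values and a hypothetical helper name (to_solr_json); it only builds the payload and does not contact Solr.

```python
import json


def to_solr_json(docs):
    """Serialize a list of dicts as a JSON array of Solr documents."""
    return json.dumps(docs, indent=2)


# Hypothetical book records, invented for illustration.
payload = to_solr_json([
    {"id": "DEMO-BOOK-1", "cat": ["book"], "name": "A Demo Title", "price": 7.99},
    {"id": "DEMO-BOOK-2", "cat": ["book"], "name": "Another Demo", "price": 5.5},
])
print(payload)
```

Saved to a .json file, such a payload can be indexed with bin/post just like books.json above.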

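Indexing CSV (Comma/Column Separated Values)

A great conduit of data into Solr is CSV, especially when the documents are homogeneous, all having the same set of fields. CSV can be conveniently exported from a spreadsheet such as Excel, or from a database such as MySQL. When getting started with Solr, it is often easiest to get your structured data into CSV format and then index that into Solr, rather than attempting a more sophisticated single-step operation.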
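
The "convert to CSV first" step can be sketched in a few lines of Python. The records and the helper name rows_to_csv below are assumptions made for illustration; in practice the rows would come from your spreadsheet or database export.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Render homogeneous records (same fields in every row) as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical records, invented for illustration.
records = [
    {"id": "DEMO-1", "name": "Demo book", "price": 7.99},
    {"id": "DEMO-2", "name": "Another, with comma", "price": 5.5},
]
csv_text = rows_to_csv(records, ["id", "name", "price"])
print(csv_text)
```

The csv module quotes the value that contains a comma, so each row still has exactly three columns. The header row carries the field names.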

Using bin/post, index the included example CSV file:

bin/post -c gettingstarted example/exampledocs/books.csv

You'll see:

/solr-6.4.2:$ bin/post -c gettingstarted example/exampledocs/books.csv
java -classpath /solr-6.4.2/dist/solr-core-6.4.2.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.csv
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.109

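For more information, see the Solr Reference Guide section "CSV Formatted Index Updates".

Other indexing techniques

 Import records from a database using the Data Import Handler (DIH).

 Use SolrJ from JVM-based languages, or other Solr clients, to programmatically create documents to send to Solr.

 Use the Admin UI Documents tab to paste in a document to be indexed, or select Document Builder from the Document Type dropdown to build a document one field at a time. Click the Submit Document button below the form to index your document.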

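Updating Data

You may notice that even if you index the content in this guide more than once, the results found are not duplicated. This is because the example schema.xml specifies a "uniqueKey" field called "id". Whenever you POST commands to Solr to add a document with the same uniqueKey value as an existing document, it automatically replaces the existing one for you. You can see this by looking at the values of numDocs and maxDoc in the core-specific Overview section of the Solr Admin UI.

numDocs represents the number of searchable documents in the index (and will be larger than the number of XML, JSON, or CSV files, since some files contain more than one document). The maxDoc value may be larger, because the maxDoc count includes logically deleted documents that have not yet been physically removed from the index. You can re-post the sample files over and over again and numDocs will never increase, because the new documents constantly replace the old.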
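
The relationship between the two counters can be pictured with a toy model. The Python class below is a deliberately simplified sketch for intuition only; it is not how Solr or Lucene store documents, and the class and attribute names are invented.

```python
class ToyIndex:
    """Toy model of uniqueKey overwrite semantics (not Solr's real storage)."""

    def __init__(self):
        self.live = {}     # id -> document that is currently searchable
        self.deleted = 0   # replaced documents that were only marked as deleted

    def add(self, doc):
        if doc["id"] in self.live:
            # The old version is logically deleted, not physically removed.
            self.deleted += 1
        self.live[doc["id"]] = doc

    @property
    def num_docs(self):
        return len(self.live)

    @property
    def max_doc(self):
        return len(self.live) + self.deleted


index = ToyIndex()
index.add({"id": "A", "name": "first version"})
index.add({"id": "B", "name": "another document"})
index.add({"id": "A", "name": "second version"})  # same id: replaces the first
print(index.num_docs, index.max_doc)
```

Re-adding a document with an existing id leaves num_docs unchanged while max_doc grows, which mirrors the behaviour described above.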

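Go ahead and edit any of the existing example data files, change some of the data, and re-run the SimplePostTool command. You'll see your changes reflected in subsequent searches.

Deleting Data

You can delete data by POSTing a delete command to the update URL and specifying either the value of the document's unique key field or a query that matches multiple documents (be careful with that one!). Since these commands are smaller, we specify them right on the command line rather than referencing a JSON or XML file.

Execute the following command to delete a specific document: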

bin/post -c gettingstarted -d "<delete><id>SP2514N</id></delete>"
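
The delete message itself is a small XML document. Here is a hedged Python sketch of building both variants mentioned above, by unique key and by query. The helper names are hypothetical, and the query string in the usage lines is only an illustration.

```python
from xml.sax.saxutils import escape


def delete_by_id(doc_id):
    """Build a delete message for one document, identified by its unique key."""
    return f"<delete><id>{escape(doc_id)}</id></delete>"


def delete_by_query(query):
    """Build a delete message for every document matching a query."""
    return f"<delete><query>{escape(query)}</query></delete>"


print(delete_by_id("SP2514N"))
# Illustrative query only; XML-special characters such as & are escaped.
print(delete_by_query("name:R&D"))
```

Either string can be passed to bin/post with the -d option, as in the command above.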

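Searching

Solr can be queried via REST clients, cURL, wget, Chrome POSTMAN, etc., as well as via the native clients available for many programming languages.
The Solr Admin UI includes a query builder interface; see the gettingstarted Query tab at http://localhost:8983/solr/#/gettingstarted/query.

If you click the Execute Query button without changing anything in the form, you'll get 10 documents in JSON format (*:* in the q param matches all documents):

[Screenshot: the Admin UI Query tab showing the results of the default *:* query]

The URL sent by the Admin UI to Solr is shown in light grey near the top right of the above screenshot; if you click on it, your browser will show you the raw response. To use cURL, give the same URL in quotes on the curl command line: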

curl "http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json"
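
When a script rather than a person assembles such URLs, it helps to let a library handle the encoding. Here is a minimal Python sketch, assuming the local example server from this guide; the helper name select_url is invented, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode


def select_url(collection, **params):
    """Build a /select URL for a collection on the local example server."""
    base = f"http://localhost:8983/solr/{collection}/select"
    # Keep '*' and ':' readable; everything else is percent-encoded as needed.
    return f"{base}?{urlencode(params, safe='*:')}"


url = select_url("gettingstarted", indent="on", q="*:*", wt="json")
print(url)
```

The printed URL is the same one used in the curl command above.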

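Basics

Search for a single term

To search for a term, give it as the q param value in the core-specific Solr Admin UI Query section, replacing *:* with the term you want to find. To search for "foundation":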

curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation"

You'll get:

/solr-6.4.2$ curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation"
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":527,
    "params":{
      "q":"foundation",
      "indent":"true",
      "wt":"json"}},
  "response":{"numFound":4156,"start":0,"maxScore":0.10203234,"docs":[
      {
        "id":"0553293354",
        "cat":["book"],
        "name":["Foundation"],
...

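The response indicates that there are 4,156 hits ("numFound":4156), of which the first 10 were returned, since by default start=0 and rows=10. You can specify these params to page through results, where start is the (zero-based) position of the first result to return, and rows is the page size.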
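
The arithmetic behind paging is simple enough to sketch. The helper names below are hypothetical; the functions only compute parameter values and do not query Solr.

```python
import math


def page_params(page, rows=10):
    """Return the start/rows params for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"start": (page - 1) * rows, "rows": rows}


def page_count(num_found, rows=10):
    """Number of pages needed to show num_found hits."""
    return math.ceil(num_found / rows)


print(page_params(1))    # first page
print(page_params(3))    # third page
print(page_count(4156))  # pages needed for the hits reported above
```

With the default page size of 10, the third page starts at position 20, and the 4,156 hits above span 416 pages.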

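To restrict the fields returned in the response, use the fl param, which takes a comma-separated list of field names. For example, to return only the id field: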

curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation&fl=id"

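q=foundation matches nearly all of the docs we've indexed, since most of the files under docs/ contain "The Apache Software Foundation". To restrict the search to a particular field, use the syntax "q=field:value". For example, to search for Foundation only in the name field: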

curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=name:Foundation"

The above request returns only one document in the response ("numFound":1):

...
  "response":{"numFound":1,"start":0,"maxScore":2.5902672,"docs":[
      {
        "id":"0553293354",
        "cat":["book"],
        "name":["Foundation"],
...
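
A client program would typically pull a few values out of such a response. The Python sketch below parses a response body that has been trimmed to the fields shown above; the function name summarize is invented for this example.

```python
import json

# Response body trimmed to the fields shown in the output above.
raw = """
{"response": {"numFound": 1, "start": 0, "maxScore": 2.5902672, "docs": [
  {"id": "0553293354", "cat": ["book"], "name": ["Foundation"]}]}}
"""


def summarize(body):
    """Return the hit count and the ids of the returned documents."""
    response = json.loads(body)["response"]
    return response["numFound"], [doc["id"] for doc in response["docs"]]


num_found, ids = summarize(raw)
print(num_found, ids)
```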

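Phrase search

To search for a multi-term phrase, enclose it in double quotes: q="multiple terms here". For example, to search for "CAS latency" (note that the space between terms must be converted to "+" in a URL; the Admin UI handles URL encoding for you automatically):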

curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=\"CAS+latency\""

The response:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":391,
    "params":{
      "q":"\"CAS latency\"",
      "indent":"true",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"maxScore":22.027056,"docs":[
      {
        "id":"TWINX2048-3200PRO",
        "name":["CORSAIR  XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],
        "manu":["Corsair Microsystems Inc."],
        "manu_id_s":"corsair",
        "cat":["electronics", "memory"],
        "features":["CAS latency 2,  2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
...
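
If you build such URLs in code, the quoting and the space-to-plus conversion can be left to the standard library. Here is a small Python sketch, with phrase_param as a hypothetical helper name:

```python
from urllib.parse import quote_plus


def phrase_param(phrase):
    """Wrap a phrase in double quotes and URL-encode it for use as the q value."""
    return quote_plus(f'"{phrase}"')


encoded = phrase_param("CAS latency")
print(encoded)
```

Here the double quotes come out as %22 and the space as +. This is the URL-encoded form of the quoted phrase used in the curl command above.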

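Combining searches

By default, when you search for multiple terms and/or phrases in a single query, Solr will only require that one of them is present in order for a document to match. Documents containing more terms will be sorted higher in the results list.

You can require that a term or phrase is present by prefixing it with a "+"; conversely, to disallow the presence of a term or phrase, prefix it with a "-".

To find documents that contain both terms "one" and "three", enter +one +three in the q param in the Admin UI Query tab. Because the "+" character has a reserved purpose in URLs (encoding the space character), you must URL encode it for curl as "%2B":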

curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=%2Bone+%2Bthree"

To search for documents that contain the term "two" but don't contain the term "one", enter +two -one in the q param in the Admin UI. Again, URL encode "+" as "%2B":

curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=%2Btwo+-one"
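
The same encoding can be produced programmatically. The sketch below assembles a query from required and prohibited terms and then URL-encodes it; boolean_query is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import quote_plus


def boolean_query(required=(), prohibited=()):
    """Prefix required terms with '+' and prohibited terms with '-'."""
    terms = [f"+{t}" for t in required] + [f"-{t}" for t in prohibited]
    return " ".join(terms)


q = boolean_query(required=["two"], prohibited=["one"])
encoded_q = quote_plus(q)
print(q)
print(encoded_q)
```

The encoded value matches the q parameter in the curl command above: each literal "+" becomes %2B, and the space between terms becomes "+".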

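In depth

For more Solr search options, see the Solr Reference Guide's Searching section.

Faceting
One of Solr's most popular features is faceting. Faceting allows the search results to be arranged into subsets (or buckets or categories), providing a count for each subset. There are several types of faceting: field values, numeric and date ranges, pivots (decision tree), and arbitrary query faceting.

Field facets
In addition to providing search results, a Solr query can return the number of documents that contain each unique value in the whole result set.

From the core-specific Admin UI Query tab, if you check the "facet" checkbox, you'll see a few facet-related options appear:
[Screenshot: facet options on the Admin UI Query tab]
To see facet counts from all documents (q=*:*), turn on faceting (facet=true) and specify the field to facet on via the facet.field param. If you only want facets, and no document contents, specify rows=0. The curl command below will return facet counts for the manu_id_s field: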

curl 'http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=*:*&rows=0'\
'&facet=true&facet.field=manu_id_s'

You'll see:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":201,
    "params":{
      "q":"*:*",
      "facet.field":"manu_id_s",
      "indent":"true",
      "rows":"0",
      "wt":"json",
      "facet":"true"}},
  "response":{"numFound":4374,"start":0,"maxScore":1.0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "manu_id_s":[
        "corsair",3,
        "belkin",2,
        "canon",2,
        "apple",1,
        "asus",1,
        "ati",1,
        "boa",1,
        "dell",1,
        "eu",1,
        "maxtor",1,
        "nor",1,
        "uk",1,
        "viewsonic",1,
        "samsung",0]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

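Note that the facet counts in the response above arrive as one flat list that alternates between value and count. Below is a short Python sketch of turning that list into a dictionary, using the manu_id_s data from the response; the function name facet_counts is invented.

```python
def facet_counts(flat):
    """Pair up an alternating [value, count, value, count, ...] list."""
    return dict(zip(flat[0::2], flat[1::2]))


# The manu_id_s list from the response above.
manu_id_s = [
    "corsair", 3, "belkin", 2, "canon", 2, "apple", 1, "asus", 1,
    "ati", 1, "boa", 1, "dell", 1, "eu", 1, "maxtor", 1,
    "nor", 1, "uk", 1, "viewsonic", 1, "samsung", 0,
]

counts = facet_counts(manu_id_s)
nonzero = {value: n for value, n in counts.items() if n > 0}
print(counts["corsair"], len(nonzero))
```
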
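Range facets

For numerics or dates, it's often desirable to partition the facet counts into ranges rather than discrete values. A prime example of numeric range faceting, using the example product data, is price. In the /browse UI, it looks like this:
[Screenshot: price range facets in the /browse UI]
The data for these price range facets can be seen in JSON format with this command: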

curl 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=on&rows=0'\
'&facet=true'\
'&facet.range=price'\
'&f.price.facet.range.start=0'\
'&f.price.facet.range.end=600'\
'&f.price.facet.range.gap=50'\
'&facet.range.other=after'

You'll get:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":248,
    "params":{
      "facet.range":"price",
      "q":"*:*",
      "f.price.facet.range.start":"0",
      "facet.range.other":"after",
      "indent":"on",
      "f.price.facet.range.gap":"50",
      "rows":"0",
      "wt":"json",
      "facet":"true",
      "f.price.facet.range.end":"600"}},
  "response":{"numFound":4374,"start":0,"maxScore":1.0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "price":{
        "counts":[
          "0.0",19,
          "50.0",1,
          "100.0",0,
          "150.0",2,
          "200.0",0,
          "250.0",1,
          "300.0",1,
          "350.0",2,
          "400.0",0,
          "450.0",1,
          "500.0",0,
          "550.0",0],
        "gap":50.0,
        "after":2,
        "start":0.0,
        "end":600.0}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

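The range counts use the same alternating layout, with each lower bound given as a string. The following Python sketch expands them into (lower, upper, count) buckets using the gap from the response above; range_buckets is a hypothetical helper.

```python
def range_buckets(facet):
    """Expand a facet_ranges entry into (lower, upper, count) tuples."""
    gap = facet["gap"]
    flat = facet["counts"]
    return [
        (float(lower), float(lower) + gap, count)
        for lower, count in zip(flat[0::2], flat[1::2])
    ]


# The "price" entry from the response above.
price = {
    "counts": [
        "0.0", 19, "50.0", 1, "100.0", 0, "150.0", 2, "200.0", 0, "250.0", 1,
        "300.0", 1, "350.0", 2, "400.0", 0, "450.0", 1, "500.0", 0, "550.0", 0,
    ],
    "gap": 50.0,
    "after": 2,
    "start": 0.0,
    "end": 600.0,
}

buckets = range_buckets(price)
in_range = sum(count for _, _, count in buckets)
print(buckets[0], in_range, price["after"])
```

The "after" value counts documents priced beyond the end of the last bucket, so it is reported separately from the bucket list.
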
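Pivot facets

Another faceting type is pivot facets, also known as "decision trees", which allow two or more fields to be nested for all the various possible combinations. Using the example technical product data, pivot facets can be used to see how many of the products in the "book" category (the cat field) are in stock or not in stock. Here's how to get at the raw data for this scenario: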

curl 'http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0&wt=json&indent=on'\
'&facet=on&facet.pivot=cat,inStock'

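This results in the following response (trimmed to just the book category output), which says that of the 14 items in the "book" category, 12 are in stock and 2 are not in stock: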

...
"facet_pivot":{
  "cat,inStock":[{
      "field":"cat",
      "value":"book",
      "count":14,
      "pivot":[{
          "field":"inStock",
          "value":true,
          "count":12},
        {
          "field":"inStock",
          "value":false,
          "count":2}]},
...
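
Because pivot results are nested, a small recursive walk is a convenient way to list every combination. The Python sketch below flattens the "book" entry shown above; flatten_pivot is an invented helper name.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) for every leaf of a facet_pivot list."""
    for entry in entries:
        here = path + ((entry["field"], entry["value"]),)
        children = entry.get("pivot")
        if children:
            yield from flatten_pivot(children, here)
        else:
            yield here, entry["count"]


# The "book" entry from the response above.
book_pivot = [{
    "field": "cat", "value": "book", "count": 14,
    "pivot": [
        {"field": "inStock", "value": True, "count": 12},
        {"field": "inStock", "value": False, "count": 2},
    ],
}]

leaves = list(flatten_pivot(book_pivot))
for path, count in leaves:
    print(" / ".join(f"{field}={value}" for field, value in path), count)
```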

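Cleanup

As you work through this guide, you may want to stop Solr and reset the environment back to the starting point. The following command line will stop Solr and remove the directories for each of the two nodes that the start script created: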

bin/solr stop -all ; rm -Rf example/cloud/
