Some Pitfalls of match_phrase Search in Elasticsearch

Reposted article, 2018-01-03 12:21, ElasticSearch

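Analyzed (tokenized) fields are usually searched with a `match` query, and phrases with a `match_phrase` query. However, `match_phrase` does not search an analyzed field without tokenizing the query, which is what business users usually mean by "exact match". The example below shows why; take note if you use Elasticsearch.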

Suppose an index contains the following document:

  {
      "id": "1",
      "fulltext": "亞馬遜卓越有限公司訴訟某某公司"
  }

The `fulltext` field is analyzed and indexed with the ik analyzer, which produces the following tokens:

  "tokens": [
      {
        "token": "亞馬遜",
        "start_offset": 0,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 0
      },
      {
        "token": "亞",
        "start_offset": 0,
        "end_offset": 1,
        "type": "CN_WORD",
        "position": 1
      },
      {
        "token": "馬",
        "start_offset": 1,
        "end_offset": 2,
        "type": "CN_CHAR",
        "position": 2
      },
      {
        "token": "遜",
        "start_offset": 2,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 3
      },
      {
        "token": "卓越",
        "start_offset": 3,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 4
      },
      {
        "token": "卓",
        "start_offset": 3,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 5
      },
      {
        "token": "越有",
        "start_offset": 4,
        "end_offset": 6,
        "type": "CN_WORD",
        "position": 6
      },
      {
        "token": "有限公司",
        "start_offset": 5,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 7
      },
      {
        "token": "有限",
        "start_offset": 5,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 8
      },
      {
        "token": "公司",
        "start_offset": 7,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 9
      },
      {
        "token": "訴訟",
        "start_offset": 9,
        "end_offset": 11,
        "type": "CN_WORD",
        "position": 10
      },
      {
        "token": "訟",
        "start_offset": 10,
        "end_offset": 11,
        "type": "CN_WORD",
        "position": 11
      },
      {
        "token": "某某",
        "start_offset": 11,
        "end_offset": 13,
        "type": "CN_WORD",
        "position": 12
      },
      {
        "token": "某公司",
        "start_offset": 12,
        "end_offset": 15,
        "type": "CN_WORD",
        "position": 13
      },
      {
        "token": "公司",
        "start_offset": 13,
        "end_offset": 15,
        "type": "CN_WORD",
        "position": 14
      }
    ]
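
A token list like the one above can be obtained from Elasticsearch's `_analyze` API. The sketch below only builds the request bodies and does not contact a cluster. It assumes the `ik_max_word` analyzer from the IK plugin (the original only says "ik") and uses the typeless mapping format of Elasticsearch 7 and later; older versions nest `properties` under a type name. The helper names are hypothetical.

```python
import json

# Assumption: the article only says "ik"; ik_max_word is the IK analyzer
# that emits overlapping tokens like the ones listed above.
ANALYZER = "ik_max_word"

def build_mapping(field="fulltext", analyzer=ANALYZER):
    """Body for creating an index whose text field uses the given analyzer."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer}
            }
        }
    }

def build_analyze_request(text, analyzer=ANALYZER):
    """Body for the _analyze API, which returns tokens with their positions."""
    return {"analyzer": analyzer, "text": text}

# Print the bodies so they can be pasted into a REST client.
print(json.dumps(build_mapping(), ensure_ascii=False, indent=2))
print(json.dumps(build_analyze_request("亞馬遜卓越有限公司訴訟某某公司"),
                 ensure_ascii=False, indent=2))
```

Running the same `_analyze` request on a search phrase shows the tokens that a `match_phrase` query will try to line up against the indexed ones.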

Given these tokens, a `match_phrase` query for “亞馬遜卓越” matches nothing,
because analyzing “亞馬遜卓越” yields:

    {
      "tokens": [
        {
          "token": "亞馬遜",
          "start_offset": 0,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "亞",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "馬",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "遜",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "卓越",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 4
        },
        {
          "token": "卓",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "越",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 6
        }
      ]
    }

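Comparing this with the indexed tokens shows that the stored text contains the token “越有”, whereas the query contains “越” rather than “越有”. The phrase match therefore fails and nothing is returned.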
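
The failure can be checked by hand: a phrase query cannot match if any of its terms is absent from the indexed field. The following is a minimal sketch of that check, using the term lists copied from the two analysis results above; `missing_terms` is a hypothetical helper, not part of any Elasticsearch client.

```python
def missing_terms(indexed_terms, query_terms):
    """Return the query terms that never occur in the indexed field."""
    indexed = set(indexed_terms)
    return [term for term in query_terms if term not in indexed]

# Terms of the stored document, in the order the analyzer emitted them.
indexed = ["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有", "有限公司",
           "有限", "公司", "訴訟", "訟", "某某", "某公司", "公司"]

# Terms produced by analyzing the search phrase "亞馬遜卓越".
query = ["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越"]

print(missing_terms(indexed, query))  # ['越']
```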
Staying with the same example, search for “亞馬遜卓越有” instead and the document is found. Analyzing “亞馬遜卓越有” yields:

     {
      "tokens": [
        {
          "token": "亞馬遜",
          "start_offset": 0,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "亞",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "馬",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "遜",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "卓越",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 4
        },
        {
          "token": "卓",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "越有",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 6
        }
      ]
    }

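Notice that the token “越有” now appears. The query tokens are identical to the first seven tokens of the indexed text, terms and positions alike, so `match_phrase` finds the document.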
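
The position alignment can be reproduced with a small model. This is a simplified sketch of a zero-slop phrase check, not Lucene's actual implementation: every query term must occur in the document at the same relative position it has in the query. Positions are taken from the analysis results above, and the function names are hypothetical.

```python
def with_positions(terms):
    """Pair each term with its position, as the analyzer output does."""
    return [(term, pos) for pos, term in enumerate(terms)]

def phrase_matches(doc_tokens, query_tokens):
    """Simplified zero-slop phrase check on (term, position) pairs."""
    doc_positions = {}
    for term, pos in doc_tokens:
        doc_positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    # Try every place where the first query term occurs in the document.
    for start in doc_positions.get(first_term, ()):
        shift = start - first_pos
        if all(qpos + shift in doc_positions.get(term, ())
               for term, qpos in query_tokens):
            return True
    return False

doc = with_positions(["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有",
                      "有限公司", "有限", "公司", "訴訟", "訟", "某某",
                      "某公司", "公司"])

hit = with_positions(["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有"])
miss = with_positions(["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越"])

print(phrase_matches(doc, hit))   # True
print(phrase_matches(doc, miss))  # False
```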

A more extreme example: searching for “越有限公司” also finds the document, surprisingly. Analyzing “越有限公司” yields:

    {
      "tokens": [
        {
          "token": "越有",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "有限公司",
          "start_offset": 1,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "有限",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "公司",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 3
        }
      ]
    }

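These tokens again line up exactly with the indexed tokens from “越有” onward, so the document matches. Note the token “有限公司” here. If the search phrase is changed to “越有限”, nothing is found, because “越有限” is analyzed as: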

    {
      "tokens": [
        {
          "token": "越有",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "有限",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 1
        }
      ]
    }

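Both “越有” and “有限” exist in the index, but in the indexed text the token “有限公司” sits between them, so the positions do not line up and nothing matches. Setting `slop` to 1 on the `match_phrase` query lets “越有限” match. In other words, `slop` is the allowed difference in position, and `match_phrase` does not match by concatenating words as a string: the query is still analyzed, and its tokens are matched by position.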
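
The effect of `slop` can be sketched in the same spirit. The model below is a simplification, not Lucene's scorer: for each query term it computes how far the document position is shifted from the query position, and accepts the match when the spread of those shifts is at most `slop`. The field name `fulltext` comes from the example document; the function names are hypothetical.

```python
from itertools import product

def build_match_phrase(field, text, slop=0):
    """Query body for a match_phrase search with the given slop."""
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}

def sloppy_phrase_matches(doc_tokens, query_tokens, slop=0):
    """Simplified sloppy phrase check on (term, position) pairs."""
    doc_positions = {}
    for term, pos in doc_tokens:
        doc_positions.setdefault(term, []).append(pos)

    shift_choices = []
    for term, qpos in query_tokens:
        if term not in doc_positions:
            return False  # slop cannot make up for a missing term
        shift_choices.append([pos - qpos for pos in doc_positions[term]])

    # Accept if some choice of occurrences keeps all shifts within slop.
    return any(max(shifts) - min(shifts) <= slop
               for shifts in product(*shift_choices))

# Relevant slice of the indexed tokens, with their original positions.
doc = [("越有", 6), ("有限公司", 7), ("有限", 8), ("公司", 9)]
# Tokens produced by analyzing "越有限".
query = [("越有", 0), ("有限", 1)]

print(sloppy_phrase_matches(doc, query, slop=0))  # False
print(sloppy_phrase_matches(doc, query, slop=1))  # True
print(build_match_phrase("fulltext", "越有限", slop=1))
```

In the document “越有” and “有限” are two positions apart, while in the query they are one apart, so a slop of 1 is exactly enough to bridge the gap.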
