Pitfalls of Using match_phrase Search in Elasticsearch


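  • Analyzed (tokenized) fields are usually searched with a match query, and phrases with a match_phrase query. However, match_phrase does not search an analyzed field without tokenizing the input, which is what business users often mean by an "exact match". The example below shows why; keep it in mind if you use Elasticsearch.
  • An index contains the following document: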
      {
          "id": "1",
          "fulltext": "亞馬遜卓越有限公司訴訟某某公司"
      }
    

    The fulltext field is analyzed with the ik analyzer when it is indexed. The analysis output for this text is:

      "tokens": [
          {
            "token": "亞馬遜",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
          },
          {
            "token": "亞",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_WORD",
            "position": 1
          },
          {
            "token": "馬",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 2
          },
          {
            "token": "遜",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
          },
          {
            "token": "卓越",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 4
          },
          {
            "token": "卓",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
          },
          {
            "token": "越有",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 6
          },
          {
            "token": "有限公司",
            "start_offset": 5,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
          },
          {
            "token": "有限",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 8
          },
          {
            "token": "公司",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 9
          },
          {
            "token": "訴訟",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 10
          },
          {
            "token": "訟",
            "start_offset": 10,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 11
          },
          {
            "token": "某某",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 12
          },
          {
            "token": "某公司",
            "start_offset": 12,
            "end_offset": 15,
            "type": "CN_WORD",
            "position": 13
          },
          {
            "token": "公司",
            "start_offset": 13,
            "end_offset": 15,
            "type": "CN_WORD",
            "position": 14
          }
        ]
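
Token listings like the one above can be obtained from the Elasticsearch _analyze API. The sketch below only builds the request; it does not send it. The analyzer name ik_max_word is an assumption: the article says only "ik", and the overlapping tokens resemble the output of that mode. The helper name build_analyze_request is hypothetical.

```python
import json


def build_analyze_request(text, analyzer="ik_max_word"):
    """Return (method, path, body) for an Elasticsearch _analyze call.

    The analyzer name is an assumption; use whatever analyzer the
    fulltext field is actually mapped with.
    """
    body = {"analyzer": analyzer, "text": text}
    # ensure_ascii=False keeps the Chinese text readable in the payload
    return "GET", "/_analyze", json.dumps(body, ensure_ascii=False)


method, path, body = build_analyze_request("亞馬遜卓越有限公司訴訟某某公司")
print(method, path)
print(body)
```

Running the same request against both the indexed text and each search phrase is what makes the mismatches discussed below visible.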
    

Given this token stream, a match_phrase query for “亞馬遜卓越” returns no hits,
because analyzing “亞馬遜卓越” produces:

    {
      "tokens": [
        {
          "token": "亞馬遜",
          "start_offset": 0,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "亞",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "馬",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "遜",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "卓越",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 4
        },
        {
          "token": "卓",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "越",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 6
        }
      ]
    }

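Compare this with the indexed tokens: the indexed text contains the token “越有” at position 6, whereas the query contains “越” at that position. The indexed token stream has no “越” token at all, so the phrase match fails and nothing is returned.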
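
As a quick diagnostic, the comparison above can be sketched by checking which query tokens never occur in the indexed token stream. The token lists are copied from the analysis output shown in this article; the function name missing_terms is hypothetical. A phrase query cannot match if any of its tokens is missing, although the converse does not hold, since positions also matter.

```python
# Tokens of the indexed text, in position order (copied from the output above)
INDEXED_TOKENS = [
    "亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有", "有限公司",
    "有限", "公司", "訴訟", "訟", "某某", "某公司", "公司",
]

# Tokens produced for the search text "亞馬遜卓越"
QUERY_TOKENS = ["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越"]


def missing_terms(query_tokens, indexed_tokens):
    """Return the query tokens that do not occur anywhere in the indexed tokens."""
    indexed = set(indexed_tokens)
    return [term for term in query_tokens if term not in indexed]


print(missing_terms(QUERY_TOKENS, INDEXED_TOKENS))  # ['越']
```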
Staying with the same example, search for “亞馬遜卓越有” instead, and this time the document is found. Analyzing “亞馬遜卓越有” gives:

     {
      "tokens": [
        {
          "token": "亞馬遜",
          "start_offset": 0,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "亞",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "馬",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "遜",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "卓越",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 4
        },
        {
          "token": "卓",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "越有",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 6
        }
      ]
    }

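Notice that the token “越有” now appears. The query tokens and their positions are identical to the first seven tokens of the indexed text, so match_phrase finds the document.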
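
The position check described above can be sketched as a small pure-Python function. This is a simplified model of phrase matching with slop = 0, not Elasticsearch's actual implementation. It assumes each token's position equals its index in the analyzer output, as in the listings in this article; the names index_positions and phrase_matches are hypothetical.

```python
DOC_TOKENS = [
    "亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有", "有限公司",
    "有限", "公司", "訴訟", "訟", "某某", "某公司", "公司",
]


def index_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    positions = {}
    for pos, term in enumerate(tokens):
        positions.setdefault(term, set()).add(pos)
    return positions


def phrase_matches(query_tokens, doc_tokens):
    """True if some base position aligns every query token with the document."""
    if not query_tokens:
        return False
    doc = index_positions(doc_tokens)
    # Try every occurrence of the first query token as the alignment base
    for base in doc.get(query_tokens[0], ()):
        if all(
            (base + offset) in doc.get(term, ())
            for offset, term in enumerate(query_tokens)
        ):
            return True
    return False


print(phrase_matches(["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越"], DOC_TOKENS))    # False
print(phrase_matches(["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有"], DOC_TOKENS))  # True
```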

A more extreme example: searching for “越有限公司” also returns the document, which may come as a surprise. Analyzing “越有限公司” gives:

    {
      "tokens": [
        {
          "token": "越有",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "有限公司",
          "start_offset": 1,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "有限",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "公司",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 3
        }
      ]
    }

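These tokens again line up exactly with the indexed tokens (from “越有” at position 6 onward), so the query matches. Note the token “有限公司” here. If the search text is changed to “越有限”, nothing is found, because “越有限” is analyzed as: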

    {
      "tokens": [
        {
          "token": "越有",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "有限",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 1
        }
      ]
    }

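The indexed text contains both “越有” and “有限”, but in the index the token “有限公司” sits between them (positions 6, 7 and 8), whereas the query expects them to be adjacent (positions 0 and 1). The positions do not line up, so nothing matches. Setting slop=1 on the match_phrase query lets “越有限” match. In other words, slop is how far the query tokens may be shifted in position to line up with the indexed tokens. match_phrase does not compare the query as a concatenated string: the query is still analyzed, and the resulting tokens must match by position.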
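
The effect of slop can be sketched in the same spirit. The function below is a simplified approximation, not Lucene's algorithm: it ignores the special handling of repeated terms. For every combination of term occurrences it computes how far each term is shifted relative to its query position, and accepts the match if the spread of shifts is at most slop. The helper names are hypothetical; the request body at the end uses the standard match_phrase syntax with the field name fulltext from the example document.

```python
from itertools import product

DOC_TOKENS = [
    "亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有", "有限公司",
    "有限", "公司", "訴訟", "訟", "某某", "某公司", "公司",
]


def sloppy_phrase_matches(query_tokens, doc_tokens, slop=0):
    """Simplified sloppy phrase check based on position shifts."""
    doc = {}
    for pos, term in enumerate(doc_tokens):
        doc.setdefault(term, []).append(pos)
    candidates = [doc.get(term, []) for term in query_tokens]
    if not query_tokens or any(not c for c in candidates):
        return False  # some query term never occurs in the document
    for combo in product(*candidates):
        # shift = document position minus expected position within the phrase
        shifts = [doc_pos - offset for offset, doc_pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


def match_phrase_body(field, text, slop=0):
    """Build a match_phrase request body with an explicit slop."""
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}


query = ["越有", "有限"]  # tokens produced for "越有限"
print(sloppy_phrase_matches(query, DOC_TOKENS, slop=0))  # False
print(sloppy_phrase_matches(query, DOC_TOKENS, slop=1))  # True
print(match_phrase_body("fulltext", "越有限", slop=1))
```

Enumerating every combination of occurrences is fine for a short illustration like this one, but it would be far too slow for real documents.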

