Learning Django: Quickly Building a Search Engine (haystack + whoosh + jieba)

Reposted article, 2019-02-18 20:07, filed under Django

A search engine for Django (haystack + whoosh + jieba)

Installing the software

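     haystack is an open-source search framework for Django. It supports several search engines, such as Solr, Elasticsearch and Whoosh.
     Whoosh is the search engine used here. It is a full-text search engine written in pure Python with no binary files, so it is compact and simple to configure, at the cost of somewhat lower performance.
     jieba is a Chinese word-segmentation library. The tokenizer bundled with Whoosh is designed for English and handles Chinese poorly, so jieba replaces Whoosh's tokenizer component.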

pip install django-haystack
pip install whoosh
pip install jieba

Create the project app

df_goods

Edit settings.py:

manas/settings.py

# Add the search engine
HAYSTACK_CONNECTIONS = {
    'default': {
        # Search engine backend to use
        'ENGINE': 'haystack.backends.whoosh_cn_backend.WhooshEngine',
        # Location where the index files are stored
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}

# Index newly added data automatically
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
# Number of search results shown per page
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 18
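
haystack also has to be registered as an installed app before these settings take effect. The fragment below is a minimal sketch of what that part of manas/settings.py might look like. The django.contrib entries are the usual project defaults and are an assumption here; only 'haystack' and 'df_goods' come from this article.

```python
# Sketch of the INSTALLED_APPS list in manas/settings.py.
# The django.contrib entries are assumed defaults, not taken from the article.
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'haystack',   # the search framework
    'df_goods',   # the app whose models are indexed
]
```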

Create the index

In the df_goods directory, create a file named search_indexes.py. The file name must not be changed.

# coding=utf-8
from haystack import indexes
from .models import GoodsInfo


class GoodsInfoIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)  # create the text field (the primary document field)


    def get_model(self):
        return GoodsInfo

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

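Note: every index must have exactly one field with document=True. This tells haystack and the search engine to use the content of that field as the primary field for searching. The other fields are auxiliary attributes that are convenient to access; they are not the data being searched.

By convention, the document=True field is named text. That name is used consistently across SearchIndex classes to avoid confusing the backend, so although you can rename it, doing so is not recommended.

haystack also provides use_template=True for the text field. It lets us use a data template to build the document that the search engine indexes. Put plainly, the template decides what is stored in the index, for example the gtitle field of GoodsInfo.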
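
To make the role of the data template concrete, here is a plain-Python stand-in for what it does: it collects selected attributes of one object into a single block of text, and that text is what gets indexed. This is only an illustration of the idea, not haystack's implementation (haystack renders a Django template). build_document and the gcontent field are hypothetical names.

```python
from types import SimpleNamespace


def build_document(obj, field_names):
    """Join the chosen attributes of obj into one block of text."""
    return "\n".join(str(getattr(obj, name)) for name in field_names)


# Stand-in for one GoodsInfo row; gcontent is a hypothetical field.
goods = SimpleNamespace(gtitle="草莓", gcontent="新鮮草莓 500g")

# The resulting text plays the role of the document=True field.
document = build_document(goods, ["gtitle", "gcontent"])
print(document)
```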

Create the data template

templates/search/indexes/df_goods/goodsinfo_text.txt

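Note: the data template path generally follows the pattern templates/search/indexes/<app name>/<model name>_text.txt, for example templates/search/indexes/yourapp/note_text.txt. To index the gtitle field mentioned above, the file would contain the single line {{ object.gtitle }}.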
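
The naming pattern can be written down as a small function. This is a sketch for checking your own file names; data_template_path is a hypothetical helper, not a haystack API, and the returned path is relative to the templates directory.

```python
def data_template_path(app_label, model_name, field_name="text"):
    """Return the data template path, relative to the templates directory.

    The model name is lowercased, which is why GoodsInfo maps to goodsinfo.
    """
    return "search/indexes/{}/{}_{}.txt".format(
        app_label, model_name.lower(), field_name
    )


print(data_template_path("df_goods", "GoodsInfo"))
```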

Configure the URLs

manas/urls.py

df_goods/urls.py

df_goods/views.py

from django.conf import settings
from django.core.paginator import InvalidPage, Paginator
from django.http import Http404
from haystack.views import SearchView


class MySearchView(SearchView):
    def build_page(self):
        # Override the default pagination.
        try:
            page_no = int(self.request.GET.get('page', 1))
        except (TypeError, ValueError):
            raise Http404("Not a valid number for page.")

        if page_no < 1:
            raise Http404("Pages should be 1 or greater.")

        # Paginate the search results themselves so that the template can
        # reach each model instance through result.object.
        paginator = Paginator(self.results,
                              settings.HAYSTACK_SEARCH_RESULTS_PER_PAGE)
        try:
            page = paginator.page(page_no)
        except InvalidPage:
            raise Http404("No such page!")
        return (paginator, page)

    def extra_context(self):
        context = super(MySearchView, self).extra_context()  # start from the default context
        context['title'] = '搜索'  # page title ("Search")
        return context

Create the HTML template that displays the search results

templates/search/search.html

By default, SearchView renders the template at templates/search/search.html.

        <ul class="goods_type_list clearfix">
            {% for result in page.object_list %}
                <li>
                    <a href="/goods/detail/{{ result.object.id }}/"><img src="/upload/{{ result.object.gpic }}"></a>
                    <h4><a href="/detail/{{ result.object.id }}/">{{ result.object.gtitle }}</a></h4>
                    <div class="operate">
                        <span class="prize">{{ result.object.gprice }}</span>
                        <span class="unit">{{ result.object.gprice }}/{{ result.object.gunit }}</span>
                        <a href="/cart/add{{result.object.id}}_1/" class="add_goods" title="Add to cart"></a>
                    </div>
                </li>
            {% endfor %}
        </ul>
        <div class="pagenation">
            {% if page.has_previous %}
                <a href="/search?q={{query}}&page={{page.previous_page_number}}">&lt;Previous</a>
            {% else %}
                <a href="/search?q={{ query }}">&lt;Previous</a>
            {% endif %}
            {% if page.number <= 5 %}   <!-- current page number is 5 or less -->
                {% for page_num in paginator.page_range %}
                    {%if forloop.counter <= 5 %}
                    <a href="/search?q={{query}}&page={{page_num}}"
                       {% if page.number == page_num %}
                       class="active"
                       {% endif %}
                    >{{ page_num }}</a>
                    {%endif%}
                {% endfor %}
            {% else %}
                {% if page.number|add:1 > paginator.num_pages %}
                    <a href="/search?q={{query}}&page={{page.number|add:-4}}">{{ page.number|add:-4}}</a>
                {% endif %}
                {% if page.number|add:2 > paginator.num_pages %}
                    <a href="/search?q={{query}}&page={{page.number|add:-3}}">{{ page.number|add:-3}}</a>
                {% endif %}
                <a href="/search?q={{query}}&page={{page.number|add:-2}}" >{{ page.number|add:-2}}</a>
                <a href="/search?q={{query}}&page={{page.number|add:-1}}">{{ page.number|add:-1}}</a>
                <a href="/search?q={{query}}&page={{page.number}}" class="active">{{ page.number }}</a>
                {% if page.number|add:1 <= paginator.num_pages %}
                    <a href="/search?q={{query}}&page={{page.number|add:1}}">{{ page.number|add:1}}</a>
                {% endif %}
                {% if page.number|add:2 <= paginator.num_pages %}
                    <a href="/search?q={{query}}&page={{page.number|add:2}}">{{ page.number|add:2}}</a>
                {% endif %}
            {% endif %}

            {% if page.has_next %}
                <a href="/search?q={{query}}&page={{page.next_page_number}}">Next&gt;</a>
            {% else %}
                <a href="/search?q={{query}}&page={{paginator.num_pages}}">Next&gt;</a>
            {% endif %}
        </div>
	</div>
{% endblock body %}

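Note: the template uses the variables query, page and paginator. query is the string that was searched for. page holds the results for the current page and exposes them through its object_list attribute. paginator is the paginator that produced page.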
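
The page-link conditions in the template can be easier to follow as a plain function. The sketch below mirrors them in Python so the behaviour can be checked outside a template. visible_page_numbers is a hypothetical helper name and is not part of haystack or Django.

```python
def visible_page_numbers(current, num_pages):
    """Return the page numbers the template links to for the current page."""
    if current <= 5:
        # Near the start: link to the first five pages (or fewer).
        return list(range(1, min(5, num_pages) + 1))

    pages = []
    # Near the end: pad on the left so that five links are still shown.
    if current + 1 > num_pages:
        pages.append(current - 4)
    if current + 2 > num_pages:
        pages.append(current - 3)
    pages.extend([current - 2, current - 1, current])
    # Only link forward to pages that exist.
    if current + 1 <= num_pages:
        pages.append(current + 1)
    if current + 2 <= num_pages:
        pages.append(current + 2)
    return pages


print(visible_page_numbers(7, 20))   # a page in the middle
print(visible_page_numbers(10, 10))  # the last page
```

For page 7 of 20 this gives [5, 6, 7, 8, 9], and for the last of 10 pages it gives [6, 7, 8, 9, 10], so five links are shown whenever at least five pages exist.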

Create the search index folder whoosh_index (already configured in settings.py)

Create the ChineseAnalyzer.py file
Save it in the backends folder of the haystack installation. On Linux the path looks like "/home/python/.virtualenvs/django_py2/lib/python2.7/site-packages/haystack/backends".
On Windows the path looks like C:\Users\Administrator\AppData\Roaming\Python\Python35\site-packages\haystack\backends.

import jieba
from whoosh.analysis import Tokenizer, Token


class ChineseTokenizer(Tokenizer):
    """Whoosh tokenizer that delegates word segmentation to jieba."""

    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        # The same Token object is reused and yielded for every word.
        t = Token(positions, chars, removestops=removestops, mode=mode,
                  **kwargs)
        # Full mode: emit every word jieba can find, including overlaps.
        seglist = jieba.cut(value, cut_all=True)
        for w in seglist:
            t.original = t.text = w
            t.boost = 1.0
            # str.find returns the first occurrence of w in value.
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t


def ChineseAnalyzer():
    # Factory used in place of Whoosh's default analyzer.
    return ChineseTokenizer()
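
One detail of the tokenizer is worth knowing: value.find(w) always returns the first occurrence of w, so a word that appears twice in the text gets the same offsets both times. The sketch below shows one way to avoid that by searching from a moving cursor. It assumes the tokens do not overlap and arrive in text order, which is the case for jieba's precise mode (cut_all=False) but not for full mode. token_offsets is a hypothetical helper, not part of jieba or Whoosh.

```python
def token_offsets(text, tokens):
    """Yield (token, start, end) for non-overlapping tokens in text order."""
    cursor = 0
    for tok in tokens:
        if not tok:
            continue  # skip empty strings
        start = text.find(tok, cursor)
        if start == -1:
            continue  # token does not occur at or after the cursor
        end = start + len(tok)
        yield tok, start, end
        cursor = end  # the next search begins after this token


# The word 大學 appears twice; each occurrence gets its own offsets.
text = "大學在大學"
for item in token_offsets(text, ["大學", "在", "大學"]):
    print(item)
```

jieba itself also offers jieba.tokenize, which yields (word, start, end) tuples and may be a simpler source of offsets.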

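Add the Chinese search backend file

The ENGINE setting in settings.py points to haystack.backends.whoosh_cn_backend, so a file named whoosh_cn_backend.py has to exist in the same backends folder. The usual way to create it is to copy whoosh_backend.py to whoosh_cn_backend.py, import ChineseAnalyzer in the copy (from .ChineseAnalyzer import ChineseAnalyzer), and use ChineseAnalyzer() in place of StemmingAnalyzer() where the text fields of the schema are built.

Comparison of the two files after the change: they should differ only in that import and in the analyzer being used.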

Build the index

python manage.py rebuild_index

(Optional, to update the index) python manage.py update_index

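Result page

Note: if the text being indexed is short, segmentation has little visible effect. Longer descriptions are recommended because they give jieba more to work with.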


Basic jieba usage

import jieba

list0 = jieba.cut('小明碩士畢業於中國科學院計算所，后在哈佛大學深造', cut_all=True)
print('Full mode', list(list0))
# ['小', '明', '碩士', '畢業', '於', '中國', '中國科學院', '科學', '科學院', '學院', '計算', '計算所', '', '', '后', '在', '哈佛', '哈佛大學', '大學', '深造']
list1 = jieba.cut('小明碩士畢業於中國科學院計算所，后在哈佛大學深造', cut_all=False)
print('Precise mode', list(list1))
# ['小明', '碩士', '畢業', '於', '中國科學院', '計算所', '，', '后', '在', '哈佛大學', '深造']
list2 = jieba.cut_for_search('小明碩士畢業於中國科學院計算所，后在哈佛大學深造')
print('Search engine mode', list(list2))
# ['小明', '碩士', '畢業', '於', '中國', '科學', '學院', '科學院', '中國科學院', '計算', '計算所', '，', '后', '在', '哈佛', '大學', '哈佛大學', '深造']
