Comparing Real-Time Distributed Search Engines (senseidb, Solr, elasticsearch)


  • 1. Solr
    • 1.1 Features
    • 1.2 Pros & Cons
    • 1.3 References
  • 2. Senseidb
    • 2.1 Features
    • 2.2 Pros & Cons
    • 2.3 Why not use Solr directly?
    • 2.4 References
  • 3. elasticsearch
    • 3.1 Features
    • 3.2 Pros & Cons
    • 3.3 References
  • 4. Conclusion
  • 5. Other References

The comparison focuses mainly on the following aspects:

  1. Clustering
    • Scalability on Storage and Service
    • High Availability Considerations
  2. Features
  3. Flexibility

1. Solr

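Solr and Lucene clearly come from the same family, so Solr extends Lucene in many ways and integrates with it well. Teams in the industry that put stability first also seem to choose Solr to build their search stack.

SolrCloud is the ideal end-state solution, however, and SolrCloud is not yet production-ready.

The Solr architecture diagram:

[image: Solr architecture]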

1.1 Features

The Solr home page already lists and explains its features in detail, so they are not repeated here.

1.2 Pros & Cons

  • Pros
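    • Mature, proven solution
    • Rich documentation
    • Active community
    • plugin extension points
  • Cons
    • The system seems rather sprawling, and scaling the replication architecture appears to have some issues (?)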

1.3 References

  1. New SolrCloud Design
  2. Scaling Lucene and Solr
  3. Turbocharging Solr Index Replication with BitTorrent
    • a fun, clever idea: using BitTorrent as the replication mechanism *****
  4. Distributed Searching
  5. Carrot2-OSS framework for building search clustering engine
    • Solr search results clustering is based on the Carrot2 real-time document clustering engine.
  6. Clustering Component
    • Grouping (clustering) of the result set
  7. SolrCloud
  8. UniqueKey
  9. Solr Near Realtime Search
    • will be added in Solr4, currently available in trunk
  10. Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth
2. Senseidb

[image: architecture of sensei]

2.1 Features

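  1. Primarily addresses the problem of high-speed index updates;
    • Backed by zoie's "2-swapping-in-memory-index + 1-on-disk-index" index structure
  2. Requires a schema to be defined;
  3. Can ingest from multiple data sources through a Gateway;
  4. Data can be queried through BQL, the REST API, or bindings for various languages;
  5. Supports bulk index updates through hadoop MR jobs;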
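
The "2-swapping-in-memory-index + 1-on-disk-index" idea from item 1 can be pictured with a toy model. This is only a sketch of the general pattern, not zoie's actual API: the class name `RealtimeIndex`, the `flush_threshold` parameter, and the use of plain dicts as "indexes" are all assumptions made for illustration, and the flush runs synchronously here where a real system would presumably do it in the background.

```python
class RealtimeIndex:
    """Toy model: two in-memory indexes that swap roles, plus one disk index."""

    def __init__(self, flush_threshold=3):
        self.disk = {}       # stands in for the on-disk index
        self.active = {}     # in-memory index that receives new writes
        self.flushing = {}   # in-memory index currently being merged to disk
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def index(self, doc_id, doc):
        # New documents are searchable as soon as they land in memory.
        self.active[doc_id] = doc
        if len(self.active) >= self.flush_threshold:
            self.swap_and_flush()

    def swap_and_flush(self):
        # Swap: the full buffer becomes the flushing one, and a fresh
        # empty buffer takes over incoming writes.
        self.active, self.flushing = {}, self.active
        # Merge the flushing buffer into the disk index, then release it.
        self.disk.update(self.flushing)
        self.flushing = {}

    def get(self, doc_id):
        # Newest layer wins: active memory, then flushing memory, then disk.
        for layer in (self.active, self.flushing, self.disk):
            if doc_id in layer:
                return layer[doc_id]
        return None
```

Reads consult the memory layers before the disk layer, which is why an update is visible before it has been written out.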

2.2 Pros & Cons

  • Pros
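    • High-speed index updates
    • Multiple data source ingestion
    • Flexible access interfaces
    • Integration with the hadoop ecosystem
    • Excellent distributed scalability
  • Cons
    • static schema
    • versioning has to be maintained on the application side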

2.3 Why not use Solr directly?

An excerpt from an interview with John Wang:

Sensei leverages Lucene.

We weren’t able to leverage Solr because of the following requirements:

    * High update requirement, 10’s of thousands updates per second in to the system
    * Real distributed solution, current Solr’s distributed story has a SPOF at the master, and Solr Cloud is not yet completed.
    * Complex faceting support. Not just your standard terms based faceting. We needed to facet on social graph, dynamic time ranges and many other interesting faceting scenarios. Faceting behavior also needs to be highly customizable, which is not available via Solr.

2.4 References

  1. Introducing SenseiDB 1.0: an open-source, distributed, realtime, semi-structured database
  2. Sensei: distributed, realtime, semi-structured database

3. elasticsearch

Very new (currently at version 0.19RC3) and short on documentation. Still, ES really does have a lot worth applauding.

3.1 Features

  1. Schema-Free | Schemaless
  2. feed index engine with JSON formatted documents
  3. Query by Lucene based query string or JSON based query DSL over HTTP or Native Java;
  4. shards and replicas, LB and routings
  5. cloud integration
  6. multiple search types
  7. multiple data sources integration with River
  8. many more...
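
Item 3 mentions two ways of querying over HTTP: a Lucene-style query string and a JSON query DSL. The sketch below only builds the two request shapes and does not contact a server. The base URL, the index name `articles`, and the field names are assumptions made for illustration; the `_search` endpoint, the `q` parameter, and the `term` query are the parts taken from the ES HTTP API.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumed local node
INDEX = "articles"                  # hypothetical index name


def query_string_url(lucene_query):
    """Build a search URL that passes a Lucene query string via ?q=..."""
    return f"{BASE_URL}/{INDEX}/_search?" + urlencode({"q": lucene_query})


def term_query_body(field, value):
    """Build a JSON query DSL body containing a single term query."""
    return json.dumps({"query": {"term": {field: value}}})


print(query_string_url("title:lucene"))
print(term_query_body("user", "kimchy"))
```

The query string form is convenient for ad hoc lookups, while the JSON body is the form that composes into larger queries.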

3.2 Pros & Cons

  • Pros
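    • Many flexible, excellent features (see the features list)
    • The author has many years of experience in the search field
    • It has essentially all of senseidb's pros as well
  • Cons
    • Insufficient documentation
    • No large commercial organization backing it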

3.3 References

  1. quick intro to elastic search
  2. Flume, Hive and realtime indexing via ElasticSearch
  3. The Future of Compass & ElasticSearch
  4. Elastic Search: Distributed, Lucene-based Search Engine
  5. ElasticSearch at berlinbuzzwords 2010
  6. Elastic Search Vs. Apache Solr
    • This article seems to lean somewhat toward ES
  7. Your Data, Your Search
  8. Search Engine Time Machine
    • Combines transient state with persistent state; write-behind strategy
  9. NoSQL, Yes Search
    • Smooth integration of multiple data source types
  10. Geo Location and Search
    • Sorting by geo location is a novel feature
  11. Zero Conf Persistency
    • Local Gateway (Local Storage | Local FileSystem)
  12. The River
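    • A River in ES is close to a Gateway in Senseidb: a channel for a data source. Different data sources can get different River implementations, e.g. a MysqlBinlog-based River, an Hbase-based River, a RabbitMQ River, a CouchDB River, etc.
  13. Percolator
    • This Percolator is an ES concept; do not confuse it with Google's Percolator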
  14. Versioning
    • Optimistic Concurrency Control
  15. New Search Types
    • Introduces the count and scan search types; the latter can be used to scroll through large result sets
  16. Data Visualization with ElasticSearch and Protovis
  17. Distributed Diagram (Video)
  18. Road to a Distributed Search Engine (Video)*****
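
The "Versioning" entry above refers to optimistic concurrency control. As a rough illustration of that general technique (not of how ES implements it), a writer states which version it last saw, and the write is rejected if the document has changed since. The names `VersionedStore`, `VersionConflict`, and `expected_version` are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, doc)

    def get(self, doc_id):
        # Version 0 means "does not exist yet".
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, doc, expected_version=None):
        current, _ = self._docs.get(doc_id, (0, None))
        # No locks are taken: the check happens only at write time.
        if expected_version is not None and expected_version != current:
            raise VersionConflict(
                f"{doc_id}: expected v{expected_version}, found v{current}"
            )
        self._docs[doc_id] = (current + 1, doc)
        return current + 1
```

A caller that hits `VersionConflict` would typically re-read the document, reapply its change, and retry.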

4. Conclusion

  1. All are based on Lucene.
  2. All are distributed.
    • senseidb shards with multi-write?!
    • solr shards with master-slaves and slave pull strategy;
    • elasticsearch shards with primary-secondary push strategy;
  3. All partition at document granularity, and all require a unique key for each document (optional in some situations);
  4. Sensei excels at real-time index updates; Solr offers stability and wide adoption; Elasticsearch stands out for its flexibility and good ideas;
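
Point 3 (document-granularity partitioning keyed on a unique key) can be sketched as simple hash routing. This is a generic illustration under assumptions, not the routing function of any of the three engines: the choice of MD5, the `id` field name, and the helper names `shard_for` and `partition` are all hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(unique_key, num_shards):
    """Map a document's unique key to a shard number in [0, num_shards)."""
    digest = hashlib.md5(unique_key.encode("utf-8")).digest()
    # Use the first 4 bytes as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[:4], "big") % num_shards


def partition(docs, num_shards, key_field="id"):
    """Group whole documents by shard; a document is never split."""
    shards = defaultdict(list)
    for doc in docs:
        shards[shard_for(doc[key_field], num_shards)].append(doc)
    return dict(shards)
```

Because the shard is derived from the key alone, an update to an existing document lands on the same shard as the original, which is one reason a unique key is needed.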

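5. Other References

  1. Introduction to the Lily architecture
    • Implements data distribution with multiwrite + wal + message queue inside its own lily node, without fully leveraging the capabilities of the components/systems already in place (even though it is built on hbase tables); in part, this overcomplicates things.

Source: http://afoo.me/notes-on-senseidb-solr-and-elasticsearch.html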

