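Elasticsearch (GEO) spatial search queries, Python version
1. Elasticsearch
There is no need to dwell on how powerful ES is: once you have installed the plugins and set up a cluster, you have a working search system.
Cluster tuning and query optimization are a separate topic, of course. This post records the ES spatial search features I used recently.
2. ES GEO spatial search
Spatial search, as the name suggests, lets you retrieve documents by spatial distance and positional relationship. Many spatial indexing algorithms and libraries are available to choose from.
ES has this kind of indexing built in. The details follow.
Step 1: Create the index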
def create_index():
    mapping = {
        "mappings": {
            "poi": {
                "_routing": {
                    "required": "true",
                    "path": "city_id"
                },
                "properties": {
                    "id": {"type": "integer"},
                    "geofence_type": {"type": "integer"},
                    "city_id": {"type": "integer"},
                    "city_name": {"type": "string", "index": "not_analyzed"},
                    "activity_id": {"type": "integer"},
                    "post_date": {"type": "date"},
                    "rank": {"type": "float"},
                    # Points and arbitrary shapes both use geo_shape;
                    # the concrete kind is set by "type" inside the data itself
                    "location_point": {"type": "geo_shape"},
                    "location_shape": {"type": "geo_shape"},
                    # Computing distances between points needs a geo_point field
                    "point": {"type": "geo_point"}
                }
            }
        }
    }
    # Passing a mapping is optional when creating an index.
    # `es` appears to be the author's own helper module (not shown here); with the
    # official elasticsearch-py client the equivalent call is
    # client.indices.create(index='mapapp', body=mapping)
    es.create_index(index='mapapp', body=mapping)
    # set_mapping = es_dsl.set_mapping('mapapp', 'poi', body=mapping)
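Here we create an index named mapapp, with the settings shown in mapping. The syntax used (a poi mapping type, string with not_analyzed, a path under _routing) belongs to older ES releases.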
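For readers on a newer cluster, the following is a minimal sketch of how the same mapping might look on ES 7 or later, where mapping types are gone and `string` + `not_analyzed` became `keyword`. The function name `build_mapping_es7` is hypothetical, and the assumption that routing is supplied per request (there is no `path` option any more) should be checked against your own version.

```python
def build_mapping_es7():
    """Return a typeless mapping (assumed ES 7+ syntax) for the mapapp index."""
    return {
        "mappings": {
            # Routing can still be required, but the value must now be passed
            # with each index/search request instead of being read from a field.
            "_routing": {"required": True},
            "properties": {
                "id": {"type": "integer"},
                "geofence_type": {"type": "integer"},
                "city_id": {"type": "integer"},
                # keyword replaces string + not_analyzed
                "city_name": {"type": "keyword"},
                "activity_id": {"type": "integer"},
                "post_date": {"type": "date"},
                "rank": {"type": "float"},
                "location_point": {"type": "geo_shape"},
                "location_shape": {"type": "geo_shape"},
                "point": {"type": "geo_point"},
            },
        }
    }


mapping_es7 = build_mapping_es7()
```

The field names and types are unchanged from the original mapping; only the outer structure differs.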
2. Bulk-insert data (bulk)
import json

import xlrd  # note: xlrd 2.0+ only reads .xls; .xlsx needs xlrd < 2.0


def bulk():
    # actions only needs to be an iterable; it does not have to be a list
    workbooks = xlrd.open_workbook('./geo_data.xlsx')
    table = workbooks.sheets()[1]
    colname = list()
    actions = list()
    for i in range(table.nrows):
        if i == 0:
            # the first row holds the column names
            colname = table.row_values(i)
            continue
        # columns 7-9 hold the geo fields as JSON strings
        geo_shape_point = json.loads(table.row_values(i)[7])
        geo_shape_shape = json.loads(table.row_values(i)[8])
        geo_point = json.loads(table.row_values(i)[9])
        raw_data = table.row_values(i)[:7]
        raw_data.extend([geo_shape_point, geo_shape_shape, geo_point])
        source = dict(zip(colname, raw_data))
        # GEODocument is the author's document class (not shown here)
        geo = GEODocument(**source)
        action = {
            "_index": "mapapp",
            "_type": "poi",
            "_id": table.row_values(i)[0],
            "_routing": geo.city_id,
            # "_source": source,
            "_source": geo.to_json(),
        }
        actions.append(action)
    es.bulk(index='mapapp', actions=actions, es=es_handler, max=25)
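Since the actions only need to be iterable, they can also be produced lazily. The sketch below assumes the rows have already been read into plain Python lists; `iter_actions` is a hypothetical helper, and it puts the raw dict into `_source` rather than going through the author's GEODocument class. A generator like this could, for example, be handed to the official `elasticsearch.helpers.bulk(client, actions)` helper.

```python
import json

GEO_COLUMNS = ("location_point", "location_shape", "point")


def iter_actions(header, rows, index="mapapp", doc_type="poi"):
    """Yield one bulk action per data row without building a full list."""
    for row in rows:
        source = dict(zip(header, row))
        # The geo columns arrive as JSON strings and must be decoded first
        for key in GEO_COLUMNS:
            source[key] = json.loads(source[key])
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": source["id"],
            "_routing": source["city_id"],
            "_source": source,
        }


header = ["id", "geofence_type", "city_id", "city_name", "activity_id",
          "post_date", "rank", "location_point", "location_shape", "point"]
rows = [[1, 1, 1, "北京", 100301, "2016/10/20", 100.30,
         '{"type":"point","coordinates":[55.75,37.616667]}',
         '{"type":"polygon","coordinates":[[[22,22],[4.87463,52.37254],[4.87875,52.36369],[22,22]]]}',
         '{"lat":55.75,"lon":37.616667}']]
actions = list(iter_actions(header, rows))
```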
Load the test data. The rows in geo_data look like this:
id  geofence_type  city_id  city_name  activity_id  post_date  rank  location_point  location_shape  point
1  1  1  北京  100301  2016/10/20  100.30  {"type":"point","coordinates":[55.75,37.616667]}  {"type":"polygon","coordinates":[[[22,22],[4.87463,52.37254],[4.87875,52.36369],[22,22]]]}  {"lat":55.75,"lon":37.616667}
2  1  1  北京  100302  2016/10/21  12.00  {"type":"point","coordinates":[55.75,37.616668]}  {"type":"polygon","coordinates":[[[0,0],[4.87463,52.37254],[4.87875,52.36369],[0,0]]]}  {"lat":48.8567,"lon":2.3508}
3  1  1  北京  100303  2016/10/22  3432.23  {"type":"point","coordinates":[55.75,37.616669]}  {"type":"polygon","coordinates":[[[4.8833,52.38617],[4.87463,52.37254],[4.87875,52.36369],[4.8833,52.38617]]]}  {"lat":32.75,"lon":37.616668}
4  1  1  北京  100304  2016/10/23  246.80  {"type":"point","coordinates":[52.4796, 2.3508]}  {"type":"polygon","coordinates":[[[4.8833,52.38617],[4.87463,52.37254],[4.87875,52.36369],[4.8833,52.38617]]]}  {"lat":11.56,"lon":37.616669}
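One detail worth checking in data like this: geo_shape coordinates follow the GeoJSON convention of [lon, lat], while a geo_point object names lat and lon explicitly. In the first sample row the first number of location_point equals the lat of point, so the order in your own data deserves a second look. The helper below is a hypothetical sketch that builds both representations from one lat/lon pair so they cannot drift apart.

```python
def make_location(lat, lon):
    """Build matching geo_shape and geo_point values from one lat/lon pair."""
    return {
        # GeoJSON order: longitude first, then latitude
        "location_point": {"type": "point", "coordinates": [lon, lat]},
        # geo_point object form: keys are explicit, so order does not matter
        "point": {"lat": lat, "lon": lon},
    }


loc = make_location(lat=55.75, lon=37.616667)
```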
3. GEO query: distance between two points
# Distance between points
# Results are sorted by distance ascending; with size=1 the single hit is the nearest one
def sort_by_distance():
    body = {
        "from": 0,
        "size": 1,
        "query": {
            "bool": {
                "must": [
                    {"term": {"geofence_type": 1}},
                    {"term": {"city_id": 1}}
                ]
            }
        },
        "sort": [{
            "_geo_distance": {
                "point": {"lat": 8.75, "lon": 37.616},
                "unit": "km",
                "order": "asc"
            }
        }]
    }
    for i in es.search(index='mapapp', doc_type='poi', body=body)['hits']['hits']:
        print(type(i), i)
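To get a feel for what the distance sort returns, the same ordering can be approximated locally. This is only a sketch: it uses the haversine formula with an assumed mean Earth radius of 6371 km, which is not necessarily the exact calculation ES performs, and `sort_by_distance_local` is a hypothetical name. The points are the four `point` values from the sample data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon pairs (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance_local(docs, lat, lon):
    """Return docs ordered by ascending distance from (lat, lon)."""
    return sorted(
        docs,
        key=lambda d: haversine_km(lat, lon, d["point"]["lat"], d["point"]["lon"]),
    )


SAMPLE = [
    {"id": 1, "point": {"lat": 55.75, "lon": 37.616667}},
    {"id": 2, "point": {"lat": 48.8567, "lon": 2.3508}},
    {"id": 3, "point": {"lat": 32.75, "lon": 37.616668}},
    {"id": 4, "point": {"lat": 11.56, "lon": 37.616669}},
]
ordered = sort_by_distance_local(SAMPLE, 8.75, 37.616)
nearest_id = ordered[0]["id"]
```

With the origin used in the query above, the nearest sample document is the one with id 4, which is what a size of 1 would return.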
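4. GEO query: bounding-box filter
Tip: ES filters are generally cached, so when optimizing queries it is common to pull frequently used clauses out into filters. GEO filters, however, are not cached by default, so that consideration does not apply here. To keep the distinction visible, the examples use post_filter, which filters the results after the query has run. The examples below follow the same pattern.
# Bounding-box filter: use a box to select points and shapes
# This example selects with a rectangle
# post_filter filters the query results afterwards; it is commonly used together with aggs
def bounding_filter():
    body = {
        "from": 0,
        "size": 1,
        "query": {
            "bool": {
                "must": [
                    {"term": {"geofence_type": 1}},
                    {"term": {"city_id": 1}}
                ]
            }
        },
        "post_filter": {
            "geo_shape": {
                "location_point": {
                    "shape": {
                        # envelope = [[upper-left], [lower-right]], each given as [lon, lat]
                        "type": "envelope",
                        "coordinates": [[52.4796, 2.3508], [48.8567, -1.903]]
                    },
                    "relation": "within"
                }
            }
        }
    }
    for i in es.search(index='mapapp', doc_type='poi', body=body)['hits']['hits']:
        print(type(i), i)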
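The `within` test for a point against an envelope boils down to two range checks. The sketch below illustrates that idea locally; `point_in_envelope` is a hypothetical helper, the envelope values are made up for illustration, and envelopes that cross the date line are deliberately not handled.

```python
def point_in_envelope(lon, lat, envelope):
    """Check whether a point lies inside an envelope.

    envelope is [[left, top], [right, bottom]] with each corner as [lon, lat].
    Assumes left <= right, i.e. the box does not cross the date line.
    """
    (left, top), (right, bottom) = envelope
    return left <= lon <= right and bottom <= lat <= top


# Illustrative envelope: longitudes 0..10, latitudes 0..10
box = [[0.0, 10.0], [10.0, 0.0]]
inside = point_in_envelope(5.0, 5.0, box)
outside = point_in_envelope(15.0, 5.0, box)
```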
5. GEO query: circle selection
# Circle selection: select with a circle instead of a box
# post_filter filters the query results afterwards; it is commonly used together with aggs
def circle_filter():
    body = {
        "from": 0,
        "size": 1,
        "query": {
            "bool": {
                "must": [
                    {"term": {"geofence_type": 1}},
                    {"term": {"city_id": 1}}
                ]
            }
        },
        "post_filter": {
            "geo_shape": {
                "location_point": {
                    "shape": {
                        "type": "circle",
                        "radius": "10000km",
                        "coordinates": [22, 45]
                    },
                    "relation": "within"
                }
            }
        }
    }
    for i in es.search(index='mapapp', doc_type='poi', body=body)['hits']['hits']:
        print(type(i), i)
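The geo_shape examples in this post share the same request skeleton and differ only in the field, the shape, and the relation. One possible way to factor that out is sketched below; `build_geo_shape_body` is a hypothetical helper, not part of any ES client, and it only builds the dict that would be passed as `body`.

```python
def build_geo_shape_body(field, shape, relation="within", must=None, size=1):
    """Build a search body with a geo_shape post_filter on the given field."""
    return {
        "from": 0,
        "size": size,
        "query": {"bool": {"must": list(must or [])}},
        "post_filter": {
            "geo_shape": {
                field: {"shape": shape, "relation": relation}
            }
        },
    }


common_terms = [{"term": {"geofence_type": 1}}, {"term": {"city_id": 1}}]
circle_body = build_geo_shape_body(
    "location_point",
    {"type": "circle", "radius": "10000km", "coordinates": [22, 45]},
    must=common_terms,
)
```

The resulting `circle_body` has the same content as the body in circle_filter above.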
6. GEO query: reverse selection
# Reverse selection: the point falls inside a shape, and that shape is returned
# post_filter filters the query results afterwards; it is commonly used together with aggs
# Also includes a regexp match
def intersects():
    body = {
        "from": 0,
        "size": 1,
        "query": {
            "bool": {
                "must": [
                    {"term": {"geofence_type": 1}},
                    {"regexp": {"city_name": u".*北京.*"}},
                    {"term": {"city_id": 1}}
                ]
            }
        },
        "post_filter": {
            "geo_shape": {
                "location_shape": {
                    "shape": {
                        "type": "point",
                        "coordinates": [22, 22]
                    },
                    "relation": "intersects"
                }
            }
        }
    }
    for i in es.search(index='mapapp', doc_type='poi', body=body)['hits']['hits']:
        print(type(i), i)
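The reverse selection asks which stored shapes contain a given point. As a rough local illustration of that idea, the sketch below uses the classic ray-casting test on a single polygon ring. It is not how ES evaluates `intersects`: points exactly on an edge or vertex may be classified either way here, and holes are ignored. The function names and the test square are hypothetical.

```python
def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) strictly inside the closed ring of [x, y] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through y?
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def shapes_containing(point, shapes):
    """Return the ids of shapes whose outer ring contains the point."""
    x, y = point
    return [sid for sid, ring in shapes.items() if point_in_ring(x, y, ring)]


# Illustrative shapes, not taken from the sample data
shapes = {
    "square": [[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]],
    "far_square": [[20, 20], [30, 20], [30, 30], [20, 30], [20, 20]],
}
hits = shapes_containing([5, 5], shapes)
```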
7. Finally, two spatial aggregation examples for reference
# Spatial aggregation
# Bucket documents by distance from a center point
def aggs_geo_distance():
    body = {
        "aggs": {
            "aggs_geopoint": {
                "geo_distance": {
                    "field": "point",
                    "origin": {"lat": 51.5072222, "lon": -0.1275},
                    "unit": "km",
                    "ranges": [
                        {"to": 1000},
                        {"from": 1000, "to": 3000},
                        {"from": 3000}
                    ]
                }
            }
        }
    }
    for i in es.search(index='mapapp', doc_type='poi', body=body)['aggregations']['aggs_geopoint']['buckets']:
        print(type(i), i)

# Spatial aggregation
# Grid aggregation based on the geohash algorithm
# Two aggregations in one request
def aggs_geohash_grid():
    body = {
        "aggs": {
            "new_york": {
                "geohash_grid": {
                    "field": "point",
                    "precision": 5
                }
            },
            "map_zoom": {
                "geo_bounds": {
                    "field": "point"
                }
            }
        }
    }
    for i in es.search(index='mapapp', doc_type='poi', body=body)['aggregations']['new_york']['buckets']:
        print(type(i), i)
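The grid aggregation groups points by the geohash cell they fall into, and `precision` is the length of the geohash string. To make that concrete, here is a small sketch of the standard geohash encoding in pure Python. `geohash_encode` is a hypothetical helper written for illustration, not an ES or client API, and the test coordinates are the commonly cited textbook example rather than a point from the sample data.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=5):
    """Encode a lat/lon pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            # Every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


cell = geohash_encode(57.64911, 10.40744, precision=5)
```

Points that produce the same string at a given precision land in the same bucket, so a lower precision yields fewer, larger cells.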