MongoDB Query Notes


MongoDB is a NoSQL database.

Advantages:

   1. Data is stored as JSON-style documents (document-oriented storage)

   2. Fast analytical queries through the aggregation framework

   3. Efficient storage of large binary objects

Disadvantages:

   1. No support for transactions

   2. Data files can take up a lot of disk space

Case Studies

Example 1: single-field query (find all cars whose manufacturer field is "Porsche")

{
    "layout" : "rear mid-engine rear-wheel-drive layout",
    "name" : "Porsche Boxster",
    "productionYears" : [ ],
    "modelYears" : [ ],
    "bodyStyle" : "roadster",
    "assembly" : [
        "Finland",
        "Germany",
        "Stuttgart",
        "Uusikaupunki"
    ],
    "class" : "sports car",
    "manufacturer" : "Porsche"
}
def porsche_query():
    # {'field name': 'field value'}
    query = {'manufacturer': 'Porsche'}
    return query
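The function above only builds the filter document. What such an equality filter does can be sketched in plain Python (the sample cars below are invented for illustration; this mimics, not replaces, MongoDB's matcher):

```python
def matches(doc, query):
    # A document matches when every key in the query equals the doc's value
    return all(doc.get(k) == v for k, v in query.items())

cars = [
    {'name': 'Porsche Boxster', 'manufacturer': 'Porsche'},
    {'name': 'Civic', 'manufacturer': 'Honda'},
]
query = {'manufacturer': 'Porsche'}
result = [c['name'] for c in cars if matches(c, query)]
print(result)  # ['Porsche Boxster']
```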

Example 2: range query (find all cities founded in the twenty-first century; note the $gte and $lt operators)

{
 'areaCode': ['916'],
 'areaLand': 109271000.0,
 'country': 'United States',
 'elevation': 13.716,
 'foundingDate': datetime.datetime(2000, 7, 1, 0, 0),
 'governmentType': ['Council\u2013manager government'],
 'homepage': ['http://elkgrovecity.org/'],
 'isPartOf': ['California', u'Sacramento County California'],
 'lat': 38.4383,
 'leaderTitle': 'Chief Of Police',
 'lon': -121.382,
 'motto': 'Proud Heritage Bright Future',
 'name': 'City of Elk Grove',
 'population': 155937,
 'postalCode': '95624 95757 95758 95759',
 'timeZone': ['Pacific Time Zone'],
 'utcOffset': ['-7', '-8']
}
from datetime import datetime

def range_query():
    # Use $gte/$lt to bound the query to a range of values
    query = {'foundingDate': {'$gte': datetime(2001, 1, 1),
                              '$lt': datetime(2100, 12, 31)}}
    return query
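A small plain-Python illustration of how $gte/$lt bound a value ($gte is inclusive, $lt exclusive; the dates are invented):

```python
from datetime import datetime

def in_range(value, cond):
    # Mimic {'$gte': lo, '$lt': hi}: lo <= value < hi
    return cond['$gte'] <= value < cond['$lt']

cond = {'$gte': datetime(2001, 1, 1), '$lt': datetime(2100, 12, 31)}
print(in_range(datetime(2000, 7, 1), cond))  # False: founded before 2001
print(in_range(datetime(2015, 3, 2), cond))  # True
```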

Example 3: find all cars assembled in Germany, England, or Japan

{
    "layout" : "rear mid-engine rear-wheel-drive layout",
    "name" : "Porsche Boxster",
    "productionYears" : [ ],
    "modelYears" : [ ],
    "bodyStyle" : "roadster",
    "assembly" : [
        "Finland",
        "Germany",
        "Stuttgart",
        "Uusikaupunki"
    ],
    "class" : "sports car",
    "manufacturer" : "Porsche"
}
def in_query():
    # Use $in to match documents whose field equals any value in the list
    query = {'assembly': {'$in': ['Germany', 'England', 'Japan']}}
    return query
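One detail worth noting: when the queried field is itself an array (as assembly is in the sample document), $in matches if any element of the array appears in the given list. A plain-Python sketch of that semantics:

```python
def in_matches(field_value, choices):
    # For array fields, $in succeeds when the array and the
    # choice list share at least one element
    if isinstance(field_value, list):
        return any(v in choices for v in field_value)
    return field_value in choices

assembly = ['Finland', 'Germany', 'Stuttgart', 'Uusikaupunki']
print(in_matches(assembly, ['Germany', 'England', 'Japan']))  # True
print(in_matches('Japan', ['Germany', 'England', 'Japan']))   # True
```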

Example 4: dot notation (find all cars with a width greater than 2.5)

{
    "_id" : ObjectId("52fd438b5a98d65507d288cf"),
    "engine" : "Crawler-transporter__1",
    "dimensions" : {
        "width" : 34.7472,
        "length" : 39.9288,
        "weight" : 2721000
    },
    "transmission" : "16 traction motors powered by four  generators",
    "modelYears" : [ ],
    "productionYears" : [ ],
    "manufacturer" : "Marion Power Shovel Company",
    "name" : "Crawler-transporter"
}
def dot_query():
    # Use dot notation to reach a field nested inside a subdocument
    query = {'dimensions.width': {'$gt': 2.5}}
    return query
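Dot notation simply walks one nesting level per dot. A plain-Python sketch of the lookup, using dimensions from the sample document:

```python
def get_path(doc, dotted):
    # Resolve 'dimensions.width' by descending one level per dot
    value = doc
    for part in dotted.split('.'):
        value = value[part]
    return value

doc = {'dimensions': {'width': 34.7472, 'length': 39.9288}}
print(get_path(doc, 'dimensions.width') > 2.5)  # True
```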

Aggregation framework queries

Example 5: find the most commonly used application for posting tweets

Approach: group with $group, create a count field, and use $sum to tally the number of documents in each group.

Sample document

{
    "_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
    "text" : "First week of school is over :P",
    "in_reply_to_status_id" : null,
    "retweet_count" : null,
    "contributors" : null,
    "created_at" : "Thu Sep 02 18:11:25 +0000 2010",
    "geo" : null,
    "source" : "web",
    "coordinates" : null,
    "in_reply_to_screen_name" : null,
    "truncated" : false,
    "entities" : {
        "user_mentions" : [ ],
        "urls" : [ ],
        "hashtags" : [ ]
    },
    "retweeted" : false,
    "place" : null,
    "user" : {
        "friends_count" : 145,
        "profile_sidebar_fill_color" : "E5507E",
        "location" : "Ireland :)",
        "verified" : false,
        "follow_request_sent" : null,
        "favourites_count" : 1,
        "profile_sidebar_border_color" : "CC3366",
        "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
        "geo_enabled" : false,
        "created_at" : "Sun May 03 19:51:04 +0000 2009",
        "description" : "",
        "time_zone" : null,
        "url" : null,
        "screen_name" : "Catherinemull",
        "notifications" : null,
        "profile_background_color" : "FF6699",
        "listed_count" : 77,
        "lang" : "en",
        "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
        "statuses_count" : 2475,
        "following" : null,
        "profile_text_color" : "362720",
        "protected" : false,
        "show_all_inline_media" : false,
        "profile_background_tile" : true,
        "name" : "Catherine Mullane",
        "contributors_enabled" : false,
        "profile_link_color" : "B40B43",
        "followers_count" : 169,
        "id" : 37486277,
        "profile_use_background_image" : true,
        "utc_offset" : null
    },
    "favorited" : false,
    "in_reply_to_user_id" : null,
    "id" : NumberLong("22819398300")
}
def make_pipeline():
    # 1. Group by source and tally each group's size into count
    # 2. Sort by count in descending order
    pipeline = [
        {'$group': {'_id': '$source', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}}
    ]
    return pipeline
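Conceptually this $group/$sum stage is a counting pass, and $sort orders the groups. The same tally can be sketched in plain Python (the tweet sources below are invented):

```python
from collections import Counter

tweets = [{'source': 'web'}, {'source': 'web'}, {'source': 'txt'}]
# $group by source + $sum: 1  ->  count per source
counts = Counter(t['source'] for t in tweets)
# $sort by count descending
ranked = counts.most_common()
print(ranked)  # [('web', 2), ('txt', 1)]
```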

Example 6: among users in the Brasilia time zone who have tweeted at least 100 times, find the user with the most followers

def make_pipeline():
    # 1. Use $match to filter the documents
    # 2. Use $project (projection) to shape the returned fields
    # 3. Use $sort to order by followers, descending
    # 4. Use $limit to keep one document: the first is the answer
    pipeline = [
        {'$match': {'user.time_zone': 'Brasilia',
                    'user.statuses_count': {'$gte': 100}}},
        {'$project': {'followers': '$user.followers_count',
                      'screen_name': '$user.screen_name',
                      'tweets': '$user.statuses_count'}},
        {'$sort': {'followers': -1}},
        {'$limit': 1}
    ]
    return pipeline

Example 7: find which region of India contains the most cities

$match filters documents by a condition, like WHERE in SQL

$unwind splits an array field: each element of the array is emitted paired with its own copy of the document

$sort orders the documents

Sample document

{
    "_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
    "elevation" : 1855,
    "name" : "Kud",
    "country" : "India",
    "lon" : 75.28,
    "lat" : 33.08,
    "isPartOf" : [
        "Jammu and Kashmir",
        "Udhampur district"
    ],
    "timeZone" : [
        "Indian Standard Time"
    ],
    "population" : 1140
}
def make_pipeline():
    # 1. Use $match to select the country
    # 2. Use $unwind to split the array-valued field
    # 3. Use $group to group the split values and tally a count
    # 4. Use $sort to order by count, descending
    pipeline = [
        {'$match': {'country': 'India'}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': '$isPartOf', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}}
    ]
    return pipeline
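$unwind is what makes grouping on array elements possible. A plain-Python sketch (city records invented, shaped like the sample document):

```python
from collections import Counter

cities = [
    {'name': 'Kud',   'isPartOf': ['Jammu and Kashmir', 'Udhampur district']},
    {'name': 'Jammu', 'isPartOf': ['Jammu and Kashmir']},
]
# $unwind: one (city, region) pair per array element
unwound = [(c['name'], region) for c in cities for region in c['isPartOf']]
# $group + $sum: count cities per region, then take the largest
counts = Counter(region for _, region in unwound)
print(counts.most_common(1))  # [('Jammu and Kashmir', 2)]
```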

Example 8: collect each user's tweet texts and their count, keeping only the top five users by tweet count.

$push accumulates the values into an array (duplicates allowed)

$addToSet accumulates the values into an array of distinct values (duplicates removed)

def make_pipeline():
    # 1. Use $group to group tweets by screen_name
    # 2. Use $push to collect every text value into tweet_texts
    # 3. Use $sum to tally the tweet count
    # 4. Use $sort to order by count, descending
    # 5. Use $limit to keep the top 5 users
    pipeline = [
        {'$group': {'_id': '$user.screen_name',
                    'tweet_texts': {'$push': '$text'},
                    'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': 5}
    ]
    return pipeline
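The difference between $push and $addToSet can be shown in a few lines of plain Python (the texts are invented):

```python
texts = ['hi', 'hi', 'bye']

# $push keeps every accumulated value, duplicates included
pushed = []
for t in texts:
    pushed.append(t)

# $addToSet keeps each distinct value only once
added = []
for t in texts:
    if t not in added:
        added.append(t)

print(pushed)  # ['hi', 'hi', 'bye']
print(added)   # ['hi', 'bye']
```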

Example 9: find the average population of India's regions

def make_pipeline():
    # 1. Use $match to select India
    # 2. Use $unwind to split isPartOf
    # 3. Use $group to group by isPartOf and compute each region's
    #    average city population with $avg
    # 4. Use a second $group to average the regional averages into one value
    pipeline = [
        {'$match': {'country': 'India'}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': '$isPartOf', 'avgp': {'$avg': '$population'}}},
        {'$group': {'_id': 'India Regional City Population avg',
                    'avg': {'$avg': '$avgp'}}}
    ]
    return pipeline
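Note that the second stage averages the regional averages, which is not the same as averaging over all cities directly (small regions weigh as much as large ones). A plain-Python sketch of the two stages, with invented numbers:

```python
# Stage 1 output: average city population per region (invented values)
per_region = {'Region A': 1000.0, 'Region B': 3000.0}
# Stage 2: average of the regional averages, a single summary number
overall = sum(per_region.values()) / len(per_region)
print(overall)  # 2000.0
```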

Exercises

Problem Set 03

1. Process only the fields that appear as keys in the FIELDS dictionary, and return a list of dictionaries of cleaned values

Requirements:

  1. Rename the dictionary keys according to the mapping in the FIELDS dictionary

  2. Strip the parenthesized qualifier from "rdf-schema#label", e.g. "(spider)"

  3. If "name" is "NULL" or contains non-alphanumeric characters, set it to the same value as "label"

  4. If a field's value is "NULL", convert it to None

  5. If "synonym" has a value, convert it to an array (list) by removing the "{}" characters and splitting the string on "|". Any remaining cleanup, such as stripping a leading "*", is up to you. A single synonym should still be wrapped in a list.

  6. Strip leading and trailing whitespace from all fields, if present

  7. The output should be structured as follows

[ { 'label': 'Argiope',
    'uri': 'http://dbpedia.org/resource/Argiope_(spider)',
    'description': 'The genus Argiope includes rather large and spectacular spiders that often ...',
    'name': 'Argiope',
    'synonym': ["One", "Two"],
    'classification': {
                      'family': 'Orb-weaver spider',
                      'class': 'Arachnid',
                      'phylum': 'Arthropod',
                      'order': 'Spider',
                      'kingdom': 'Animal',
                      'genus': None
                      }
  },
  { 'label': ... , }, ...
]
import codecs
import csv
import json
import pprint
import re

DATAFILE = 'arachnid.csv'
FIELDS ={'rdf-schema#label': 'label',
         'URI': 'uri',
         'rdf-schema#comment': 'description',
         'synonym': 'synonym',
         'name': 'name',
         'family_label': 'family',
         'class_label': 'class',
         'phylum_label': 'phylum',
         'order_label': 'order',
         'kingdom_label': 'kingdom',
         'genus_label': 'genus'}


def process_file(filename, fields):
    # The keys of the FIELDS dictionary are the CSV columns to process
    process_fields = fields.keys()
    # Accumulates the cleaned dictionaries
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        # Skip the three extra header lines at the top of the file
        for i in range(3):
            l = reader.next()
        for line in reader:
            # Top-level dictionary for this row
            res = {}
            # Subdictionary stored under the 'classification' key
            res['classification'] = {}
            # Walk the keys of the FIELDS dictionary
            for field in process_fields:
                # Requirement 6: strip surrounding whitespace
                tmp_val = line[field].strip()
                # The new key is the FIELDS dictionary's value
                new_key = FIELDS[field]
                # Requirement 4: convert "NULL" to None
                if tmp_val == 'NULL':
                    tmp_val = None
                # Requirement 2: drop the parenthesized qualifier from the label
                if field == 'rdf-schema#label':
                    tmp_val = re.sub(r'\(.*\)', '', tmp_val).strip()
                # Requirement 3: fall back to the label when name is NULL
                if field == 'name' and line[field] == 'NULL':
                    tmp_val = line['rdf-schema#label'].strip()
                # Requirement 5: parse the synonym list
                if field == 'synonym' and tmp_val:
                    tmp_val = parse_array(line[field])
                # These keys belong in the classification subdictionary
                if new_key in ['kingdom', 'family', 'order', 'phylum', 'genus', 'class']:
                    res['classification'][new_key] = tmp_val
                    continue
                # Everything else goes at the top level
                res[new_key] = tmp_val
            data.append(res)
    return data


def parse_array(v):
    # If the value starts with { and ends with }, strip the braces,
    # split on |, and strip whitespace from each item
    if (v[0] == "{") and (v[-1] == "}"):
        v = v.lstrip("{")
        v = v.rstrip("}")
        v_array = v.split("|")
        v_array = [i.strip() for i in v_array]
        return v_array
    return [v]
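A quick self-contained check of parse_array (restated below) on the two input shapes it handles, a braced pipe-separated list and a bare single value:

```python
def parse_array(v):
    # Braced list: strip braces, split on |, strip each item
    if (v[0] == "{") and (v[-1] == "}"):
        v = v.lstrip("{").rstrip("}")
        return [i.strip() for i in v.split("|")]
    # Single value: still wrapped in a list
    return [v]

print(parse_array('{Cyrene Peckham|Cyrene P.}'))  # ['Cyrene Peckham', 'Cyrene P.']
print(parse_array('Argiope'))                     # ['Argiope']
```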

def test():
    # Test function; if no assertion fails, the results are correct
    data = process_file(DATAFILE, FIELDS)
    print "Your first entry:"
    pprint.pprint(data[0])
    first_entry = {
        "synonym": None, 
        "name": "Argiope", 
        "classification": {
            "kingdom": "Animal", 
            "family": "Orb-weaver spider", 
            "order": "Spider", 
            "phylum": "Arthropod", 
            "genus": None, 
            "class": "Arachnid"
        }, 
        "uri": "http://dbpedia.org/resource/Argiope_(spider)", 
        "label": "Argiope", 
        "description": "The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced."
    }

    assert len(data) == 76
    assert data[0] == first_entry
    assert data[17]["name"] == "Ogdenia"
    assert data[48]["label"] == "Hydrachnidiae"
    assert data[14]["synonym"] == ["Cyrene Peckham & Peckham"]

if __name__ == "__main__":
    test()

2. Insert the data into MongoDB

import json

def insert_data(data, db):
    # Simply call insert to load the documents
    arachnids = db.arachnid.insert(data)


if __name__ == "__main__":
    
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017")
    db = client.examples

    with open('arachnid.json') as f:
        data = json.loads(f.read())
        insert_data(data, db)
        print db.arachnid.find_one()

Problem Set 04

Sample document

{
    "_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
    "elevation" : 1855,
    "name" : "Kud",
    "country" : "India",
    "lon" : 75.28,
    "lat" : 33.08,
    "isPartOf" : [
        "Jammu and Kashmir",
        "Udhampur district"
    ],
    "timeZone" : [
        "Indian Standard Time"
    ],
    "population" : 1140
}

1. Find the most common city name

def make_pipeline():
    # 1. Use $match to drop documents whose name is missing
    # 2. Use $group to group by name and tally each name's count
    # 3. Use $sort to order by count, descending
    # 4. Use $limit 1 to return the answer
    pipeline = [
        {'$match': {'name': {'$ne': None}}},
        {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': 1}
    ]
    return pipeline

2. Among regions whose cities lie at a longitude between 75 and 80, find the region containing the most cities

def make_pipeline():
    # 1. Use $match to select cities in India with longitude between 75 and 80
    # 2. Use $unwind to split the isPartOf regions
    # 3. Use $group to group by region and tally a count
    # 4. Use $sort to order by count, descending
    # 5. Use $limit 1 to return the answer
    pipeline = [
        {'$match': {'country': 'India', 'lon': {'$gte': 75, '$lte': 80}}},
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': '$isPartOf', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': 1}
    ]
    return pipeline

3. Compute the average population (the wording is ambiguous; the intent is to compute each country's average regional population)

def make_pipeline():
    # 1. Use $unwind to split the isPartOf regions
    # 2. Use $group on (country, region) to compute each region's
    #    average city population
    # 3. Use a second $group on country to average the regional averages
    pipeline = [
        {'$unwind': '$isPartOf'},
        {'$group': {'_id': {'country': '$country', 'region': '$isPartOf'},
                    'avgCityPopulation': {'$avg': '$population'}}},
        {'$group': {'_id': '$_id.country',
                    'avgRegionalPopulation': {'$avg': '$avgCityPopulation'}}}
    ]
    return pipeline

Reference: https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/

