MongoDB聚合管道


通過上一篇文章中,認識了MongoDB中四個聚合操作,提供基本功能的count、distinct和group,還有可以提供強大功能的mapReduce。

在MongoDB的2.2版本以后,聚合框架中多了一個新的成員,聚合管道,數據進入管道后就會經過一級級的處理,直到輸出。

對於數據量不是特別大,邏輯也不是特別復雜的聚合操作,聚合管道還是比mapReduce有很多優勢的:

  1. 相比mapReduce,聚合管道比較容易理解和使用
  2. 可以直接使用管道表達式操作符,省掉了很多自定義js function,一定程度上提高執行效率
  3. 和mapReduce一樣,它也可以作用於分片集合

但是,對於數據量大,邏輯復雜的聚合操作,還是要使用mapReduce實現。

 

聚合管道

在聚合管道中,每一步操作(管道操作符)都是一個工作階段(stage),所有的stage存放在一個array中。MongoDB文檔中的描述如下:

db.collection.aggregate( [ { <stage> }, ... ] )

在聚合管道中,每一個stage都對應一個管道操作符,根據MongoDB文檔,聚合管道可以支持以下管道操作符:

Name

Description

$geoNear

Returns an ordered stream of documents based on the proximity to a geospatial point. Incorporates the functionality of $match, $sort, and $limit for geospatial data. The output documents include an additional distance field and can include a location identifier field.

$group

Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.

$limit

Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).

$match

Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. $match uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).

$out

Writes the resulting documents of the aggregation pipeline to a collection. To use the $out stage, it must be the last stage in the pipeline.

$project

Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.

$redact

Reshapes each document in the stream by restricting the content for each document based on information stored in the documents themselves. Incorporates the functionality of $project and $match. Can be used to implement field level redaction. For each input document, outputs either one or zero document.

$skip

Skips the first n documents where n is the specified skip number and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (if after the first n documents).

$sort

Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.

$unwind

Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents where n is the number of array elements and can be zero for an empty array.

 

下面通過具體的例子看看聚合管道的基本用法:

首先,通過以下代碼准備測試數據:

 1 var dataList = [
 2     { "name" : "Will0", "gender" : "Female", "age" : 22 , "classes": ["MongoDB", "C#", "C++"]},
 3     { "name" : "Will1", "gender" : "Female", "age" : 20 , "classes": ["Node", "JavaScript"]},
 4     { "name" : "Will2", "gender" : "Male", "age" : 24 , "classes": ["Java", "WPF", "C#"]},
 5     { "name" : "Will3", "gender" : "Male", "age" : 23 , "classes": ["WPF", "C",]},
 6     { "name" : "Will4", "gender" : "Male", "age" : 21 , "classes": ["SQL", "HTML"]},
 7     { "name" : "Will5", "gender" : "Male", "age" : 20 , "classes": ["DOM", "CSS", "HTML5"]},
 8     { "name" : "Will6", "gender" : "Female", "age" : 20 , "classes": ["PPT", "Word", "Excel"]},
 9     { "name" : "Will7", "gender" : "Female", "age" : 24 , "classes": ["HTML5", "C#"]},
10     { "name" : "Will8", "gender" : "Male", "age" : 21 , "classes": ["Java", "VB", "BASH"]},
11     { "name" : "Will9", "gender" : "Female", "age" : 24 , "classes": ["CSS"]}
12 ]              
13 
14 for(var i = 0; i < dataList.length; i++){
15     db.school.students.insert(dataList[i]);
16 }

 

$project

$project主要用於數據投影,實現字段的重命名、增加和刪除。下面例子中,重命名了"name"字段,增加了"birthYear"字段,刪除了"_id"字段。

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                         {"$match": {"age": {"$lte": 22}}},
 5 ...                         {"$project": {"_id": 0, "studentName": "$name", "gender": 1, "birthYear": {"$subtract": [2014, "$age"]}}},
 6 ...                         {"$sort": {"birthYear":1}}
 7 ...                     ]
 8 ... })
 9 {
10         "result" : [
11                 {
12                         "gender" : "Female",
13                         "studentName" : "Will0",
14                         "birthYear" : 1992
15                 },
16                 {
17                         "gender" : "Male",
18                         "studentName" : "Will4",
19                         "birthYear" : 1993
20                 },
21                 {
22                         "gender" : "Male",
23                         "studentName" : "Will8",
24                         "birthYear" : 1993
25                 },
26                 {
27                         "gender" : "Female",
28                         "studentName" : "Will1",
29                         "birthYear" : 1994
30                 },
31                 {
32                         "gender" : "Male",
33                         "studentName" : "Will5",
34                         "birthYear" : 1994
35                 },
36                 {
37                         "gender" : "Female",
38                         "studentName" : "Will6",
39                         "birthYear" : 1994
40                 }
41         ],
42         "ok" : 1
43 }
44 >

 

$unwind

$unwind用來將數組拆分為獨立字段。

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                         {"$match": {"age": 23}},
 5 ...                         {"$project": {"name": 1, "gender": 1, "age": 1, "classes": 1}},
 6 ...                         {"$unwind": "$classes"}
 7 ...                     ]
 8 ... })
 9 {
10         "result" : [
11                 {
12                         "_id" : ObjectId("54805220e31c9e1578ed0ccc"),
13                         "name" : "Will3",
14                         "gender" : "Male",
15                         "age" : 23,
16                         "classes" : "WPF"
17                 },
18                 {
19                         "_id" : ObjectId("54805220e31c9e1578ed0ccc"),
20                         "name" : "Will3",
21                         "gender" : "Male",
22                         "age" : 23,
23                         "classes" : "C"
24                 }
25         ],
26         "ok" : 1
27 }
28 >

 

$group

跟單個的group用法類似,用作分組操作。

使用$group時,必須要指定一個_id域,同時也可以包含一些算術類型的表達式操作符。

下面的例子就是用聚合管道實現得到男生和女生的平均年齡。

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                     {"$group":{"_id":"$gender", "avg": { "$avg": "$age" } }}
 5 ...                  ]
 6 ... })
 7 {
 8         "result" : [
 9                 {
10                         "_id" : "Male",
11                         "avg" : 21.8
12                 },
13                 {
14                         "_id" : "Female",
15                         "avg" : 22
16                 }
17         ],
18         "ok" : 1
19 }
20 >

 

查詢男生和女生的最大年齡

 1 > db.runCommand({
 2 ...     "aggregate": "school.students",
 3 ...     "pipeline": [
 4 ...                         {"$group": {"_id": "$gender", "max": {"$max": "$age"}}}
 5 ...                     ]
 6 ... })
 7 {
 8         "result" : [
 9                 {
10                         "_id" : "Male",
11                         "max" : 24
12                 },
13                 {
14                         "_id" : "Female",
15                         "max" : 24
16                 }
17         ],
18         "ok" : 1
19 }
20 >

 

 

管道操作符、管道表達式和表達式操作符

上面的例子中,已經接觸了這幾個概念,下面進行進一步介紹。

管道操作符作為"鍵",所對應的"值"叫做管道表達式,例如"{"$group": {"_id": "$gender", "max": {"$max": "$age"}}}"

  • $group是管道操作符
  • {"_id": "$gender", "max": {"$max": "$age"}}是管道表達式
  • $gender在文檔中叫做"field path",通過"$"符號來獲得特定字段,是管道表達式的一部分
  • $max是表達式操作符,正是因為聚合管道中提供了很多表達式操作符,我們可以省去很多自定義js函數

下面列出了常用的表達式操作符,更詳細的信息,請參閱MongoDB文檔

  • Boolean Expressions
    • $and, $or, $ not
  • Comparison Expressions
    • $cmp, $eq, $gt, $gte, $lt, $lte, $ne
  • Arithmetic Expressions
    • $add, $divide, $mod, $multiply, $subtract
  • String Expressions
    • $concat, $strcasecmp, $substr, $toLower, $toUpper

 

總結

聚合管道提供了一種mapReduce 的替代方案,mapReduce使用相對來說比較復雜,而聚合管道的擁有固定的接口,一系列可選擇的表達式操作符,對於大多數的聚合任務,聚合管道一般來說是首選方法。

Ps: 文章中使用的例子可以通過以下鏈接查看

http://files.cnblogs.com/wilber2013/pipeline.js

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM