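Take the WordCount program as an example. Suppose there are three DataNodes, each holding different data, as shown in the table below:

| DataNode1 | DataNode2 | DataNode3 |
| --- | --- | --- |
| who are you are | who am i are | who is he am |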
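After the Map function runs, each node produces the following key-value pairs:

| DataNode1 | DataNode2 | DataNode3 |
| --- | --- | --- |
| who 1 | who 1 | who 1 |
| are 1 | am 1 | is 1 |
| you 1 | i 1 | he 1 |
| are 1 | are 1 | am 1 |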
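The Map step above can be sketched in a few lines of Python. This is only an illustration, not Hadoop's actual API: the function name `map_words` and the `splits` dictionary are assumptions made up for this example.

```python
def map_words(line):
    """Emit a (word, 1) pair for every word in the line, in input order."""
    return [(word, 1) for word in line.split()]

# The three input splits from the example, one per DataNode (hypothetical layout).
splits = {
    "DataNode1": "who are you are",
    "DataNode2": "who am i are",
    "DataNode3": "who is he am",
}

# Each node maps only its own data; nothing is shared between nodes yet.
map_output = {node: map_words(line) for node, line in splits.items()}

for node, pairs in map_output.items():
    print(node, pairs)
```

Note that the mapper does no counting at all: a word that appears twice simply produces two separate pairs.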
The pairs are then sorted by key on each node, giving:

| DataNode1 | DataNode2 | DataNode3 |
| --- | --- | --- |
| are 1 | am 1 | am 1 |
| are 1 | are 1 | he 1 |
| who 1 | i 1 | is 1 |
| you 1 | who 1 | who 1 |
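A minimal sketch of this per-node sort, assuming the map output is held as a plain Python list of pairs per node (the `map_output` dictionary is a stand-in invented for this example):

```python
# Map output per node, copied from the example above (hypothetical in-memory form).
map_output = {
    "DataNode1": [("who", 1), ("are", 1), ("you", 1), ("are", 1)],
    "DataNode2": [("who", 1), ("am", 1), ("i", 1), ("are", 1)],
    "DataNode3": [("who", 1), ("is", 1), ("he", 1), ("am", 1)],
}

# Sort each node's pairs by key only; the sort is local to the node.
sorted_output = {
    node: sorted(pairs, key=lambda kv: kv[0])
    for node, pairs in map_output.items()
}

for node, pairs in sorted_output.items():
    print(node, pairs)
```

Sorting puts equal keys next to each other, which is what makes the later aggregation steps a simple linear pass.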
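If a Combiner function is defined, values with the same key are aggregated locally on each node. You can think of the Combiner as a mini Reduce function:

| DataNode1 | DataNode2 | DataNode3 |
| --- | --- | --- |
| are 2 | am 1 | am 1 |
| who 1 | are 1 | he 1 |
| you 1 | i 1 | is 1 |
|  | who 1 | who 1 |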
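The Combiner can be sketched as a local sum over adjacent equal keys. This is an illustrative stand-in, not Hadoop code; the name `combine` is an assumption, and the input must already be sorted by key for `itertools.groupby` to group correctly.

```python
from itertools import groupby
from operator import itemgetter

def combine(sorted_pairs):
    """Sum the counts of adjacent pairs that share a key (input must be key-sorted)."""
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]

# Sorted map output of DataNode1 from the example.
node1_sorted = [("are", 1), ("are", 1), ("who", 1), ("you", 1)]
node1_combined = combine(node1_sorted)
print(node1_combined)
```

Only DataNode1 shrinks here, because it is the only node with a repeated word. The point of the Combiner is to cut down the amount of data that has to be copied to the Reducer.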
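If a Partition function is defined, the data is partitioned: there are as many Reducers running in parallel as there are partitions, and each one produces its own result file. By default, only one Reducer runs.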
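A partitioner decides which Reducer receives a given key. The sketch below follows the common hash-modulo idea (Hadoop's default partitioner also takes a hash of the key modulo the number of reduce tasks), but the function name `partition` and the use of `zlib.crc32` as the hash are assumptions chosen for this example, since Python's built-in `hash()` for strings changes between runs.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers) using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

words = ["am", "are", "he", "i", "is", "who", "you"]

# With a single reducer (the default), every key lands in partition 0.
single = {word: partition(word, 1) for word in words}

# With three reducers, keys are spread over partitions 0, 1 and 2.
spread = {word: partition(word, 3) for word in words}
print(spread)
```

The important property is that the same key always maps to the same partition, so all counts for one word end up at one Reducer.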
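Let's start with the single-Reducer case. The data is first copied (Copy) from the three DataNodes and then merged (Merge) into the Reducer:

| Reducer |
| --- |
| are 2 |
| who 1 |
| you 1 |
| am 1 |
| are 1 |
| i 1 |
| who 1 |
| am 1 |
| he 1 |
| is 1 |
| who 1 |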
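The data is then sorted by key (Sort). The Copy, Merge, and Sort steps are collectively called the Shuffle phase:

| Reducer |
| --- |
| am 1 |
| am 1 |
| are 2 |
| are 1 |
| he 1 |
| i 1 |
| is 1 |
| who 1 |
| who 1 |
| who 1 |
| you 1 |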
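The Copy, Merge, and Sort steps for a single Reducer can be sketched as follows. This is a simplified in-memory model under the assumption that each node's combined output is a Python list; the `combined` dictionary is made up for the example, and a real cluster transfers this data over the network.

```python
# Combiner output per node, copied from the example (hypothetical in-memory form).
combined = {
    "DataNode1": [("are", 2), ("who", 1), ("you", 1)],
    "DataNode2": [("am", 1), ("are", 1), ("i", 1), ("who", 1)],
    "DataNode3": [("am", 1), ("he", 1), ("is", 1), ("who", 1)],
}

# Copy + Merge: gather every node's pairs into one list on the reducer.
merged = [pair for pairs in combined.values() for pair in pairs]

# Sort by key only. Python's sort is stable, so pairs with the same key
# keep the order in which they arrived.
shuffled = sorted(merged, key=lambda kv: kv[0])

for key, count in shuffled:
    print(key, count)
```

After the sort, all pairs for the same word sit next to each other, for example `("are", 2)` followed by `("are", 1)`.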
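Finally, the data passes through the Reduce function, which produces the following output file:

| Reducer |
| --- |
| am 2 |
| are 3 |
| he 1 |
| i 1 |
| is 1 |
| who 3 |
| you 1 |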
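A minimal sketch of the Reduce step, assuming the shuffled input is a key-sorted list of pairs. The name `reduce_counts` is an assumption for this example; in Hadoop the reduce function receives one key together with all of its values, which is what the `groupby` loop imitates here.

```python
from itertools import groupby
from operator import itemgetter

def reduce_counts(key, values):
    """Reduce function for WordCount: add up all counts for one key."""
    return key, sum(values)

# Key-sorted input to the reducer, as produced by the Shuffle phase.
shuffled = [
    ("am", 1), ("am", 1), ("are", 2), ("are", 1), ("he", 1), ("i", 1),
    ("is", 1), ("who", 1), ("who", 1), ("who", 1), ("you", 1),
]

# Call the reduce function once per distinct key.
result = [
    reduce_counts(key, [count for _, count in group])
    for key, group in groupby(shuffled, key=itemgetter(0))
]

for key, total in result:
    print(key, total)
```

Because the Combiner already pre-summed `are` on DataNode1, the Reducer adds 2 + 1 rather than 1 + 1 + 1, and the total is the same.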
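At this point, the whole MapReduce process is complete.
With multiple Reducers, the difference is that the data is copied to different machines, so the computation is split up. The data copied to each Reducer still goes through the Merge, Sort, and Reduce steps, and each Reducer ends up producing its own result file.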
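The multi-Reducer case can be sketched by routing each pair through a partitioner before the Merge, Sort, and Reduce steps. Everything here is an illustrative assumption: the names `partition` and `run_job`, the `zlib.crc32` hash, and the in-memory lists standing in for machines and result files.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    """Pick a reducer index for a key using a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(node_outputs, num_reducers):
    """Return one key-sorted list of (word, total) per reducer."""
    # Copy + Merge: each pair goes to the reducer chosen by the partitioner.
    inboxes = [[] for _ in range(num_reducers)]
    for pairs in node_outputs:
        for key, count in pairs:
            inboxes[partition(key, num_reducers)].append((key, count))

    results = []
    for inbox in inboxes:
        # Sort by key, then Reduce: sum the counts of each run of equal keys.
        inbox.sort(key=itemgetter(0))
        results.append([
            (key, sum(count for _, count in group))
            for key, group in groupby(inbox, key=itemgetter(0))
        ])
    return results

# Combiner output of the three DataNodes from the example.
combined = [
    [("are", 2), ("who", 1), ("you", 1)],
    [("am", 1), ("are", 1), ("i", 1), ("who", 1)],
    [("am", 1), ("he", 1), ("is", 1), ("who", 1)],
]

outputs = run_job(combined, 2)
for index, output in enumerate(outputs):
    print("reducer", index, output)
```

Which words land on which Reducer depends on the hash, but each word appears in exactly one output, and putting the outputs together gives the same totals as the single-Reducer run.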