Hadoop Programming Notes (1): The Mapper and Reducer Classes in Detail


This Hadoop Programming Notes series focuses on the programming side of Hadoop: the usage and purpose of the main classes and interfaces, programming techniques, and best practices. If you want to learn more about Hadoop's own features and its surrounding ecosystem (Pig, Hive, HBase, and so on), please see my other series, Hadoop Study Notes. I'm well aware of my own limits, so please bear with any mistakes and do point them out to me.

Note: this article is based on the Hadoop 1.0.4 API documentation.

1. Mapper

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job (which holds all of the job's configuration settings) via JobContext.getConfiguration().

The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context) to set up the task, then calls map(Object, Object, Context) for each key/value pair in the InputSplit, and finally calls cleanup(Context) to do any teardown work once the map task is done.
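As a minimal sketch of this lifecycle, here is the classic word-count mapper (the class and field names are my own, not from the API docs):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Word-count mapper: setup() runs once per task, map() once per
    // input record, cleanup() once after the last record.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void setup(Context context) {
        // One-time initialization, e.g. reading a job parameter:
        // String p = context.getConfiguration().get("my.custom.property");
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);  // emit <word, 1>
        }
      }

      @Override
      protected void cleanup(Context context) {
        // One-time teardown, e.g. releasing resources opened in setup().
      }
    }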

All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control the sorting and grouping of the intermediate data by specifying two key RawComparator classes (see the secondary-sort section below).

The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner, as sketched below.
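As an illustration (the class name and routing rule are invented for this sketch), a Partitioner that sends all keys starting with the same letter to the same reducer could look like this; it is registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route records by the first letter of the key, so keys sharing
    // a first letter all land on the same reducer.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '\0' : s.charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numPartitions;
      }
    }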

Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred over the network from the Mapper to the Reducer.
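For instance, in a word-count job the reducer itself can double as the combiner, because integer addition is associative and commutative (WordCountReducer refers to the reducer sketched in the Reduce section below):

    // The combiner must be a Reducer whose input and output key/value
    // types both match the map output types.
    job.setCombinerClass(WordCountReducer.class);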

Applications can specify via the Configuration whether and how the intermediate outputs are to be compressed, and which CompressionCodec is to be used.
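A sketch using the Hadoop 1.x property names (later releases renamed them to mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;

    Configuration conf = new Configuration();
    // Compress the map output and pick the codec to use for it.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);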

If the job has zero reduce tasks, the output of the Mapper is written directly to the OutputFormat without being sorted by key.
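To run such a map-only job, set the reduce count to zero in the driver:

    // Zero reducers: map output bypasses sort/shuffle entirely and
    // goes straight to the OutputFormat.
    job.setNumReduceTasks(0);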

Besides the setup(), map(), and cleanup() methods described above, the Mapper class has one more method: public void run(Mapper.Context context). Users can override it to gain finer control over how the Mapper executes.
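For reference, the default run() is essentially the loop below (paraphrasing the Hadoop 1.0.4 source); overriding it is how, for example, MultithreadedMapper processes records with multiple threads:

    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      // Pull records from the InputSplit and hand each one to map().
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
      cleanup(context);
    }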

2. Reducer

As with Mapper, a Reducer implementation can access the Configuration for the job (which holds all of the job's configuration settings) via JobContext.getConfiguration().

A Reducer has three primary phases:

1. Shuffle

The Reducer copies the sorted output from each Mapper across the network using HTTP.

2. Sort

The framework merge-sorts the Reducer inputs by key (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously; that is, map outputs are merged while they are being fetched.

2.1 Secondary Sort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will then be sorted using the entire composite key, but grouped using the grouping comparator, which decides which keys and values are sent to the same call to reduce(). The grouping comparator is specified via Job.setGroupingComparatorClass(Class); the sort order is controlled by Job.setSortComparatorClass(Class). A code sketch follows the example below.

For example, say you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like this:

      • Map Input Key: url
      • Map Input Value: document
      • Map Output Key: document checksum, url pagerank
      • Map Output Value: url
      • Partitioner: by checksum
      • OutputKeyComparator: by checksum and then decreasing pagerank
      • OutputValueGroupingComparator: by checksum
      (Note: pagerank here refers to PageRank, a measure of a web page's importance and popularity.)
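Here is a minimal sketch of the key pieces for this job. DocKey and the comparator are hypothetical names invented for the sketch, and a custom Partitioner over the checksum alone (analogous to the partitioner sketch in the Mapper section) is assumed as well:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical composite key: checksum is the "natural" key,
    // pagerank the secondary key used only for ordering.
    public class DocKey implements WritableComparable<DocKey> {
      private long checksum;
      private double pagerank;

      public long getChecksum() { return checksum; }

      @Override public void write(DataOutput out) throws IOException {
        out.writeLong(checksum);
        out.writeDouble(pagerank);
      }

      @Override public void readFields(DataInput in) throws IOException {
        checksum = in.readLong();
        pagerank = in.readDouble();
      }

      // Full sort order: checksum ascending, then pagerank descending,
      // so the "best" (highest-ranked) url arrives first in each group.
      @Override public int compareTo(DocKey o) {
        if (checksum != o.checksum) {
          return checksum < o.checksum ? -1 : 1;
        }
        return Double.compare(o.pagerank, pagerank);
      }
    }

    // Grouping comparator: compares the checksum only, so all values
    // whose keys share a checksum go to a single reduce() call.
    class ChecksumGroupingComparator extends WritableComparator {
      protected ChecksumGroupingComparator() {
        super(DocKey.class, true);  // true = instantiate keys for comparison
      }

      @Override @SuppressWarnings("rawtypes")
      public int compare(WritableComparable a, WritableComparable b) {
        long x = ((DocKey) a).getChecksum();
        long y = ((DocKey) b).getChecksum();
        return x < y ? -1 : (x == y ? 0 : 1);
      }
    }

On the driver side, the wiring is just job.setGroupingComparatorClass(ChecksumGroupingComparator.class); the sort order here comes from DocKey.compareTo(), so Job.setSortComparatorClass(Class) is only needed when the sort order should differ from the key's natural ordering.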

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> pair in the sorted inputs.
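A minimal word-count reducer as a sketch of this phase (this is the WordCountReducer referenced in the combiner note above):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Word-count reducer: reduce() is invoked once per key, with an
    // Iterable over all values grouped under that key.
    public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        result.set(sum);
        context.write(key, result);  // handed to the job's RecordWriter
      }
    }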

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

This article is original content; when reposting, please credit the source: http://www.cnblogs.com/beanmoon/archive/2012/12/06/2804594.html

