[Hadoop source code reading][9] - MapReduce - starting from WordCount

Reposted article. 2012-09-25 20:35. Category: Hadoop source code reading / hadoop

1. The WordCount code is as follows

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

        // Mapper: splits each input line into tokens and emits (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer (also used as the combiner): sums the counts of one word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Strip the generic Hadoop options; what remains is <in> <out>.
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
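
To follow the data flow of the job above without a cluster, here is a minimal plain-Java sketch of the same three steps: map, group by key, reduce. It does not use any Hadoop classes. `LocalWordCount`, `shuffle` and `run` are hypothetical names introduced only for illustration, and a `TreeMap` stands in for the framework's sort-and-group step, which is an assumption of this sketch and not how Hadoop implements the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the WordCount job; no Hadoop classes involved.
class LocalWordCount {

    // Map step: emit (word, 1) for every token, mirroring TokenizerMapper.map().
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle step (assumption: a TreeMap imitates sorting and grouping by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: sum all values of one key, mirroring IntSumReducer.reduce().
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every line, groups the intermediate pairs, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(Arrays.asList("hello hadoop", "hello world"));
        // One pair per line, key and value separated by a tab.
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

For the two input lines in `main`, the intermediate pairs for "hello" are grouped into one list of two ones, so the reduce step yields 2 for "hello" and 1 for each of the other words.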

2. What elements can a runnable MapReduce program contain?

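Commonly customized JobConf parameters

| Parameter | Purpose | Default | Other implementations |
|---|---|---|---|
| InputFormat | Splits the input data set into smaller InputSplits, each handled by one Mapper. InputFormat also supplies a RecordReader implementation that parses an InputSplit into <key,value> pairs for the map function. | TextInputFormat (for text files: splits the files into InputSplits and uses LineRecordReader to parse each InputSplit into <key,value> pairs, one per line; the key is the byte offset of the line in the file, the value is the line itself) | SequenceFileInputFormat |
| OutputFormat | Supplies a RecordWriter implementation that writes the final result. | TextOutputFormat (uses LineRecordWriter to write the final result as a plain text file, one <key,value> pair per line, with key and value separated by a tab) | SequenceFileOutputFormat |
| OutputKeyClass | Type of the key in the final result. | LongWritable | |
| OutputValueClass | Type of the value in the final result. | Text | |
| MapperClass | The Mapper class; implements the map function, which maps input <key,value> pairs to intermediate results. | IdentityMapper (passes the input <key,value> pairs through unchanged as intermediate results) | LogRegexMapper, InverseMapper |
| CombinerClass | Implements the combine function, which merges duplicate keys in the intermediate results. | null (duplicate keys in the intermediate results are not merged) | |
| ReducerClass | The Reducer class; implements the reduce function, which merges the intermediate results into the final result. | IdentityReducer (writes the intermediate results directly as the final result) | AccumulatingReducer, LongSumReducer |
| InputPath | Sets the job's input directory; the job processes every file under it. | null | |
| OutputPath | Sets the job's output directory; the final result is written under it. | null | |
| MapOutputKeyClass | Type of the key in the intermediate results emitted by map. | OutputKeyClass, if the user does not set it | |
| MapOutputValueClass | Type of the value in the intermediate results emitted by map. | OutputValueClass, if the user does not set it | |
| OutputKeyComparator | Comparator used when sorting the keys of the results. | The WritableComparator for the key type (keys implement WritableComparable) | |
| PartitionerClass | The partition function that divides the intermediate results by key into R parts, each handled by one Reducer. | HashPartitioner (partitions with a hash function) | KeyFieldBasedPartitioner, PipesPartitioner |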
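
The HashPartitioner row is easier to picture with a few concrete keys. The sketch below applies the formula HashPartitioner uses, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, to plain Java objects. `HashPartitionSketch` and `partition` are hypothetical names for illustration, and `String`/`Integer` keys are an assumption: a real job hashes Writable keys such as `Text`, whose `hashCode()` differs from `String.hashCode()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of hash partitioning; no Hadoop classes involved.
class HashPartitionSketch {

    // Clear the sign bit so negative hash codes still map to a valid index,
    // then take the remainder by the number of reduce tasks.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Distributes keys into numReduceTasks buckets, one bucket per Reducer.
    static List<List<String>> partition(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(getPartition(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> buckets = partition(Arrays.asList("a", "b", "c"), 2);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("reducer " + i + " <- " + buckets.get(i));
        }
    }
}
```

Because the partition depends only on the key's hash, every occurrence of the same key lands on the same Reducer, which is what lets a single reduce call see all values of that key.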

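The points that are easy to confuse:

InputFormat: reads the input files and passes them to map in the input format the Mapper declares, i.e. the first two type parameters (Object, Text) below.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>

MapOutputKeyClass, MapOutputValueClass: define the format of map's output, which is also the format of reduce's input, i.e. the last two type parameters of Mapper and the first two of Reducer (Text, IntWritable).

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>

OutputKeyClass, OutputValueClass: define the format of reduce's output, which is also the input format of OutputFormat, i.e. the last two type parameters of Reducer (Text, IntWritable).

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>

OutputFormat: writes the result data of the MapReduce job to files.

OutputKeyComparator/OutputValueGroupingComparator: used for secondary sort.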
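
How the two comparators cooperate in a secondary sort can be sketched in plain Java. The idea assumed here is the common composite-key pattern: the value to be sorted is folded into the key, the sort comparator orders by the whole composite key, and the grouping comparator looks only at the natural key, so each group receives its values already in order. `SecondarySortSketch`, `CompositeKey`, `SORT` and `GROUP` are hypothetical names, and the loop merely imitates what the framework does between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of secondary sort; no Hadoop classes involved.
class SecondarySortSketch {

    // Composite key: the natural key plus the value that should arrive sorted.
    static final class CompositeKey {
        final String name;
        final int score;

        CompositeKey(String name, int score) {
            this.name = name;
            this.score = score;
        }
    }

    // Plays the role of OutputKeyComparator: orders by name, then by score.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.name).thenComparingInt(k -> k.score);

    // Plays the role of OutputValueGroupingComparator: only the name decides the group.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.name);

    // Sorts with SORT, then starts a new group whenever GROUP sees a different key.
    static Map<String, List<Integer>> run(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Integer> current = null;
        for (CompositeKey k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                current = new ArrayList<>();
                groups.put(k.name, current);
            }
            current.add(k.score);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = run(Arrays.asList(
                new CompositeKey("b", 2), new CompositeKey("a", 3),
                new CompositeKey("a", 1), new CompositeKey("b", 1)));
        groups.forEach((name, scores) -> System.out.println(name + "\t" + scores));
    }
}
```

With the four keys in `main`, group "a" receives the scores 1 then 3 and group "b" receives 1 then 2, without the reduce side having to sort anything itself.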

