Commonly Used InputFormat and OutputFormat Classes in Hadoop Development

Reposted article, dated 2012-04-23 00:47. Categories: hadoop_basics / hadoop

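When reading data with Hadoop streaming, if the input is a sequence file and it is read with "-inputformat org.apache.hadoop.mapred.SequenceFileInputFormat", the data that comes in displays as garbled text. The reason is that what gets read is still in sequence file format, including the sequence file header. Changing the option to "-inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat" makes the data read correctly.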
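As a quick way to tell whether an input really is a sequence file before choosing the -inputformat option, one can look at its first bytes: a sequence file header begins with the three ASCII bytes "SEQ". The sketch below only checks that prefix; the class and method names are hypothetical, and fetching the first bytes of the file from HDFS is left out.

```java
import java.nio.charset.StandardCharsets;

final class SeqMagicSketch {
    /** Returns true if the bytes begin with the 3-byte "SEQ" magic of a sequence file header. */
    static boolean looksLikeSequenceFile(byte[] head) {
        return head.length >= 3 && head[0] == 'S' && head[1] == 'E' && head[2] == 'Q';
    }

    public static void main(String[] args) {
        byte[] seqHead = {'S', 'E', 'Q', 6, 0, 0};   // magic followed by made-up bytes
        byte[] textHead = "key\tvalue\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeSequenceFile(seqHead));
        System.out.println(looksLikeSequenceFile(textHead));
    }
}
```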

The following is excerpted from elsewhere and gives a rough introduction to InputFormat and OutputFormat:

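The MapReduce framework in Hadoop relies on an InputFormat to supply data and on an OutputFormat to write it out; every MapReduce program needs both. Hadoop ships a range of InputFormat and OutputFormat implementations to make development easier, and this article introduces several commonly used ones.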

TextInputFormat
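Reads plain text files. The file is divided into lines terminated by LF or CR. The key is the position of each line (its offset, type LongWritable) and the value is the content of the line (type Text).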
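The (offset, line) records described above can be illustrated without Hadoop. The following is a minimal plain-Java sketch, not Hadoop's own record reader: it assumes UTF-8 input, treats CRLF as a single terminator, and uses hypothetical class and method names.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TextRecordsSketch {
    /** Splits raw bytes into (byte offset of line start, line content) pairs. */
    static List<Map.Entry<Long, String>> readRecords(byte[] data) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < data.length) {
            byte b = data[i];
            if (b == '\n' || b == '\r') {
                out.add(entry(data, start, i));
                // Assumption: a CR immediately followed by LF counts as one terminator.
                if (b == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            out.add(entry(data, start, data.length)); // last line without a terminator
        }
        return out;
    }

    private static Map.Entry<Long, String> entry(byte[] data, int start, int end) {
        String line = new String(data, start, end - start, StandardCharsets.UTF_8);
        return new AbstractMap.SimpleImmutableEntry<>((long) start, line);
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncd\r\nef".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> r : readRecords(data)) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
    }
}
```

Note that the key is an offset into the file, not a line number, so consecutive keys differ by the length of the preceding line plus its terminator.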

KeyValueTextInputFormat
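Also reads text files. If a line is divided into two parts by the separator (tab by default), the first part is the key and the rest is the value; if the line contains no separator, the whole line is the key and the value is empty.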
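The splitting rule above is small enough to write out directly. This is a plain-Java sketch of the rule as described, with hypothetical names; it is not the Hadoop class itself.

```java
final class KeyValueSplitSketch {
    /**
     * Splits a line at the first occurrence of the separator.
     * Returns {key, value}; with no separator the whole line is the key and the value is "".
     */
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("user1\tclick\t2012-04-23", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Only the first separator matters: any later tabs stay inside the value.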

SequenceFileInputFormat
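Reads sequence files. A sequence file is a binary file in Hadoop's own format for storing data. SequenceFileInputFormat has two subclasses: SequenceFileAsBinaryInputFormat, which reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat, which reads keys and values as Text.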

SequenceFileInputFilter
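Selects the records in a sequence file that satisfy a filter. The filter is specified with setFilterClass, and three filters are built in: RegexFilter keeps records whose key matches a given regular expression; PercentFilter takes a parameter f and keeps records whose record number satisfies number % f == 0; MD5Filter takes a parameter f and keeps records for which MD5(key) % f == 0.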
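The three selection rules can be sketched as plain predicates. This is an illustration of the rules as stated, not Hadoop's implementation: the names are hypothetical, the record counter is assumed to start at 0, the regex is assumed to be applied to the whole key, and the MD5 digest is assumed to be interpreted as one unsigned integer (Hadoop may reduce the digest differently).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

final class SequenceFilterSketch {
    /** RegexFilter-style rule: keep the record if the whole key matches the pattern. */
    static boolean regexAccept(Pattern pattern, String key) {
        return pattern.matcher(key).matches();
    }

    /** PercentFilter-style rule: keep every f-th record (counter assumed to start at 0). */
    static boolean percentAccept(long recordNumber, int f) {
        return recordNumber % f == 0;
    }

    /** MD5Filter-style rule: keep the record if MD5(key) % f == 0. */
    static boolean md5Accept(String key, int f) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Assumption: treat the 16 digest bytes as one non-negative integer.
            BigInteger value = new BigInteger(1, digest);
            return value.mod(BigInteger.valueOf(f)).signum() == 0;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(regexAccept(Pattern.compile("user\\d+"), "user42"));
        System.out.println(percentAccept(6, 3));
        System.out.println(md5Accept("user42", 4));
    }
}
```

The practical difference between the last two rules is that the percent rule depends on a record's position in the file, while the MD5 rule depends only on the key, so the same key is always kept or always dropped.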

NLineInputFormat
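Added in 0.18.x. It splits a file by lines, for example so that each line of the file corresponds to one map. The key is the position of each line (its offset, type LongWritable) and the value is the content of the line (type Text).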
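The idea of splitting by line count rather than by byte size can be sketched as simple chunking. This plain-Java sketch uses hypothetical names and works on an in-memory list of lines; with n = 1 it gives the one-line-per-map case mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NLineSplitSketch {
    /** Groups lines into consecutive splits of at most n lines each. */
    static List<List<String>> splitByLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // Each inner list would be handed to its own map task.
        System.out.println(splitByLines(lines, 2));
    }
}
```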

CompositeInputFormat: used to join multiple data sources.
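To give a feel for what joining sources on the input side involves, here is a sketch of an inner merge join over two inputs. It is not CompositeInputFormat itself: the names are hypothetical, and it assumes both inputs are already sorted by key with each key appearing at most once per input.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class MergeJoinSketch {
    /** Inner join of two key-sorted inputs; emits "key<TAB>leftValue,rightValue". */
    static List<String> innerJoin(List<Map.Entry<String, String>> left,
                                  List<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).getKey().compareTo(right.get(j).getKey());
            if (cmp < 0) {
                i++;            // key only on the left, skip it
            } else if (cmp > 0) {
                j++;            // key only on the right, skip it
            } else {
                out.add(left.get(i).getKey() + "\t"
                        + left.get(i).getValue() + "," + right.get(j).getValue());
                i++;
                j++;
            }
        }
        return out;
    }

    static Map.Entry<String, String> kv(String k, String v) {
        return new AbstractMap.SimpleImmutableEntry<>(k, v);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> left = Arrays.asList(kv("a", "1"), kv("b", "2"), kv("d", "4"));
        List<Map.Entry<String, String>> right = Arrays.asList(kv("b", "x"), kv("c", "y"), kv("d", "z"));
        System.out.println(innerJoin(left, right));
    }
}
```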

TextOutputFormat: writes to plain text files, one record per line in the form key + "\t" + value (the separator is a tab by default).

NullOutputFormat: the /dev/null of Hadoop; output sent to it disappears into a black hole.

SequenceFileOutputFormat: writes output as sequence files.

MultipleSequenceFileOutputFormat, MultipleTextOutputFormat: write records to different files depending on the key.
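Routing records to files by key can be sketched in plain Java as grouping lines under a file name derived from the key. The names and the "part-" naming scheme below are hypothetical, and the "files" are in-memory lists. In Hadoop's old mapred API the comparable hook is, as far as I know, overriding generateFileNameForKeyValue in a subclass of these output formats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MultiOutputSketch {
    /** Hypothetical naming rule: one output file per distinct key. */
    static String fileNameFor(String key) {
        return "part-" + key;
    }

    /** Groups {key, value} records by target file name, keeping input order within each file. */
    static Map<String, List<String>> route(List<String[]> records) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String[] kv : records) {
            files.computeIfAbsent(fileNameFor(kv[0]), name -> new ArrayList<>())
                 .add(kv[0] + "\t" + kv[1]); // same key<TAB>value line layout as text output
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"us", "a"},
                new String[] {"cn", "b"},
                new String[] {"us", "c"});
        System.out.println(route(records));
    }
}
```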

DBInputFormat and DBOutputFormat: read from a database and write to a database; expected to be added in version 0.19.

Source: http://www.cnblogs.com/xuxm2007/archive/2011/09/01/2161974.html
