MapReduce: how map reads an input file


Our input file, hello0, has the following content:

xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown

 

Logically this is 3 records, separated by @_@.
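As a quick sanity check outside Hadoop, we can split the content on @_@ with plain Java. (`DelimiterDemo` is just an illustrative name; `String.split` works here because @_@ contains no regex metacharacters.)

```java
public class DelimiterDemo {
    // Split the file content into logical records on the "@_@" delimiter.
    public static String[] splitRecords(String fileContent) {
        // "@_@" has no regex metacharacters, so split() treats it literally
        return fileContent.split("@_@");
    }

    public static void main(String[] args) {
        String content = "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown";
        for (String record : splitRecords(content)) {
            System.out.println(record);
        }
    }
}
```

This prints the 3 records we expect map to receive once the delimiter is configured.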

Let's look at how the data is read by map.

1. Default configuration

/*
 New API
 */

//conf.set("textinputformat.record.delimiter", "@_@");

/*
job.setInputFormatClass(Format0.class);
//job.setInputFormatClass(Format1.class);  // error here

//or,
job.setInputFormatClass(Format3.class);

//job.setInputFormatClass(Format4.class);  // error here

job.setInputFormatClass(Format5.class);
*/

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test0 {

    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Print each record that the framework hands to map
            String line = value.toString();
            System.out.println(line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJarByClass(Test0.class);
        job.setJobName("myjob");

        job.setMapperClass(MyMapper.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

Debugging this, we can see that value holds the entire file content as a single record: by default the newline character is the record delimiter, and the file contains no newlines, so map is called only once.

 

2. Configuring textinputformat.record.delimiter

We set the textinputformat.record.delimiter parameter on the Configuration:

conf.set("textinputformat.record.delimiter", "@_@");

Now map reads the records as we expect, and is called 3 times.
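One detail worth noting: the parameter should be set before the Job is created, because Job.getInstance(conf) takes a copy of the configuration. A minimal sketch of the placement inside main:

```java
Configuration conf = new Configuration();
// Set the delimiter before creating the Job; Job.getInstance(conf)
// copies the configuration, so later changes to conf are not seen by the job.
conf.set("textinputformat.record.delimiter", "@_@");
Job job = Job.getInstance(conf);
```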

 

3. Custom TextInputFormat

Define a custom TextInputFormat and set the desired record delimiter in its createRecordReader method:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class Format5 extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext tac) {
        // Use "@_@" instead of the default newline as the record delimiter
        byte[] recordDelimiterBytes = "@_@".getBytes();
        return new LineRecordReader(recordDelimiterBytes);
    }
}

Apply it to the job:

 job.setInputFormatClass(Format5.class);

 

This gives the same result as method 2.
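If we do not want to hard-code the delimiter, the input format could read it from the job configuration instead. A hedged sketch of that variant (`record.delimiter.custom` is a made-up key chosen for this illustration, not a Hadoop-defined property):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ConfigurableFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext tac) {
        // "record.delimiter.custom" is a hypothetical key for this sketch;
        // fall back to "@_@" when it is not set.
        String delimiter = tac.getConfiguration().get("record.delimiter.custom", "@_@");
        return new LineRecordReader(delimiter.getBytes());
    }
}
```

The job would then set the key with conf.set("record.delimiter.custom", "@_@") before creating the Job.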

 

