Our input file, hello0, has the following content:
xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown
Logically it holds 3 records, separated by @_@.
Let's see how map reads this data...
1. Default configuration
/* New API */
//conf.set("textinputformat.record.delimiter", "@_@");
/*
job.setInputFormatClass(Format0.class);
//job.setInputFormatClass(Format1.class); error here
//or,
job.setInputFormatClass(Format3.class);
//job.setInputFormatClass(Format4.class); error here
job.setInputFormatClass(Format5.class);
*/
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test0 {

    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        // Called once per input record; value holds the record's text.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println(line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Test0.class);
        job.setJobName("myjob");
        job.setMapperClass(MyMapper.class);
        // args[0] is the input path, args[1] the output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
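Debugging shows that value holds the entire file content as a single record: by default records are delimited by line breaks, and the file content contains none. map is called only once.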
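To make the default behaviour concrete without a cluster, here is a minimal plain-Java sketch. It is not Hadoop's LineRecordReader; the class name DefaultLineSplitSketch and the method splitIntoLines are made up for illustration, and the sketch only mimics splitting on line terminators.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a stand-in for line-oriented record reading,
// not the actual Hadoop implementation.
class DefaultLineSplitSketch {

    // Split the text on line terminators (\r\n, \n, or \r).
    static List<String> splitIntoLines(String content) {
        return Arrays.asList(content.split("\\r\\n|\\n|\\r"));
    }

    public static void main(String[] args) {
        String content =
            "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown";
        List<String> records = splitIntoLines(content);
        // The content has no line terminator, so it stays one record.
        System.out.println(records.size());  // prints 1
        System.out.println(records.get(0));  // prints the whole content
    }
}
```

With the sample content the sketch yields exactly one record, which matches the single map call observed above.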
2. Setting textinputformat.record.delimiter
We set the textinputformat.record.delimiter parameter on the Configuration:
conf.set("textinputformat.record.delimiter", "@_@");
With this setting, map reads the records as we expect and is called 3 times.
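The effect of a custom delimiter can be sketched in plain Java as a byte-level scan. This is a simplified illustration under my own assumptions, not Hadoop's code: the class DelimiterScanSketch and its methods are hypothetical, the whole input is held in memory, and split boundaries are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: naive delimiter scan over an in-memory byte array.
class DelimiterScanSketch {

    // Cut data into records wherever the delimiter byte sequence occurs.
    static List<String> splitRecords(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;  // first byte of the record being collected
        int i = 0;      // current scan position
        while (i + delimiter.length <= data.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                i += delimiter.length;  // skip over the delimiter itself
                start = i;
            } else {
                i++;
            }
        }
        // Whatever follows the last delimiter is the final record.
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    // True if the delimiter bytes appear in data starting at offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown"
            .getBytes(StandardCharsets.UTF_8);
        byte[] delimiter = "@_@".getBytes(StandardCharsets.UTF_8);
        for (String record : splitRecords(data, delimiter)) {
            System.out.println(record);
        }
        // prints:
        // xiaowang 28 shanghai
        // zhangsan 38 beijing
        // someone 100 unknown
    }
}
```

Each element of the returned list corresponds to one value that map would receive, which is why map runs 3 times for the sample file.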
3. Custom TextInputFormat
Subclass TextInputFormat and set the required record delimiter in its createRecordReader method:
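import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class Format5 extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext tac) {
        // Use an explicit charset so the delimiter bytes do not depend on the platform default.
        byte[] recordDelimiterBytes = "@_@".getBytes(StandardCharsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }
}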
Apply it to the job:
job.setInputFormatClass(Format5.class);
This gives the same result as method 2.