Handling Special Delimiters in Hive
By default, Hive only supports single-byte field delimiters in data files, and the default delimiter is the single character \001. You can of course specify a different field delimiter when creating a table. But what if the delimiter in the data file is multi-character, as in the following sample:
01||zhangsan
02||lisi
03||wangwu
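For comparison, a custom single-character delimiter is just one clause in the DDL. A minimal sketch, assuming comma-separated input (the table name and columns here are illustrative):

create table t_csv(id string, name string)
row format delimited fields terminated by ','
stored as textfile;

The fields terminated by clause only takes a single character, so it cannot express the two-character || delimiter above; the two approaches below work around that limitation.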
Supplement: how Hive reads data
1. First, a concrete implementation of InputFormat reads the file and returns one record at a time (a record may be a physical line, or whatever counts as a "line" in your logic).
2. Then a concrete implementation of SerDe (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each returned record into fields.
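To make the two stages concrete, a plain text table can be written with its SerDe, InputFormat, and OutputFormat spelled out explicitly. A sketch of what stored as textfile roughly expands to, assuming the usual text-file defaults:

create table t_default(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

The two approaches below each hook into one of these stages: RegexSerDe swaps the SerDe, while the custom InputFormat swaps the record reader.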
Using RegexSerDe to extract fields via a regular expression
1. Create the table
create table t_bi_reg(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
    'input.regex'='(.*)\\|\\|(.*)',
    'output.format.string'='%1$s%2$s'
)
stored as textfile;
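The input.regex escapes each | as \\| because | is a regex metacharacter. The pattern can be sanity-checked with Hive's built-in regexp_extract, which uses the same escaping. A quick sketch, which should return 01 and zhangsan (capture groups one and two):

hive> select regexp_extract('01||zhangsan', '(.*)\\|\\|(.*)', 1),
             regexp_extract('01||zhangsan', '(.*)\\|\\|(.*)', 2);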
2. Load the data
01||zhangsan
02||lisi
03||wangwu
load data local inpath '/root/lianggang.txt' into table t_bi_reg;
3. Query
hive> select * from t_bi_reg;
OK
01      zhangsan
02      lisi
03      wangwu
Solving the special-delimiter problem with a custom InputFormat
The idea: while the InputFormat reads each line, it replaces the multi-byte delimiter in the data with Hive's default delimiter (Ctrl+A, i.e. \001) or with some other single-character substitute, so that during the SerDe stage Hive can extract fields with a plain single-byte delimiter.
package cn.gec.bigdata.hive.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class BiDelimiterInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the standard LineRecordReader so each line can be rewritten
        // before Hive's SerDe splits it into fields
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        // BiRecordReader reader = new BiRecordReader(job, (FileSplit) genericSplit);
        return reader;
    }

    public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            boolean next = reader.next(key, text);
            if (next) {
                // Replace the two-character delimiter "||" with a single '|';
                // "\\|" in the replacement string is an escaped literal pipe
                String replaceText = text.toString().replaceAll("\\|\\|", "\\|");
                value.set(replaceText);
            }
            return next;
        }
    }
}
1. Package the class into a jar and place it under $HIVE_HOME/lib
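Alternatively, the jar can be registered for the current session only, using Hive's add jar command (the path and jar name here are illustrative):

hive> add jar /root/bi-inputformat.jar;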
2. Create the table, specifying the custom InputFormat
create table t_lianggang(id string, name string)
row format delimited fields terminated by '|'
stored as
inputformat 'cn.gec.bigdata.hive.inputformat.BiDelimiterInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
3. Load the data
01||zhangsan
02||lisi
03||wangwu
load data local inpath '/root/lianggang.txt' into table t_lianggang;
4. Query
hive> select * from t_lianggang;
OK
01      zhangsan
02      lisi
03      wangwu