Handling Special Delimiters in Hive

By default, Hive only supports single-byte field delimiters in data files; the default delimiter is \001. You can also specify a custom delimiter when creating a table. The problem arises when the delimiter in the data file is multi-character, as in the following sample:

01||zhangsan

02||lisi

03||wangwu
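To see why this is a problem, consider what happens if you simply declare a single '|' as the field delimiter: every "||" then produces a spurious empty field between the real columns. A minimal sketch using a plain String.split as a stand-in for single-byte field cutting:

```java
public class WhySingleByteFails {
    public static void main(String[] args) {
        // Splitting "01||zhangsan" on a single '|' yields an empty field
        // between the two real columns, so id/name no longer line up.
        String[] fields = "01||zhangsan".split("\\|");
        System.out.println(fields.length);         // 3
        System.out.println("[" + fields[1] + "]"); // [] -- the spurious empty field
    }
}
```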

 

Background: how Hive reads data

1. First, a concrete implementation of InputFormat reads the file data and returns records one by one (a record can be a physical line, or a "line" as defined by your own logic).

2. Then, a concrete implementation of a SerDe (default: org.apache.hadoop.hive.serde2.LazySimpleSerDe) splits each returned record into fields.
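This two-phase pipeline can be sketched in plain Java (the method names here are illustrative, not Hive APIs): the first method plays the InputFormat's role and yields records; the second plays the SerDe's role and cuts each record into fields on the single-byte \001 delimiter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class HiveReadSketch {
    // Phase 1 -- the InputFormat's job: turn raw file content into records.
    public static List<String> readRecords(String fileContent) {
        return Arrays.asList(fileContent.split("\n"));
    }

    // Phase 2 -- the SerDe's job: cut one record into fields on a
    // single-byte delimiter (\001 by default, as LazySimpleSerDe does).
    public static String[] deserialize(String record, String delimiter) {
        return record.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        String file = "01\u0001zhangsan\n02\u0001lisi";
        for (String record : readRecords(file)) {
            System.out.println(Arrays.toString(deserialize(record, "\u0001")));
        }
    }
}
```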

 

Using RegexSerDe to extract fields with a regular expression

1. Create the table

create table t_bi_reg(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
  'input.regex'='(.*)\\|\\|(.*)',
  'output.format.string'='%1$s%2$s'
)
stored as textfile;
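To check what the regex above captures, the same pattern can be tried outside Hive with plain java.util.regex (HiveQL unescapes the doubled backslashes, so the SerDe receives (.*)\|\|(.*)):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeCheck {
    // The pattern the SerDe receives after HiveQL unescapes '(.*)\\|\\|(.*)'.
    private static final Pattern ROW = Pattern.compile("(.*)\\|\\|(.*)");

    public static String[] extract(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null; // RegexSerDe emits NULL columns for non-matching rows
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] fields = extract("01||zhangsan");
        System.out.println(fields[0] + " | " + fields[1]); // 01 | zhangsan
    }
}
```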

 

2. Load the data

01||zhangsan

02||lisi

03||wangwu

 

load data local inpath '/root/lianggang.txt' into table t_bi_reg;

3. Query

hive> select * from t_bi_reg;

OK

01      zhangsan

02      lisi

03      wangwu

 

 

Solving the special-delimiter problem with a custom InputFormat

The idea is to replace the multi-byte delimiter with Hive's default delimiter (Ctrl+A, i.e. \001) or some other single-character delimiter while the InputFormat reads each line, so that the SerDe can then extract fields on a single-byte delimiter as usual.

package cn.gec.bigdata.hive.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class BiDelimiterInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the standard line reader so every line can be rewritten
        // before Hive's SerDe sees it.
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        // BiRecordReader reader = new BiRecordReader(job, (FileSplit) genericSplit);
        return reader;
    }

    public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            boolean next = reader.next(key, text);
            if (next) {
                // Collapse the two-character delimiter "||" into a single '|'
                // before the line reaches the SerDe.
                String replaceText = text.toString().replaceAll("\\|\\|", "\\|");
                value.set(replaceText);
            }
            return next;
        }
    }
}

 

1. Package the class into a jar and place it under $HIVE_HOME/lib

2. Create the table, specifying the custom InputFormat

create table t_lianggang(id string,name string)
row format delimited
fields terminated by '|'
stored as inputformat 'cn.gec.bigdata.hive.inputformat.BiDelimiterInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
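Note that the table declares fields terminated by '|': exactly the single character that BiDelimiterInputFormat substitutes for "||". The core transform is the replaceAll call from MyDemoRecordReader.next(), which can be sanity-checked on its own:

```java
public class DelimiterCollapseCheck {
    // Same transform as in MyDemoRecordReader.next(): collapse "||" to "|".
    public static String collapse(String line) {
        return line.replaceAll("\\|\\|", "|");
    }

    public static void main(String[] args) {
        System.out.println(collapse("01||zhangsan")); // 01|zhangsan
    }
}
```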

 

3. Load the data

01||zhangsan

02||lisi

03||wangwu

 

load data local inpath '/root/lianggang.txt' into table t_lianggang;

 

4. Query

hive> select * from t_lianggang;

OK

01      zhangsan

02      lisi

03      wangwu

 

 

 

