Handling Special Delimiters in Hive
By default, Hive only supports single-byte field delimiters in data files, and the default delimiter is the single character \001. You can of course specify a different field delimiter when creating a table. But what if the delimiter in the data file is multi-character, as in the following sample:
01||zhangsan
02||lisi
03||wangwu
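For comparison, a custom single-character delimiter is just one clause in the DDL. A minimal sketch, assuming comma-separated input (the table name and columns here are illustrative):

create table t_csv(id string, name string)
row format delimited fields terminated by ','
stored as textfile;

The fields terminated by clause only takes a single character, so it cannot express the two-character || delimiter above; the two approaches below work around that limitation.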
Supplement: how Hive reads data
1. First, a concrete implementation of InputFormat reads the file and returns one record at a time (a record may be a physical line, or whatever counts as a "line" in your logic).
2. Then a concrete implementation of SerDe (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each returned record into fields.
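To make the two stages concrete, a plain text table can be written with its SerDe, InputFormat, and OutputFormat spelled out explicitly. A sketch of what stored as textfile roughly expands to, assuming the usual text-file defaults:

create table t_default(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
stored as
inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

The two approaches below each hook into one of these stages: RegexSerDe swaps the SerDe, while the custom InputFormat swaps the record reader.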
Using RegexSerDe to extract fields via a regular expression
1. Create the table
create table t_bi_reg(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
    'input.regex'='(.*)\\|\\|(.*)',
    'output.format.string'='%1$s%2$s'
)
stored as textfile;
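The input.regex escapes each | as \\| because | is a regex metacharacter. The pattern can be sanity-checked with Hive's built-in regexp_extract, which uses the same escaping. A quick sketch, which should return 01 and zhangsan (capture groups one and two):

hive> select regexp_extract('01||zhangsan', '(.*)\\|\\|(.*)', 1),
             regexp_extract('01||zhangsan', '(.*)\\|\\|(.*)', 2);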
2. Load the data
01||zhangsan
02||lisi
03||wangwu
load data local inpath '/root/lianggang.txt' into table t_bi_reg;
3. Query
hive> select * from t_bi_reg;
OK
01      zhangsan
02      lisi
03      wangwu
Solving the special-delimiter problem with a custom InputFormat
The idea: while the InputFormat reads each line, it replaces the multi-byte delimiter in the data with Hive's default delimiter (Ctrl+A, i.e. \001) or with some other single-character substitute, so that during the SerDe stage Hive can extract fields with a plain single-byte delimiter.
package cn.gec.bigdata.hive.inputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class BiDelimiterInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the standard LineRecordReader so each line can be rewritten
        // before Hive's SerDe splits it into fields
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        // BiRecordReader reader = new BiRecordReader(job, (FileSplit) genericSplit);
        return reader;
    }

    public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            boolean next = reader.next(key, text);
            if (next) {
                // Replace the two-character delimiter "||" with a single '|';
                // "\\|" in the replacement string is an escaped literal pipe
                String replaceText = text.toString().replaceAll("\\|\\|", "\\|");
                value.set(replaceText);
            }
            return next;
        }
    }
}
1. Package the class into a jar and place it under $HIVE_HOME/lib
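Alternatively, the jar can be registered for the current session only, using Hive's add jar command (the path and jar name here are illustrative):

hive> add jar /root/bi-inputformat.jar;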
2. Create the table, specifying the custom InputFormat
create table t_lianggang(id string, name string)
row format delimited fields terminated by '|'
stored as
inputformat 'cn.gec.bigdata.hive.inputformat.BiDelimiterInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
3. Load the data
01||zhangsan
02||lisi
03||wangwu
load data local inpath '/root/lianggang.txt' into table t_lianggang;
4. Query
hive> select * from t_lianggang;
OK
01      zhangsan
02      lisi
03      wangwu