Hadoop Hive Concept Learning Series: A First Look at Regular Expressions in Hive (Part 6)

This article is a repost. 2016-11-25 19:35. Series: Hadoop Hive Concept Learning Series

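Foreword

Regular expressions matter a great deal in Hive. For a big-data developer who wants to use Hive well, they are an essential tool.

Hive's regular expression support is quite powerful, and people who work with data rely on regular expressions every day. This post is a short summary of regular expressions in Hive. All of the code here was tested by hand and ran correctly.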

1.regexp

Syntax: A REGEXP B
Operand types: strings
Description: Behaves the same as RLIKE.

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d{8}';

This is equivalent to the following query:

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not rlike '\\d{8}';
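
One detail that is easy to miss: RLIKE and REGEXP succeed when any substring of the value matches, so '\\d{8}' also accepts a value that merely contains eight consecutive digits. The sketch below reproduces that behavior locally. It uses Python's re module as a stand-in for the Java regex engine that Hive uses, and the sample values are hypothetical, not taken from the olap_b_dw_hotelorder_f table. The Hive literal '\\d{8}' corresponds to the raw Python string r"\d{8}".

```python
import re

def rlike(value, pattern):
    """Stand-in for Hive's `A RLIKE B`: true if any substring of value matches."""
    return re.search(pattern, value) is not None

# Hypothetical create_date_wid values, for illustration only.
samples = ["20161125", "2016-11-25", "201611250", "2016112"]

# Same filter as `not rlike '\\d{8}'`: keeps values with no run of 8 digits.
not_matching = [v for v in samples if not rlike(v, r"\d{8}")]

# Anchored variant: keeps every value that is not exactly 8 digits.
not_exactly_8 = [v for v in samples if not rlike(v, r"^\d{8}$")]

print(not_matching)
print(not_exactly_8)
```

If the goal is to find values that are not exactly eight digits long, the anchored pattern '^\\d{8}$' is the safer choice in the Hive query as well.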

2.regexp_extract

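Syntax: regexp_extract(string subject, string pattern, int index)
Return type: string
Description: Matches the string subject against the regular expression pattern and returns the capture group selected by index. Index 0 returns the entire match, index 1 the first parenthesized group, and so on.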

hive> select regexp_extract('IloveYou','I(.*?)(You)',1) from test1 limit 1;

Total jobs = 1

...

Total MapReduce CPU Time Spent: 7 seconds 340 msec

love

Time taken: 28.067 seconds, Fetched: 1 row(s)

hive> select regexp_extract('IloveYou','I(.*?)(You)',2) from test1 limit 1;

Total jobs = 1

...

You

Time taken: 26.067 seconds, Fetched: 1 row(s)

hive> select regexp_extract('IloveYou','(I)(.*?)(You)',1) from test1 limit 1;

Total jobs = 1

...

I

Time taken: 26.057 seconds, Fetched: 1 row(s)

hive> select regexp_extract('IloveYou','(I)(.*?)(You)',0) from test1 limit 1;

Total jobs = 1

...

IloveYou

Time taken: 28.06 seconds, Fetched: 1 row(s)
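
Each query above takes about half a minute because it launches a MapReduce job, so it can be convenient to check group numbering locally first. The following is a minimal sketch, assuming Python's re module behaves like Hive's Java regex engine for these simple patterns. The helper name regexp_extract is only an imitation of the Hive function, and returning an empty string when nothing matches is an assumption about Hive's behavior.

```python
import re

def regexp_extract(subject, pattern, index=1):
    """Rough local imitation of Hive's regexp_extract (not the real UDF)."""
    match = re.search(pattern, subject)
    if match is None:
        # Assumption: Hive returns an empty string when the pattern does not match.
        return ""
    # Group 0 is the whole match; groups 1..n are the parenthesized groups.
    return match.group(index)

print(regexp_extract("IloveYou", "I(.*?)(You)", 1))
print(regexp_extract("IloveYou", "I(.*?)(You)", 2))
print(regexp_extract("IloveYou", "(I)(.*?)(You)", 1))
print(regexp_extract("IloveYou", "(I)(.*?)(You)", 0))
```

Groups are numbered by the position of their opening parenthesis, which is why wrapping the leading I in parentheses shifts "love" from index 1 to index 2.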

3.regexp_replace

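Syntax: regexp_replace(string A, string B, string C)
Return type: string
Description: Replaces every part of string A that matches the Java regular expression B with C. Note that in some cases characters must be escaped. The function is similar to regexp_replace in Oracle.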

hive> select regexp_replace("IloveYou","You","") from test1 limit 1;

Total jobs = 1

...

Ilove

Time taken: 26.063 seconds, Fetched: 1 row(s)

hive> select regexp_replace("IloveYou","You","lili") from test1 limit 1;

Total jobs = 1

...

Ilovelili
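
The note about escaping is worth a concrete case: a regex metacharacter such as "." matches any character unless it is escaped. The sketch below illustrates this with Python's re module standing in for Hive's Java regex engine. The helper name and the date string are hypothetical. In a Hive query the escaped pattern would be written as the string literal '\\.'.

```python
import re

def regexp_replace(subject, pattern, replacement):
    """Rough local imitation of Hive's regexp_replace (not the real UDF).

    The replacement is inserted literally; group references are not
    handled in this sketch.
    """
    return re.sub(pattern, lambda m: replacement, subject)

print(regexp_replace("IloveYou", "You", ""))
print(regexp_replace("IloveYou", "You", "lili"))

# Unescaped "." matches every character, so the whole string is replaced.
print(regexp_replace("2016.11.25", ".", "-"))

# Escaped "\." matches only the literal dots.
print(regexp_replace("2016.11.25", r"\.", "-"))
```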

Regular expressions in Hive
For example, see https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Search that page for "regex" to find the following:

CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

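The regular expression below is the one used in Hive. It has 9 capture groups, so the table definition must declare 9 columns to match.
"input.regex" = "([^ ]*) ([^ ]*) ([^.]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*) \"(.*)\" \"(.*)\""

The raw regular expression, followed by the same expression with the escaping that a Hive string literal needs:
([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"
([^ ]*) ([^ ]*) ([^.]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*) \"(.*)\" \"(.*)\"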
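
The one-group-per-column rule can be checked before creating the table. The sketch below is a rough local approximation, using Python's re module in place of Java's regex engine and an expression of the same shape as the raw one above. The sample log line is hypothetical. Requiring the whole line to match, and yielding NULL for every column when it does not, are assumptions about how RegexSerDe behaves.

```python
import re

# Column names from the apachelog table definition.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"'
)

def parse_line(line):
    """Map one log line to a column dict, one capture group per column."""
    if LOG_PATTERN.groups != len(COLUMNS):
        raise ValueError("capture groups and columns must have the same count")
    match = LOG_PATTERN.fullmatch(line)  # assumption: the whole line must match
    if match is None:
        return dict.fromkeys(COLUMNS)    # assumption: all columns become NULL
    return dict(zip(COLUMNS, match.groups()))

# Hypothetical log line in Apache combined format.
SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')

for name, value in parse_line(SAMPLE).items():
    print(name, "=", value)
```

Greedy groups such as (.*) can swallow more than intended on lines that contain extra quotes or brackets, so it is worth testing against a few real lines from the log before loading the data.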

Data source:
yarn-root-nodemanager-master.log
or
yarn-spark-nodemanager-master.log
yarn-hadoop-nodemanager-master.log

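There is also a handy tool for working with regular expressions:
RegexBuddy.exe

It is easy to use: double-click it to launch.

The colored highlighting in RegexBuddy (shown in the original post's screenshot) indicates that the regular expression we tested is correct.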
