Hadoop Hive Concepts Study Series: A First Look at Regular Expressions in Hive (Part 6)


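Foreword

Regular expressions in Hive are very important. For a big data developer who wants to use Hive well, they are a must-have.

Regular expressions in Hive are quite powerful, and people who work with data rely on them every day. This post is a short summary of Hive's regular expression support. All of the code has been tested by hand and runs correctly.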

1.regexp

Syntax: A REGEXP B
Operand types: strings
Description: Same behavior as RLIKE.

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d{8}';

This is equivalent to the following query:

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not rlike '\\d{8}';
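
RLIKE and REGEXP return true when any substring of the value matches the pattern, so '\\d{8}' also accepts longer values that merely contain eight consecutive digits. The sketch below models this with Python's re.search as a stand-in for Hive's Java regex engine (the two agree for this simple pattern). The helper name hive_rlike and the sample values are made up for illustration and do not come from the real table.

```python
import re

def hive_rlike(value, pattern):
    """Rough model of Hive's `A RLIKE B`: NULL in, NULL out; substring match otherwise."""
    if value is None or pattern is None:
        return None
    return re.search(pattern, value) is not None

# Hypothetical create_date_wid values, invented for this example.
samples = ["20170312", "2017-03-12", "201703120", "abc"]

# In a Hive string literal the backslash is doubled ('\\d{8}');
# the regex engine itself sees \d{8}.
loose = r"\d{8}"      # any eight consecutive digits anywhere in the value
strict = r"^\d{8}$"   # the whole value is exactly eight digits

for s in samples:
    print(s, hive_rlike(s, loose), hive_rlike(s, strict))
```

If the goal of the query above is to count values that are not exactly eight digits, the anchored form '^\\d{8}$' is probably the safer pattern.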





2.regexp_extract

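Syntax: regexp_extract(string subject, string pattern, int index)
Return type: string
Description: Matches the string subject against the regular expression pattern and returns the capture group selected by index. An index of 0 returns the entire match; 1 returns the first parenthesized group, 2 the second, and so on.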

hive> select regexp_extract('IloveYou','I(.*?)(You)',1) from test1 limit 1;

Total jobs = 1

...

Total MapReduce CPU Time Spent: 7 seconds 340 msec

OK

love

Time taken: 28.067 seconds, Fetched: 1 row(s)

 

hive> select regexp_extract('IloveYou','I(.*?)(You)',2) from test1 limit 1;

Total jobs = 1

...

OK

You

Time taken: 26.067 seconds, Fetched: 1 row(s)

 

 

hive> select regexp_extract('IloveYou','(I)(.*?)(You)',1) from test1 limit 1;

Total jobs = 1

...

OK

I

Time taken: 26.057 seconds, Fetched: 1 row(s)

 

 

 

 

hive> select regexp_extract('IloveYou','(I)(.*?)(You)',0) from test1 limit 1;

Total jobs = 1

...

OK

IloveYou

Time taken: 28.06 seconds, Fetched: 1 row(s)
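
The four queries above can be reproduced outside Hive with a small model of regexp_extract. This is only a sketch: it uses Python's re module rather than Hive's Java regex engine (they behave the same for these patterns), and returning an empty string when nothing matches is an assumption of this example.

```python
import re

def regexp_extract(subject, pattern, index):
    """Rough model of Hive's regexp_extract: find the first match, return one group."""
    m = re.search(pattern, subject)
    if m is None:
        # Assumption of this sketch: no match yields an empty string.
        return ""
    return m.group(index)

# Same inputs as the Hive session above.
print(regexp_extract("IloveYou", "I(.*?)(You)", 1))
print(regexp_extract("IloveYou", "I(.*?)(You)", 2))
print(regexp_extract("IloveYou", "(I)(.*?)(You)", 1))
print(regexp_extract("IloveYou", "(I)(.*?)(You)", 0))
```

Adding the parentheses around I shifts every later group by one, which is why index 1 returns "love" with the first pattern and "I" with the second.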

 

 


 

 

 

 

3.regexp_replace

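Syntax: regexp_replace(string A, string B, string C)
Return type: string
Description: Replaces every part of string A that matches the Java regular expression B with C. Note that some characters must be escaped in certain cases. The function is similar to regexp_replace in Oracle.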

hive> select regexp_replace("IloveYou","You","") from test1 limit 1;

Total jobs = 1

...

OK

Ilove

Time taken: 26.063 seconds, Fetched: 1 row(s)

 

 

hive> select regexp_replace("IloveYou","You","lili") from test1 limit 1;

Total jobs = 1

...

OK

Ilovelili
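
The note about escape characters is easiest to see with a metacharacter such as the period. The sketch below models regexp_replace with Python's re.sub as a stand-in for Hive's Java regex engine. It treats C as literal text, so group references in the replacement (such as $1 in Java) are deliberately not modeled, and the "a.b.c" input is a made-up example.

```python
import re

def regexp_replace(a, b, c):
    """Rough model of Hive's regexp_replace: replace every match of regex b in a with c."""
    # A callable replacement keeps c literal and avoids Python's own
    # backslash handling in replacement strings.
    return re.sub(b, lambda _match: c, a)

# Same inputs as the Hive session above.
print(regexp_replace("IloveYou", "You", ""))
print(regexp_replace("IloveYou", "You", "lili"))

# An unescaped period matches any character; an escaped one matches a literal period.
# In a Hive string literal the escaped form would be written '\\.'.
print(regexp_replace("a.b.c", ".", "-"))
print(regexp_replace("a.b.c", r"\.", "-"))
```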

 

 

 

 

 

 

Regular expressions in Hive
For an example, see https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Search that page for "regex" to find the following:


CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;


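Below is a simpler regular expression for Hive. It has 9 capture groups, so the table definition must declare 9 columns as well.

The raw regular expression:
([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"

The same expression as a Hive property, with the backslashes doubled and the inner double quotes escaped:
"input.regex" = "([^ ]*) ([^ ]*) ([^.]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*) \"(.*)\" \"(.*)\""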
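
To check that nine capture groups line up with nine columns, a cleaned-up raw form of the pattern above can be tried against a single log line before it goes into a table definition. The sketch below does this in Python as a stand-in for Hive's Java regex engine. The sample log line is invented, the column names are borrowed from the apachelog table, and requiring the whole line to match is an assumption about how the SerDe applies the pattern.

```python
import re

# Raw form of the nine-group pattern (what the regex engine sees after
# Hive has processed the escapes in the string literal).
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"'
)

# Column names taken from the apachelog table definition, one per group.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

def parse_line(line):
    """Return a column -> value dict, or None when the line does not match."""
    m = LOG_PATTERN.fullmatch(line)  # assumption: the whole line must match
    if m is None:
        return None
    return dict(zip(COLUMNS, m.groups()))

# Hypothetical log line, made up for this example.
sample = ('192.168.1.10 - alice [12/Mar/2017:10:15:32 +0800] '
          '"GET /index.html HTTP/1.1" 200 1024 '
          '"http://example.com/" "Mozilla/5.0"')

row = parse_line(sample)
for name in COLUMNS:
    print(name, "=", row[name])
```

If the pattern's group count and the column count disagree, a check like this surfaces the mismatch immediately instead of at query time.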

Data sources:
yarn-root-nodemanager-master.log

yarn-spark-nodemanager-master.log
yarn-hadoop-nodemanager-master.log

 


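There is also a handy tool for working with regular expressions:
RegexBuddy.exe

It is easy to use: just double-click it to launch.

The colored highlighting in RegexBuddy's test pane indicates that the regular expression under test matches correctly.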

 

