說在前面的話
hive的正則表達式,是非常重要!作為大數據開發人員,用好hive,正則表達式,是必須品!
Hive中的正則表達式還是很強大的。數據工作者平時也離不開正則表達式。對此,特意做了個hive正則表達式的小結。所有代碼都經過親測,正常運行。
1.regexp
語法: A REGEXP B
操作類型: strings
描述: 功能與RLIKE相同
select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d{8}'
與下面查詢的效果是等效的:
select count(*) from olap_b_dw_hotelorder_f where create_date_wid not rlike '\\d{8}';
2.regexp_extract
語法: regexp_extract(string subject, string pattern, int index)
返回值: string
說明:將字符串subject按照pattern正則表達式的規則拆分,返回index指定的字符。
hive> select regexp_extract('IloveYou','I(.*?)(You)',1) from test1 limit 1;
Total jobs = 1
...
Total MapReduce CPU Time Spent: 7 seconds 340 msec
OK
love
Time taken: 28.067 seconds, Fetched: 1 row(s)
hive> select regexp_extract('IloveYou','I(.*?)(You)',2) from test1 limit 1;
Total jobs = 1
...
OK
You
Time taken: 26.067 seconds, Fetched: 1 row(s)
hive> select regexp_extract('IloveYou','(I)(.*?)(You)',1) from test1 limit 1;
Total jobs = 1
...
OK
I
Time taken: 26.057 seconds, Fetched: 1 row(s)
hive> select regexp_extract('IloveYou','(I)(.*?)(You)',0) from test1 limit 1;
Total jobs = 1
...
OK
IloveYou
Time taken: 28.06 seconds, Fetched: 1 row(s)
hive> select regexp_replace("IloveYou","You","") from test1 limit 1;
Total jobs = 1
...
OK
Ilove
Time taken: 26.063 seconds, Fetched: 1 row(s)
3.regexp_replace
語法: regexp_replace(string A, string B, string C)
返回值: string
說明:將字符串A中的符合Java正則表達式B的部分替換為C。注意,在有些情況下要使用轉義字符,類似Oracle中的regexp_replace函數。
hive> select regexp_replace("IloveYou","You","") from test1 limit 1;
Total jobs = 1
...
OK
Ilove
Time taken: 26.063 seconds, Fetched: 1 row(s)
hive> select regexp_replace("IloveYou","You","lili") from test1 limit 1;
Total jobs = 1
...
OK
Ilovelili
Hive里的正則表達式
如,https://cwiki.apache.org/confluence/display/Hive/GettingStarted
輸入regex可查到
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^]*) \[()\] ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
下面就是hive里的正則表達式,9個字段,對應定義那邊也要9個
"input.regex" = "([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[(0-9]*) "(.*)" "(.*)""
([^ ]*) ([^ ]*) ([^.]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[(0-9]*) "(.*)" "(.*)"
([^ ]*) ([^ ]*) ([^.]*) \\[(.*)\\] "(.*)" (-|[0-9]*) (-|[(0-9]*) \"(.*)\" \"(.*)\"
數據來源,
yarn-root-nodemanager-master.log
或
yarn-spark-nodemanager-master.log
yarn-hadoop-nodemanager-master.log
這里,有個正則表達式的好工具!
RegexBuddy.exe
很好用的這款軟件!雙擊它即可。
如上圖所示顏色,代表我們測試的正則表達式,是正確的!