Extracting the domain name from a string in Hive SQL

The requirement:

I want to extract the visited domain from a field whose values look like this:

"GET http://suo.im/4xhnBL HTTP/1.1"
"CONNECT sapi.ads.544.com:443 HTTP/1.1"
"GET http://100.110.1.52:8080/job/1/buildWithParameters?token=TOKEN&buildParams=%7B%22lintDir%22:%20%22 HTTP/1.1"	
GET http://txmov2.a.yxis.com/u/BMjAxODEwjUzNDdfM18z_B36199d79e3.mp4?tag=1-1598-unknown4849&di=d4402759&bp=10000 HTTP/1.1

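My first idea was a regex anchored on http, along the lines of this one (regex reference: https://blog.csdn.net/yong472727322/article/details/73321935):

(http|https)*://(www.)?(\w+(\.)?)+

It cannot find the domain in a string like the following, which has no scheme at all:

"CONNECT sapi.ads.oppomobile.com:443 HTTP/1.1"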

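A quick way to see the gap is to run the http-anchored pattern over one GET line and the CONNECT line. The sketch below uses Python's re module as a stand-in for the Java regex engine behind Hive (an assumption that should hold for this simple pattern, not a general guarantee). The names HTTP_ONLY, samples and first_match are illustrative only.

```python
import re

# The http-anchored pattern from the text, unchanged.
HTTP_ONLY = r"(http|https)*://(www.)?(\w+(\.)?)+"

samples = [
    '"GET http://suo.im/4xhnBL HTTP/1.1"',
    '"CONNECT sapi.ads.oppomobile.com:443 HTTP/1.1"',
]


def first_match(pattern, text):
    """Return the whole matched text, or None when the pattern finds nothing."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


results = [first_match(HTTP_ONLY, s) for s in samples]
for sample, result in zip(samples, results):
    print(result, "<-", sample)
```

The GET line yields a match that still includes the scheme, while the CONNECT line yields nothing because the pattern insists on a literal "://".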
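Hive's regexp_extract function turned out to be a good fit:

Regular-expression extraction function: regexp_extract

Syntax: regexp_extract(string subject, string pattern, int index)

Return type: string

Description: matches the string subject against the regular expression pattern and returns the part of the match selected by index.

The third argument:

0 returns the entire matched string

1 returns the contents of the first pair of parentheses (capture group 1)

2 returns the contents of the second pair of parentheses (capture group 2)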

hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from dual;
          the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from dual;
         bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from dual;
        foothebar
hive> select regexp_extract('中國abc123!','[\\u4e00-\\u9fa5]+',0) from dual; -- handy: matches only the Chinese characters
hive> select regexp_replace('中國abc123','[\\u4e00-\\u9fa5]+','') from dual; -- handy: strips the Chinese characters
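The effect of the index argument, and the results of the two Chinese-character examples, can be checked outside Hive. This is a minimal sketch in Python. The helper regexp_extract is a hypothetical stand-in written for illustration (not Hive's implementation), it assumes an empty string comes back when nothing matches, and it relies on Python's re behaving like Java's regex engine for these particular patterns.

```python
import re


def regexp_extract(subject, pattern, index):
    """Illustrative stand-in: first match of pattern in subject, group `index`.

    Assumption: an empty string is returned when the pattern does not match.
    """
    m = re.search(pattern, subject)
    return m.group(index) if m else ""


# index 1 and 2 pick the parenthesised groups, index 0 the whole match
group1 = regexp_extract("foothebar", r"foo(.*?)(bar)", 1)
group2 = regexp_extract("foothebar", r"foo(.*?)(bar)", 2)
whole = regexp_extract("foothebar", r"foo(.*?)(bar)", 0)

# The character class [\u4e00-\u9fa5] covers the common CJK ideographs.
CJK = "[\\u4e00-\\u9fa5]+"
only_chinese = regexp_extract("中國abc123!", CJK, 0)
without_chinese = re.sub(CJK, "", "中國abc123")

print(group1, group2, whole, only_chinese, without_chinese)
```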

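Note that some characters need escaping. Inside a Hive string literal the backslash of a regex escape must itself be doubled (for example \\s for whitespace or \\. for a literal dot); the pattern is then interpreted by Java's regular-expression rules.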

Reference:

Hive string-handling tips (summary): https://zhuanlan.zhihu.com/p/82601425

The resulting Hive SQL:

            regexp_extract(request, '\\w+ ((http[s]?)?(://))?([^/\\s]*)(/?.*) HTTP/.*', 4) as domain,
            regexp_extract(request, '"(.*?)( )', 1) as method,
            regexp_extract(request, '(HTTP/.*)(")', 1) as protocol,
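To sanity-check the three expressions, the same patterns can be replayed against the quoted sample request lines from the top of the post. The sketch below does this in Python rather than Hive. regexp_extract is a hypothetical helper, and treating Python's re as equivalent to Hive's Java regex engine is an assumption that I believe holds for these patterns only. Python raw strings carry a single backslash where the Hive literals need it doubled.

```python
import re


def regexp_extract(subject, pattern, index):
    """Illustrative stand-in for Hive's regexp_extract (assumes '' on no match)."""
    m = re.search(pattern, subject)
    if m is None:
        return ""
    return m.group(index)


# Same patterns as in the Hive query, written as Python raw strings.
DOMAIN = r"\w+ ((http[s]?)?(://))?([^/\s]*)(/?.*) HTTP/.*"
METHOD = r'"(.*?)( )'
PROTOCOL = r'(HTTP/.*)(")'

requests = [
    '"GET http://suo.im/4xhnBL HTTP/1.1"',
    '"CONNECT sapi.ads.544.com:443 HTTP/1.1"',
    '"GET http://100.110.1.52:8080/job/1/buildWithParameters'
    '?token=TOKEN&buildParams=%7B%22lintDir%22:%20%22 HTTP/1.1"',
]

# One (domain, method, protocol) tuple per request line.
rows = [
    (
        regexp_extract(r, DOMAIN, 4),
        regexp_extract(r, METHOD, 1),
        regexp_extract(r, PROTOCOL, 1),
    )
    for r in requests
]

for row in rows:
    print(row)
```

Group 4 keeps the port when one is present (sapi.ads.544.com:443, 100.110.1.52:8080). If only the host name is wanted, one option would be a further regexp_replace that drops a trailing colon and digits, though the original query does not do this.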

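Pitfall:

Using Hive's parse_url function

Hive's parse_url can only parse standard URLs, so it does not help with a request field like this one, where the URL is embedded in a longer string and the CONNECT lines carry only host:port with no scheme.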

https://blog.csdn.net/oracle8090/article/details/79637982

https://blog.csdn.net/weixin_30861459/article/details/96178140
