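Hive regular expressions
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace the matched substrings, or to extract the substrings that satisfy some condition.
A regular expression is a text pattern made up of ordinary characters and special characters.
Ordinary characters: all uppercase and lowercase letters, all digits, and the punctuation marks and other symbols that carry no special meaning.
Special characters:
^       matches the start of the input string.
$       matches the end of the input string.
[xyz]   character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  negated character set. Matches any character that is not enclosed. For example, '[^abc]' matches 'p', 'l', 'i' and 'n' in "plain".
\d      matches a digit. Equivalent to [0-9].
\D      matches a non-digit. Equivalent to [^0-9].
\w      matches a letter, digit or underscore. Equivalent to '[A-Za-z0-9_]'.
\W      matches anything other than a letter, digit or underscore. Equivalent to '[^A-Za-z0-9_]'.
.       matches any single character except the line terminators (\n, \r).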
show tables 'e.*'; 
select ename from emp;
Find the names that contain IN or AR:
select ename from emp where ename rlike '(IN|AR)';

select hiredate from emp;

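4-1-1 format: return only the dates shaped like 1981-6-9 (four-digit year, one-digit month, one-digit day).
Double backslash: inside a Hive string literal the backslash itself has to be escaped, so \d is written as \\d.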
select hiredate from emp where hiredate rlike '^\\d{4}-\\d-\\d$';
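The effect of the pattern can be checked outside Hive. The sketch below uses Python's re module as a stand-in for the Java regex engine behind rlike; for a pattern this simple the two behave the same. The function name and the sample dates are made up for illustration and are not rows taken from emp.

```python
import re

# What the regex engine receives once Hive has unescaped '^\\d{4}-\\d-\\d$'.
PATTERN = r"^\d{4}-\d-\d$"


def matches_4_1_1(hiredate: str) -> bool:
    """True for a 4-digit year, 1-digit month and 1-digit day."""
    return re.search(PATTERN, hiredate) is not None


# Hypothetical sample values, used only for illustration.
samples = ["1981-6-9", "1981-12-17", "1981-2-20", "1987-4-19"]
selected = [d for d in samples if matches_4_1_1(d)]
print(selected)  # ['1981-6-9']
```

Because of the ^ and $ anchors, a date with a two-digit month or day is rejected even though it contains a shorter piece that looks like the pattern.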
Regular expression replace function:
regexp_replace(stringA,pattern,stringB)
select regexp_replace('foobar','oo|ar','77');
Result: f77b77
select regexp_replace(ename,'IN|AR','99') from emp;
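As a rough local check of the replace semantics, the sketch below mimics regexp_replace with Python's re.sub. The Python function is only an illustration, not Hive's implementation, and it covers literal replacement strings only, since the backreference syntax in the replacement differs between the Java and Python engines.

```python
import re


def regexp_replace(string_a: str, pattern: str, string_b: str) -> str:
    """Replace every match of pattern in string_a with string_b."""
    return re.sub(pattern, string_b, string_a)


print(regexp_replace("foobar", "oo|ar", "77"))  # f77b77
print(regexp_replace("MARTIN", "IN|AR", "99"))  # M99T99
```

Every non-overlapping match is replaced, so a name that contains both AR and IN is changed in two places.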
Extraction function:
regexp_extract(string,pattern,index)
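The index argument selects a capture group: 0 is the whole match, 1 the first parenthesized group, and so on. A minimal sketch of that idea in Python follows; the function below is an illustration rather than Hive's code, and returning None when nothing matches is a choice made for this sketch only, so check what your Hive version returns in that case.

```python
import re
from typing import Optional


def regexp_extract(string: str, pattern: str, index: int) -> Optional[str]:
    """Return capture group `index` of the first match of pattern in string."""
    m = re.search(pattern, string)
    if m is None:
        return None  # no-match behaviour chosen for this sketch only
    return m.group(index)


print(regexp_extract("foothebar", "foo(.*?)(bar)", 1))  # the
print(regexp_extract("foothebar", "foo(.*?)(bar)", 2))  # bar
print(regexp_extract("foothebar", "foo(.*?)(bar)", 0))  # foothebar
```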
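Worked example:
create table IF NOT EXISTS log_source (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
Check whether the column delimiter conflicts with characters inside the field data. With a space as the delimiter it does: fields such as time_local, request and http_user_agent contain spaces themselves, so they are split across several columns.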

CREATE TABLE apachelog (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\"]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (\"-|[^ ]*\") (\"[^ ]*\")"
);
Sample log line:
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /course/view.php?id=27 HTTP/1.1" "303" "440" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"
Each parenthesized group captures one column, in order. Quoted fields that can contain spaces (time_local, request, http_user_agent) use [^\"]* so they run up to the closing quote; request_body is unquoted, so its group is (-|[^ ]*).
This solves the problem of loading data that has a complex format.
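Before creating the table it helps to try the pattern against a sample line. The sketch below does that in Python, as a stand-in for the Java engine used by RegexSerDe. It assumes a form of the pattern in which the fields that may contain spaces are written [^"]* and request_body is (-|[^ ]*); it also assumes the whole line has to match, which is why fullmatch is used. The names LOG_PATTERN, COLUMNS and parse are made up for this example.

```python
import re

# The pattern as the regex engine sees it once the \" escapes are removed.
LOG_PATTERN = re.compile(
    r'("[^ ]*") ("-|[^ ]*") ("[^"]*") ("[^"]*") ("[0-9]*") ("[0-9]*") '
    r'(-|[^ ]*) ("[^ ]*") ("[^"]*") ("-|[^ ]*") ("[^ ]*")'
)

COLUMNS = [
    "remote_addr", "remote_user", "time_local", "request", "status",
    "body_bytes_sent", "request_body", "http_referer", "http_user_agent",
    "http_x_forwarded_for", "host",
]

# The sample log line from the notes.
LINE = (
    '"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" '
    '"GET /course/view.php?id=27 HTTP/1.1" "303" "440" - '
    '"http://www.ibeifeng.com/user.php?act=mycourse" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" '
    '"-" "learn.ibeifeng.com"'
)


def parse(line):
    """Map one log line to a column dict, or None if the line does not match."""
    m = LOG_PATTERN.fullmatch(line)
    if m is None:
        return None
    return dict(zip(COLUMNS, m.groups()))


row = parse(LINE)
for name in COLUMNS:
    print(name, "=", row[name])
```

The captured values keep their surrounding double quotes, because the quotes sit inside the groups.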
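Hive queries
Ways to load data into a Hive table:
1: load from the local file system
load data local inpath 'local Linux file path' into table tbname;
2: load from HDFS
load data inpath 'HDFS file path' into table tbname;
Adding overwrite (load data ... overwrite into table tbname) replaces the existing data.
3: create table ... as select
4: insert
Syntax: insert into table tbname select sql;
create table emp_11 like emp;
insert into table emp_11 select * from emp where deptno=10;
5: hdfs command: put the data file directly into the table's directory. A Hive table corresponds to an HDFS directory.
bin/hdfs dfs -put /home/hadoop/emp.txt /user/hive/warehouse/hadoop29.db/emp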
Exporting data:
1: insert overwrite
Syntax: insert overwrite [local] directory 'path' select sql;
Export to the local file system:
insert overwrite local directory '/home/hadoop/nice' select * from emp where sal >2000;
The default column separator is '\001' (displayed as ^A). To choose another one:
insert overwrite local directory '/home/hadoop/nice' row format delimited fields terminated by '\t' select * from emp where sal >2000;
Export to HDFS:
insert overwrite directory '/nice' row format delimited fields terminated by '\t' select * from emp where sal >2000;
2: bin/hive
bin/hive -help    show the help
bin/hive --database hadoop29
bin/hive -e 'use hadoop29;select * from emp;' > /home/hadoop/nice/emp.txt
Common HQL statements in Hive:
Filtering with where:
select * from emp where sal >2000;
limit:
select * from emp limit 1;
distinct (removes duplicates):
select distinct deptno from emp;
between ... and:
select * from emp where sal between 2000 and 3000;
is null and is not null:
select * from emp where comm is not null;
having (filters the data after grouping):
select deptno,round(avg(sal),2) as avg_sal from emp group by deptno having avg_sal > 2000;
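To see what the having step does, the sketch below recomputes the grouped average in plain Python and then applies the filter to the per-group result. The (deptno, sal) pairs are the salaries that appear in the emp listings elsewhere in these notes; the function name is made up for this example.

```python
from collections import defaultdict

# (deptno, sal) pairs as they appear in the emp listings in these notes.
EMP = [
    (10, 5000.0), (10, 2450.0), (10, 1300.0),
    (20, 3000.0), (20, 3000.0), (20, 2975.0), (20, 1100.0), (20, 800.0),
    (30, 2850.0), (30, 1600.0), (30, 1500.0), (30, 1250.0), (30, 1250.0),
    (30, 950.0),
]


def avg_sal_by_dept(rows, threshold):
    """group by deptno, compute round(avg(sal), 2), keep groups above threshold."""
    groups = defaultdict(list)
    for deptno, sal in rows:
        groups[deptno].append(sal)
    result = {}
    for deptno, sals in sorted(groups.items()):
        avg_sal = round(sum(sals) / len(sals), 2)
        if avg_sal > threshold:  # the having step: filters groups, not rows
            result[deptno] = avg_sal
    return result


print(avg_sal_by_dept(EMP, 2000))  # {10: 2916.67, 20: 2175.0}
```

Department 30 drops out because its average (1566.67) is below the threshold, even though it has an employee earning more than 2000. A where clause on sal would instead filter individual rows before grouping.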
Hive functions
Aggregate functions: count() sum() max() min() avg()
select count(*) from emp;
select count(comm) from emp;    null values are not counted
List all built-in functions:
show functions;
Show information about a function:
desc function sum;
Show detailed information about a function:
desc function extended sum;
desc function extended substr;
Date difference in days: datediff
Get the current timestamp:
select unix_timestamp();
select unix_timestamp('2019-12-12 12:12:12');
Convert a timestamp to a time (minutes are mm and seconds are ss; MM means month):
select from_unixtime(1576123932,'yyyy-MM-dd HH:mm:ss');
Joining tables: join ... on
Inner join:
select e.empno,e.ename,d.deptno,d.dname from emp e join dept d on e.deptno=d.deptno;
Left join:
select e.empno,e.ename,d.deptno,d.dname from emp e left join dept d on e.deptno=d.deptno;
Right join:
select e.empno,e.ename,d.deptno,d.dname from emp e right join dept d on e.deptno=d.deptno;
Full join:
select e.empno,e.ename,d.deptno,d.dname from emp e full join dept d on e.deptno=d.deptno;
Window functions (unlike a group by aggregation, every input row is kept):
Query the employees of all departments, sorted by salary in descending order:
select empno,ename,sal,deptno from emp order by sal desc;
Query the employees of all departments, sorted by salary in descending order within each department, and show each department's highest salary in the last column:
select empno,ename,sal,deptno,max(sal) over(partition by deptno order by sal desc) as max_sal from emp;
You can specify which column to partition by and which column to order by.
empno  ename   sal     deptno  max_sal
7839   KING    5000.0  10      5000.0
7782   CLARK   2450.0  10      5000.0
7934   MILLER  1300.0  10      5000.0
7788   SCOTT   3000.0  20      3000.0
7902   FORD    3000.0  20      3000.0
7566   JONES   2975.0  20      3000.0
7876   ADAMS   1100.0  20      3000.0
7369   SMITH   800.0   20      3000.0
7698   BLAKE   2850.0  30      2850.0
7499   ALLEN   1600.0  30      2850.0
7844   TURNER  1500.0  30      2850.0
7654   MARTIN  1250.0  30      2850.0
7521   WARD    1250.0  30      2850.0
7900   JAMES   950.0   30      2850.0
Query the employees of all departments, sorted by salary in descending order within each department, and show a unique sequence number within each department in the last column:
select empno,ename,sal,deptno,row_number() over(partition by deptno order by sal desc) as rn from emp;
empno  ename   sal     deptno  rn
7839   KING    5000.0  10      1
7782   CLARK   2450.0  10      2
7934   MILLER  1300.0  10      3
7788   SCOTT   3000.0  20      1
7902   FORD    3000.0  20      2
7566   JONES   2975.0  20      3
7876   ADAMS   1100.0  20      4
7369   SMITH   800.0   20      5
7698   BLAKE   2850.0  30      1
7499   ALLEN   1600.0  30      2
7844   TURNER  1500.0  30      3
7654   MARTIN  1250.0  30      4
7521   WARD    1250.0  30      5
7900   JAMES   950.0   30      6
rank(): equal salaries share a rank and the ranks after a tie are skipped:
select empno,ename,sal,deptno,rank() over(partition by deptno order by sal desc) as rn from emp;
empno  ename   sal     deptno  rn
7839   KING    5000.0  10      1
7782   CLARK   2450.0  10      2
7934   MILLER  1300.0  10      3
7788   SCOTT   3000.0  20      1
7902   FORD    3000.0  20      1
7566   JONES   2975.0  20      3
7876   ADAMS   1100.0  20      4
7369   SMITH   800.0   20      5
7698   BLAKE   2850.0  30      1
7499   ALLEN   1600.0  30      2
7844   TURNER  1500.0  30      3
7654   MARTIN  1250.0  30      4
7521   WARD    1250.0  30      4
7900   JAMES   950.0   30      6
dense_rank(): equal salaries share a rank and no rank is skipped:
select empno,ename,sal,deptno,dense_rank() over(partition by deptno order by sal desc) as rn from emp;
empno  ename   sal     deptno  rn
7839   KING    5000.0  10      1
7782   CLARK   2450.0  10      2
7934   MILLER  1300.0  10      3
7788   SCOTT   3000.0  20      1
7902   FORD    3000.0  20      1
7566   JONES   2975.0  20      2
7876   ADAMS   1100.0  20      3
7369   SMITH   800.0   20      4
7698   BLAKE   2850.0  30      1
7499   ALLEN   1600.0  30      2
7844   TURNER  1500.0  30      3
7654   MARTIN  1250.0  30      4
7521   WARD    1250.0  30      4
7900   JAMES   950.0   30      5
The two highest-paid employees of each department:
select empno,ename,sal,deptno from (select empno,ename,sal,deptno,row_number() over(partition by deptno order by sal desc) as rn from emp) temp where rn <=2;
Big data, historical data
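The three numbering functions differ only in how they treat ties. The sketch below reproduces the three numberings for one partition in plain Python, using the department 20 salaries from the listings above. It is a hand-written illustration of the semantics, not how Hive computes them, and the function names simply mirror the SQL ones.

```python
def row_number(sorted_vals):
    """Consecutive numbers 1..n; ties still get different numbers."""
    return list(range(1, len(sorted_vals) + 1))


def rank(sorted_vals):
    """Ties share a rank; the ranks after a tie are skipped."""
    ranks = []
    for i, v in enumerate(sorted_vals):
        if i > 0 and v == sorted_vals[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks


def dense_rank(sorted_vals):
    """Ties share a rank; no rank is skipped."""
    ranks = []
    for i, v in enumerate(sorted_vals):
        if i == 0:
            ranks.append(1)
        elif v == sorted_vals[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


# Salaries of department 20, already ordered by sal desc.
dept20 = [3000.0, 3000.0, 2975.0, 1100.0, 800.0]
print(row_number(dept20))  # [1, 2, 3, 4, 5]
print(rank(dept20))        # [1, 1, 3, 4, 5]
print(dense_rank(dept20))  # [1, 1, 2, 3, 4]
```

For a top-N per group query, row_number() returns exactly N rows per partition, while rank() or dense_rank() can return more when salaries tie at the cutoff.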