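The CDH QuickStart VM packages the complete Hadoop service ecosystem on a single node and can be downloaded from https://www.cloudera.com/downloads/quickstart_vms/5-13.html.
The services running on the node are as follows (Cloudera Navigator is not included):
To learn the full Hadoop ecosystem, a server with at least 8 cores and 32 GB of RAM is best; 4 cores and 16 GB can run it, but only barely (ideally use two or more nodes).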
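Impala is written in C++ (Spark, by comparison, is written in Scala) and uses an MPP architecture (similar to MariaDB ColumnStore, formerly InfiniDB). It consists of the following components:
Hue is a web-based query editor with syntax hints; it can query Impala, HDFS, and HBase. As shown below:
The Impala server side consists of the Impala Daemon (executes SQL), the Impala Statestore (monitors the health of the Daemons), and the Impala Catalog (relays DDL changes to the Daemon nodes, so DDL run through Impala no longer needs a follow-up REFRESH/INVALIDATE METADATA; changes made through Hive still do). impala-shell is similar to the mysql client and is used to execute SQL.
Impala uses the same metadata as Hive. It can be stored in MySQL or PostgreSQL and is called the metastore.
Impala uses HDFS as its primary storage layer and relies on HDFS redundancy.
Impala also supports HBase as storage: by defining tables that map to HBase tables, it can query HBase tables and join HBase tables with Impala tables.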
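Impala can be started from Cloudera Manager or from the command line:
Starting from Cloudera Manager:
Starting from the command line (CM cannot monitor the status of services started this way, and the resulting processes differ slightly):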
service impala-state-store start/restart/stop
service impala-catalog start/restart/stop
service impala-server start/restart/stop
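When the command-line start is scripted, a small helper can emit the three service commands in a fixed order. The following is only a minimal Python sketch: the function name `service_commands` is hypothetical, and the ordering (statestore, then catalog, then daemon for start/restart, reversed for stop) is an assumption based on the order in which the commands are listed above. It prints the commands instead of executing them.

```python
# Service names as used by the init scripts listed above.
IMPALA_SERVICES = ["impala-state-store", "impala-catalog", "impala-server"]
VALID_ACTIONS = {"start", "restart", "stop"}


def service_commands(action):
    """Return the `service` command lines for all three Impala services.

    Assumption: start/restart go statestore -> catalog -> daemon,
    and stop runs in the reverse order.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    names = IMPALA_SERVICES if action != "stop" else list(reversed(IMPALA_SERVICES))
    return [f"service {name} {action}" for name in names]


if __name__ == "__main__":
    for line in service_commands("start"):
        print(line)
```

The printed lines can be pasted into a root shell on the VM, or passed to `subprocess.run` one by one if the script itself runs as root.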
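Processes after starting through CM:
Log files are located in /var/log/impala:
Configuration can be changed through CM or by editing the configuration file /etc/default/impala.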
Impala client
[root@quickstart impala]# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.10.0-cdh5.13.0 (2511805) built on Wed Oct 4 10:55:37 PDT 2017)
The HISTORY command lists all shell commands in chronological order.
***********************************************************************************
[quickstart.cloudera:21000] >
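A fresh Impala instance contains two databases: default (the default database for newly created tables) and _impala_builtins.
Database information can be inspected with show databases / show tables / select version() (the syntax is compatible with SQL-92 and MySQL, the reference implementation for most NoSQL systems, unlike Oracle).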
[quickstart.cloudera:21000] > select version();
Query: select version()
Query submitted at: 2019-04-06 04:59:25 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=ef4cbbf93a7662e5:f4e103f500000000
+-------------------------------------------------------------------------------------------+
| version()                                                                                 |
+-------------------------------------------------------------------------------------------+
| impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) |
| Built on Wed Oct 4 10:55:37 PDT 2017                                                      |
+-------------------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.34s
[quickstart.cloudera:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.01s
[quickstart.cloudera:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| tab1 |
| tab2 |
| tab3 |
+------+
Fetched 3 row(s) in 0.02s
[quickstart.cloudera:21000] > select * from tab1;
Query: select * from tab1
Query submitted at: 2019-04-06 05:05:57 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=dd4debde8a589622:81983b200000000
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Fetched 5 row(s) in 6.06s
-- The first access is especially slow because the table has to be loaded into memory first
[quickstart.cloudera:21000] > select * from tab1;
Query: select * from tab1
Query submitted at: 2019-04-06 05:06:05 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=21486d50db995bf2:a2d061b400000000
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Fetched 5 row(s) in 0.26s
[quickstart.cloudera:21000] > desc tab1;
Query: describe tab1
+-------+-----------+---------+
| name  | type      | comment |
+-------+-----------+---------+
| id    | int       |         |
| col_1 | boolean   |         |
| col_2 | double    |         |
| col_3 | timestamp |         |
+-------+-----------+---------+
Fetched 4 row(s) in 0.03s
[quickstart.cloudera:21000] > select count(1) from tab1;
Query: select count(1) from tab1
Query submitted at: 2019-04-06 05:06:28 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9141cf84d4efd2e3:e667f1d00000000
+----------+
| count(1) |
+----------+
| 5        |
+----------+
Fetched 1 row(s) in 0.17s
[quickstart.cloudera:21000] > create database my_first_impala_db;
Query: create database my_first_impala_db
Fetched 0 row(s) in 0.08s
[quickstart.cloudera:21000] > create table t1 (x int);
Query: create table t1 (x int)
Fetched 0 row(s) in 0.08s
[quickstart.cloudera:21000] > insert into t1 values (1), (3), (2), (4); -- MySQL-style multi-row VALUES is supported
Query: insert into t1 values (1), (3), (2), (4)
Query submitted at: 2019-04-06 05:07:32 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8c41d97919423a19:1f7aeeeb00000000
Modified 4 row(s) in 5.17s
[quickstart.cloudera:21000] > insert into t1 select * from t1; -- INSERT ... SELECT is supported
Query: insert into t1 select * from t1
Query submitted at: 2019-04-06 05:07:45 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9c452b951b08da49:92d9fee100000000
Modified 4 row(s) in 0.23s
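The interactive session above can also be driven from a script, since impala-shell accepts a query on the command line. The sketch below is a minimal Python wrapper under stated assumptions: the function names are hypothetical, the default address is the QuickStart host shown in the prompt, and the output is split naively on commas, so it is not suitable for values that themselves contain commas. It relies on the impala-shell options `-i` (impalad address), `-q` (query), `-d` (database), `-B` (delimited output), and `--output_delimiter`.

```python
import subprocess


def impala_shell_argv(query, impalad="quickstart.cloudera:21000", database=None):
    """Build an impala-shell command line that runs one query and exits."""
    argv = ["impala-shell", "-i", impalad, "-B", "--output_delimiter=,", "-q", query]
    if database:
        argv += ["-d", database]
    return argv


def run_query(query, **kwargs):
    """Run the query and return rows as lists of strings.

    Requires impala-shell on PATH and a reachable impalad.
    """
    result = subprocess.run(
        impala_shell_argv(query, **kwargs),
        capture_output=True, text=True, check=True,
    )
    # Naive CSV handling: adequate for simple numeric/boolean columns only.
    return [line.split(",") for line in result.stdout.splitlines() if line]
```

On the VM, `run_query("select count(1) from tab1")` would be the scripted counterpart of the count query in the session above.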
Create tables backed by HDFS files (the HDFS files must be created and loaded with data first; see the Hadoop-HDFS study notes):
[quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab2;
Query: drop TABLE IF EXISTS tab2
[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE tab2
> (
> id INT,
> col_1 BOOLEAN,
> col_2 DOUBLE
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/user/cloudera/sample_data/tab2';
Query: create EXTERNAL TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2'
Fetched 0 row(s) in 0.09s
[quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab3;
Query: drop TABLE IF EXISTS tab3
[quickstart.cloudera:21000] > CREATE TABLE tab3
> (
> id INT,
> col_1 BOOLEAN,
> col_2 DOUBLE,
> month INT,
> day INT
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Query: create TABLE tab3
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched 0 row(s) in 0.09s
A SQL query that combines a join, a subquery, and an aggregation:
SELECT tab2.*
FROM tab2,
     (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
      FROM tab2, tab1
      WHERE tab1.id = tab2.id
      GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
View the execution plan of the query:
explain SELECT tab2.*
FROM tab2,
     (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
      FROM tab2, tab1
      WHERE tab1.id = tab2.id
      GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=136.00MB                                 |
| Per-Host Resource Estimates: Memory=4.34GB                                         |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.tab1, default.tab2                                                         |
|                                                                                    |
| PLAN-ROOT SINK                                                                     |
| |                                                                                  |
| 10:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 05:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: tab2.col_2 = max(tab2.col_2)                                   |
| |  runtime filters: RF000 <- max(tab2.col_2)                                       |
| |                                                                                  |
| |--09:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  08:AGGREGATE [FINALIZE]                                                         |
| |  |  output: max:merge(tab2.col_2)                                                |
| |  |  group by: tab1.col_1                                                         |
| |  |                                                                               |
| |  07:EXCHANGE [HASH(tab1.col_1)]                                                  |
| |  |                                                                               |
| |  04:AGGREGATE [STREAMING]                                                        |
| |  |  output: max(tab2.col_2)                                                      |
| |  |  group by: tab1.col_1                                                         |
| |  |                                                                               |
| |  03:HASH JOIN [INNER JOIN, BROADCAST]                                            |
| |  |  hash predicates: tab2.id = tab1.id                                           |
| |  |  runtime filters: RF001 <- tab1.id                                            |
| |  |                                                                               |
| |  |--06:EXCHANGE [BROADCAST]                                                      |
| |  |  |                                                                            |
| |  |  02:SCAN HDFS [default.tab1]                                                  |
| |  |     partitions=1/1 files=1 size=192B                                          |
| |  |                                                                               |
| |  01:SCAN HDFS [default.tab2]                                                     |
| |     partitions=1/1 files=1 size=158B                                             |
| |     runtime filters: RF001 -> tab2.id                                            |
| |                                                                                  |
| 00:SCAN HDFS [default.tab2]                                                        |
|    partitions=1/1 files=1 size=158B                                                |
|    runtime filters: RF000 -> tab2.col_2                                            |
+------------------------------------------------------------------------------------+
Fetched 41 row(s) in 0.05s
Create a table on top of Parquet files and convert it into an internal partitioned table
[quickstart.cloudera:21000] > USE airlines_data;
Query: use airlines_data
[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
> LIKE PARQUET
> 'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
> STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines'
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/impala/staging'
Fetched 0 row(s) in 0.82s
[quickstart.cloudera:21000] > SHOW TABLE STATS airlines_external;
Query: show TABLE STATS airlines_external
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                     |
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
| -1    | 8      | 1.34GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines |
+-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
Fetched 1 row(s) in 3.89s
[quickstart.cloudera:21000] > SHOW FILES IN airlines_external;
Query: show FILES IN airlines_external
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
| Path                                                                                                                  | Size     | Partition |
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB  |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq  | 252.84MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq  | 63.92MB  |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB |           |
| hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB |           |
+-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
Fetched 8 row(s) in 0.04s
[quickstart.cloudera:21000] > DESCRIBE airlines_external;
Query: describe airlines_external
+---------------------+--------+-----------------------------+
| name                | type   | comment                     |
+---------------------+--------+-----------------------------+
| year                | int    | Inferred from Parquet file. |
| month               | int    | Inferred from Parquet file. |
| day                 | int    | Inferred from Parquet file. |
| dayofweek           | int    | Inferred from Parquet file. |
| dep_time            | int    | Inferred from Parquet file. |
| crs_dep_time        | int    | Inferred from Parquet file. |
-- Single-table queries are reasonably fast...
[quickstart.cloudera:21000] > SELECT COUNT(*) FROM airlines_external;
Query: select COUNT(*) FROM airlines_external
Query submitted at: 2019-04-06 07:08:33 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=a04cd594518cba3a:1666b4e00000000
+-----------+
| count(*)  |
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 0.33s
[quickstart.cloudera:21000] > SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
> NDV(origin), NDV(dest) FROM airlines_external;
Query: select NDV(carrier), NDV(flight_num), NDV(tail_num),
NDV(origin), NDV(dest) FROM airlines_external
Query submitted at: 2019-04-06 07:08:53 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=df4776d3fa8c1f69:e4e09e6100000000
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29           | 8463            | 3             | 342         | 349       |
+--------------+-----------------+---------------+-------------+-----------+
Fetched 1 row(s) in 9.33s
[quickstart.cloudera:21000] > SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
> GROUP BY tail_num;
Query: select tail_num, COUNT(*) AS howmany FROM airlines_external
GROUP BY tail_num
Query submitted at: 2019-04-06 07:09:19 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=7d43af89d18c600e:bc464f0f00000000
+----------+-----------+
| tail_num | howmany   |
+----------+-----------+
| 715      | 1         |
| 0        | 406405    |
| 112      | 6562      |
| NULL     | 123122001 |
+----------+-----------+
Fetched 4 row(s) in 2.42s
-- Joining the large table against itself fails with an out-of-memory error
[quickstart.cloudera:21000] > SELECT DISTINCT dest FROM airlines_external
> WHERE dest NOT IN (SELECT origin FROM airlines_external);
Query: select DISTINCT dest FROM airlines_external
WHERE dest NOT IN (SELECT origin FROM airlines_external)
Query submitted at: 2019-04-06 07:09:35 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=314343c761a55f97:61ce9aa500000000
WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002
Memory left in process limit: -328.00 KB
Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
  Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB
  Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
    EXCHANGE_NODE (id=7): Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=0
    PLAN_ROOT_SINK: Total=0 Peak=0
    CodeGen: Total=0 Peak=0
  Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB
    AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB
      Exprs: Total=21.12 KB Peak=21.12 KB
    EXCHANGE_NODE (id=5): Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=0
    DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB
    CodeGen: Total=1.86 KB Peak=347.00 KB
  Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB
    AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
    HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB
      Exprs: Total=21.12 KB Peak=21.12 KB
      Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB
        Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB
    HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB
    EXCHANGE_NODE (id=4): Total=0 Peak=0
      DataStreamRecvr: Total=0 Peak=11.65 MB
    DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB
    CodeGen: Total=12.80 KB Peak=2.00 MB
  Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB
    HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB
    DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB
    CodeGen: Total=0 Peak=49.00 KB
Process: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB
  Buffer Pool: Free Buffers: Total=260.00 MB
  Buffer Pool: Clean Pages: Total=40.00 MB
  Buffer Pool: Unused Reservation: Total=-300.00 MB
  RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
  RequestPool=root.root: Total=0 Peak=139.93 MB
  RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB
    Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB
  RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB
    Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
  Untracked Memory: Total=89.56 MB
WARNING: The following tables are missing relevant table and/or column statistics.
airlines_data.airlines_external
Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002
-- the identical memory dump is printed a second time here, ending with: (1 of 3 similar)
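The figures in the dump can be checked by hand: the impalad process limit is 512.00 MB while the tracked total is 512.32 MB, so the process is roughly 0.32 MB over its limit, and the hash join node alone holds a 340 MB reservation. The sketch below is a small Python illustration of that arithmetic, not an Impala tool; the helper names are hypothetical, and it assumes binary units (1 KB = 1024 B) and the two-decimal rounding used in the dump.

```python
import re

# Assumption: the dump uses binary units.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(text):
    """Convert a size such as '512.32 MB' to bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def process_headroom(line):
    """Return Limit - Total in bytes for a 'Process: ...' line of the dump."""
    match = re.search(r"Limit=([\d.]+ [KMG]?B) Total=([\d.]+ [KMG]?B)", line)
    if match is None:
        raise ValueError("no Limit=/Total= pair found")
    return to_bytes(match.group(1)) - to_bytes(match.group(2))


line = "Process: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB"
headroom_kb = process_headroom(line) / 1024
print(f"headroom: {headroom_kb:.2f} KB")
```

The result is about -328 KB, which is consistent with the "Memory left in process limit: -328.00 KB" line (the dump's own inputs are rounded, so an exact match is not expected). On a VM with a 512 MB process limit, a self-join over 123 million rows is therefore unlikely to fit unless the daemon is given more memory.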
The progress of running statements can be viewed at impala-server:25000:
CREATE TABLE airlines_data.airlines
  (month INT, day INT, dayofweek INT, dep_time INT, crs_dep_time INT,
   arr_time INT, crs_arr_time INT, carrier STRING, flight_num INT,
   actual_elapsed_time INT, crs_elapsed_time INT, airtime INT, arrdelay INT,
   depdelay INT, origin STRING, dest STRING, distance INT, taxi_in INT,
   taxi_out INT, cancelled INT, cancellation_code STRING, diverted INT,
   carrier_delay INT, weather_delay INT, nas_delay INT, security_delay INT,
   late_aircraft_delay INT)
  PARTITIONED BY (year INT)
  STORED AS PARQUET;
INSERT INTO airlines_data.airlines PARTITION (year)
  SELECT month, day, dayofweek, dep_time, crs_dep_time, arr_time, crs_arr_time,
         carrier, flight_num, actual_elapsed_time, crs_elapsed_time, airtime,
         arrdelay, depdelay, origin, dest, distance, taxi_in, taxi_out,
         cancelled, cancellation_code, diverted, carrier_delay, weather_delay,
         nas_delay, security_delay, late_aircraft_delay, year
  FROM airlines_data.airlines_external
  LIMIT 1000000; -- inserting the full ~100 million rows runs out of memory
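In the INSERT above the partition key `year` appears last in the SELECT list: for a dynamic partition insert, Impala takes the partition key values from the trailing columns of the select list. The Python sketch below generates such a statement from a column list so that the ordering cannot be gotten wrong by hand. The function name and its parameters are hypothetical, and it does no quoting or validation of identifiers.

```python
def dynamic_partition_insert(target, source, columns, partition_cols, limit=None):
    """Build INSERT ... PARTITION (...) SELECT with partition columns last.

    columns: all column names of the source table, in any order.
    partition_cols: partition key columns of the target, in PARTITION order.
    """
    data_cols = [c for c in columns if c not in partition_cols]
    select_list = ", ".join(data_cols + list(partition_cols))
    sql = (
        f"INSERT INTO {target} PARTITION ({', '.join(partition_cols)}) "
        f"SELECT {select_list} FROM {source}"
    )
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


# Shortened column list for illustration; the real table has 28 columns.
print(dynamic_partition_insert(
    "airlines_data.airlines",
    "airlines_data.airlines_external",
    ["year", "month", "day", "carrier"],
    ["year"],
    limit=1000000,
))
```

Note that the source table in this walkthrough also has columns (such as `tail_num`) that the target table does not, so in practice the column list passed in should be the target table's columns rather than everything in the source.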
[quickstart.cloudera:21000] > SHOW TABLE STATS airlines;
Query: show TABLE STATS airlines
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
| year  | #Rows   | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                                |
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
| 1990  | 897600  | 1      | 6.62MB   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=1990 |
| 2002  | 49152   | 1      | 527.13KB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2002 |
| 2003  | 53248   | 1      | 603.76KB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2003 |
| Total | 1000000 | 3      | 7.73MB   | 0B           |                   |         |                   |                                                                                         |
+-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
Fetched 4 row(s) in 0.14s
[quickstart.cloudera:21000] > COMPUTE INCREMENTAL STATS airlines;
Query: compute INCREMENTAL STATS airlines
WARNINGS: No partitions selected for incremental stats update
Fetched 0 row(s) in 0.01s
[quickstart.cloudera:21000] > COMPUTE STATS airlines;
Query: compute STATS airlines
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 3 partition(s) and 27 column(s). |
+------------------------------------------+
Fetched 1 row(s) in 2.02s
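The boxed tables that impala-shell prints are easy to post-process when the per-partition numbers need checking. The following Python sketch parses such a table into dictionaries and verifies that the partition row counts add up to the Total row. The function name is hypothetical, the sample is a trimmed copy of the SHOW TABLE STATS output above (only the first four columns), and the parser assumes no cell contains a `|` character.

```python
def parse_shell_table(text):
    """Parse impala-shell's boxed table output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip borders (+---+), prompts and status lines
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


SAMPLE = """
+-------+---------+--------+----------+
| year  | #Rows   | #Files | Size     |
+-------+---------+--------+----------+
| 1990  | 897600  | 1      | 6.62MB   |
| 2002  | 49152   | 1      | 527.13KB |
| 2003  | 53248   | 1      | 603.76KB |
| Total | 1000000 | 3      | 7.73MB   |
+-------+---------+--------+----------+
"""

stats = parse_shell_table(SAMPLE)
partitions = [row for row in stats if row["year"] != "Total"]
total = next(row for row in stats if row["year"] == "Total")
print(sum(int(row["#Rows"]) for row in partitions) == int(total["#Rows"]))
```

For larger scripts, running impala-shell with `-B` (delimited output) avoids parsing the box drawing altogether.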
TPC-DS: https://github.com/cloudera/impala-tpcds-kit/tree/master/tpcds-gen (generates the TPC-DS test data set and also includes the TPC-DS queries; it can generate data at the 10 TB scale)
The customer data for the officially bundled tpcds-kit is no longer available (mainly because the link http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp is dead); the script is at https://github.com/sleberknight/impalascripts-0.6/blob/master/tpcds-setup.sh.
Common problems:
When creating files with Hadoop, the error "Cannot create directory /user/hadoop/input. Name node is in safe mode." appears. Cause and solution:
Solution: https://www.waitig.com/hadoop-name-node-is-in-safe-mode.html
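The linked article is not reproduced here. One commonly used approach, shown as a hedged Python sketch, is to ask the NameNode for its safe mode state with `hdfs dfsadmin -safemode get` and, if it is on, leave it with `hdfs dfsadmin -safemode leave`. The function names are hypothetical, and the sketch assumes it runs as a user with HDFS superuser rights. The NameNode normally leaves safe mode on its own once enough blocks have been reported after startup, so forcing it out is only appropriate when the cause is understood.

```python
import subprocess


def safemode_is_on(status_text):
    """Interpret the text printed by `hdfs dfsadmin -safemode get`."""
    return "safe mode is on" in status_text.lower()


def leave_safemode_if_needed():
    """Leave safe mode if the NameNode reports it is on.

    Requires the hdfs CLI on PATH and HDFS superuser rights.
    Returns True if a leave command was issued.
    """
    status = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    if safemode_is_on(status):
        subprocess.run(["hdfs", "dfsadmin", "-safemode", "leave"], check=True)
        return True
    return False
```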
[cloudera@quickstart ~]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
put: Permission denied: user=cloudera, access=WRITE, inode="/user/cloudera/sample_data/tab1":hdfs:cloudera:drwxr-xr-x
Switching to the hdfs user resolves this:
-bash-4.1$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
-bash-4.1$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
-bash-4.1$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r-- 1 hdfs cloudera 192 2019-04-05 23:06 /user/cloudera/sample_data/tab1/tab1.csv
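The "Permission denied" message above already contains everything needed to see why the write failed: the inode is owned by `hdfs`, its group is `cloudera`, and the mode is `drwxr-xr-x`, so only the owner has write access. The Python sketch below replays that check. It is a simplified model of the POSIX-style rule HDFS applies (owner class first, then group, then other); the function name is hypothetical, and the HDFS superuser bypass and ACLs are deliberately left out.

```python
def hdfs_can_access(user, user_groups, owner, group, mode, wanted):
    """Simplified POSIX-style permission check.

    mode is the 10-character string from a listing, e.g. 'drwxr-xr-x';
    wanted is one of 'r', 'w', 'x'. Superuser and ACLs are ignored.
    """
    perms = mode[1:]  # drop the file-type character
    owner_bits, group_bits, other_bits = perms[0:3], perms[3:6], perms[6:9]
    if user == owner:
        bits = owner_bits
    elif group in user_groups:
        bits = group_bits
    else:
        bits = other_bits
    return wanted in bits


# Values taken from the error message: inode=...:hdfs:cloudera:drwxr-xr-x
print(hdfs_can_access("cloudera", {"cloudera"}, "hdfs", "cloudera", "drwxr-xr-x", "w"))
print(hdfs_can_access("hdfs", {"hdfs"}, "hdfs", "cloudera", "drwxr-xr-x", "w"))
```

The first call models the failing `put` as user cloudera (group class, `r-x`, no write), the second the successful one as user hdfs (owner class, `rwx`).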
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/cloudera/sample_data'
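Error analysis:
impala-shell executes statements as the impala user, and impala has no read/write permission on the HDFS path;
Fix:
Option 1: grant permissions on the HDFS directory: hadoop fs -chmod -R 777 path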
-bash-4.1$ hadoop fs -chmod -R 777 /user/cloudera/sample_data
-bash-4.1$ exit
logout
Option 2: create a hadoop user group, add impala to that group, and give the impala user tailored permissions
[quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
> LIKE PARQUET
> 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
> STORED AS PARQUET LOCATION 'hdfs:staging/airlines';
Query: create EXTERNAL TABLE airlines_external
LIKE PARQUET
'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
STORED AS PARQUET LOCATION 'hdfs:staging/airlines'
ERROR: AnalysisException: null
CAUSED BY: IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
CAUSED BY: URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
Solution:
The path is invalid; use a correct absolute path, such as hdfs:/user/impala/staging/airlines/XXX
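The "Relative path in absolute URI" error is raised because `hdfs:staging/airlines` names a scheme (`hdfs`) but its path does not start with `/`. A location can be screened for this before it is pasted into a CREATE TABLE statement. The Python sketch below uses the standard library's `urllib.parse.urlsplit` to approximate the check; the function name is hypothetical, and this mirrors rather than reproduces the Java URI validation that Impala performs.

```python
from urllib.parse import urlsplit


def relative_path_in_absolute_uri(location):
    """True if the URI names a scheme but its path does not start with '/'."""
    parts = urlsplit(location)
    return bool(parts.scheme) and not parts.netloc and not parts.path.startswith("/")


for candidate in (
    "hdfs:staging/airlines",                        # rejected in the session above
    "hdfs:/user/impala/staging/airlines",           # accepted earlier in this walkthrough
    "hdfs://quickstart.cloudera:8020/user/impala/staging/airlines",
):
    print(candidate, relative_path_in_absolute_uri(candidate))
```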