Amazon Athena Study Notes


Amazon Athena Overview

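What is Athena, in brief? Keywords:

  1. Interactive query service
  2. Ad-hoc queries
  3. Supports standard SQL
  4. Tables are defined over data stored in S3 (similar to Hive)
  5. Fast responses (on the order of seconds)
  6. Serverless
  7. Supports connections through JDBC and the Java API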

Amazon Athena is an interactive query service that lets you use standard SQL to analyze data directly in Amazon S3. You can point Athena at your data in Amazon S3 and run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage. You pay only for the queries you run. Athena scales automatically—executing queries in parallel—so results are fast, even with large datasets and complex queries.

 

If you connect to Athena using the JDBC driver, use version 1.1.0 of the driver or later with the Amazon Athena API. Earlier version drivers do not support the API. For more information and to download the driver, see Accessing Amazon Athena with JDBC.

For code samples using the AWS SDK for Java, see Examples and Code Samples.

 

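Naming rules for Athena databases, tables, and columns

  1. Database names, table names, and column names must be lowercase

  2. The underscore "_" is supported as a special character; other special characters are not

  3. If a name begins with "_", it must be enclosed in backticks (``)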
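
The three rules above can be sketched as a small check. This is only an illustration in Python, not anything Athena provides: the function names are hypothetical, and the regular expression assumes that digits are allowed alongside lowercase letters and underscores.

```python
import re

# Assumption: a valid name consists of lowercase letters, digits, and underscores only.
_NAME_RE = re.compile(r"[a-z0-9_]+")

def is_valid_name(name: str) -> bool:
    """Return True if the name follows rules 1 and 2 above."""
    return _NAME_RE.fullmatch(name) is not None

def quote_for_ddl(name: str) -> str:
    """Apply rule 3: wrap names that start with an underscore in backticks."""
    if not is_valid_name(name):
        raise ValueError(f"unsupported name: {name!r}")
    return f"`{name}`" if name.startswith("_") else name
```

For example, quote_for_ddl("_tmp") gives `_tmp` wrapped in backticks, while quote_for_ddl("self_learning_old") returns the name unchanged.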

 

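Creating an Athena table and loading data

1. The data is in S3; when creating the Athena table, use the LOCATION clause to point at the data on S3

NOTE: It looks like this only works when the table is created as an external table; to be verified later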

CREATE EXTERNAL TABLE IF NOT EXISTS default.self_learning_old(rowkey STRING,windspd INT,directh INT,directv INT,func STRING,value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://com.kong.bp.cn.test/test_folder/'
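
To make the mapping between a tab-delimited file and the table columns concrete, here is a small Python sketch. It only illustrates the file layout that the DDL describes and is not how Athena parses files. The sample line and its rowkey value are made up, since the real data is not shown in these notes.

```python
# Column names and types copied from the CREATE EXTERNAL TABLE statement above.
COLUMNS = [
    ("rowkey", str), ("windspd", int), ("directh", int),
    ("directv", int), ("func", str), ("value", int),
]

def parse_row(line: str) -> dict:
    """Split one tab-delimited line and convert each field to its column type."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {name: cast(raw) for (name, cast), raw in zip(COLUMNS, fields)}

# Hypothetical sample line; the real rowkey format is an assumption.
sample = "dev01:20190314120000\t3\t1\t2\tcool\t26\n"
row = parse_row(sample)
```

Each file under the LOCATION prefix is expected to contain lines of this shape, one record per line, with fields in the same order as the column list.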

2. Demo: creating a partitioned table from an existing table

CREATE TABLE self_learning
WITH (format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      partitioned_by = array['year'],
      external_location = 's3://com.kong.bp.cn.test/test_folder/self_learning_old/')
AS
SELECT
      windspd,
      directh,
      directv,
      func,
      value,
      -- partition columns must come last in the SELECT list
      cast(substr(split(rowkey, ':')[2], 1, 4) AS bigint) AS year
FROM default.self_learning_old
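
The year expression is easy to misread because SQL arrays and substr are 1-based. A minimal Python sketch of the same computation, assuming a rowkey whose second colon-separated part starts with a four-digit year (the sample value in the usage note is made up):

```python
def year_from_rowkey(rowkey: str) -> int:
    """Mirror cast(substr(split(rowkey, ':')[2], 1, 4) AS bigint)."""
    second_part = rowkey.split(":")[1]  # SQL index 2 (1-based) is Python index 1
    return int(second_part[:4])         # substr(x, 1, 4) is the first four characters
```

With a hypothetical rowkey such as "dev01:20190314120000", the function returns 2019, which is the value the row would carry in the year partition column.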

 

Querying JSON data with Athena

For loading JSON data in Athena, see "Querying JSON" in the documentation.

Sample JSON data:

{
"name": "Bob Smith",
"org": "engineering",
"projects": [{
"name": "project1",
"completed": false
}, {
"name": "project2",
"completed": true
}]
}

1. Parsing the data with the json_extract function:

WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset

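Result: json_extract returns JSON-encoded values, so name comes back as a quoted JSON string and projects as a JSON array.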

2. Using the json_extract_scalar function

json_extract_scalar is similar to json_extract, but it returns only scalar values (Boolean, number, or string).

NOTE: This function does not work on arrays, maps, or structs. I take "scalar" here to mean those scalar data types.

For example, extracting the values with json_extract_scalar:

WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT
json_extract_scalar(blob, '$.name') AS name,
json_extract_scalar(blob, '$.projects') AS projects
FROM dataset

Query result:

+---------------------------+
| name        | projects    |
+---------------------------+
| Susan Smith |             |
+---------------------------+

Because projects in the JSON is an array, json_extract_scalar cannot return it; the result is NULL, shown here as an empty cell.
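
The difference between the two functions can be mimicked in a few lines of Python. This is a rough analogue for illustration only: the helper names are hypothetical, and it handles only top-level keys rather than full JSONPath expressions.

```python
import json

# The same document used in the queries above.
BLOB = '''{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'''

def extract(blob: str, key: str):
    """Rough analogue of json_extract(blob, '$.key'): returns JSON text."""
    doc = json.loads(blob)
    if key not in doc:
        return None
    return json.dumps(doc[key], separators=(",", ":"))

def extract_scalar(blob: str, key: str):
    """Rough analogue of json_extract_scalar: a string for scalars, else None."""
    value = json.loads(blob).get(key)
    if value is None or isinstance(value, (dict, list)):
        return None  # arrays and objects are not scalars
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)
```

Here extract(BLOB, "name") keeps the JSON quotes around the string, extract_scalar(BLOB, "name") strips them, and extract_scalar(BLOB, "projects") gives None, matching the empty projects cell in the table.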

3. Using the json_array_get function

For array values like this one, the json_array_get function can be used, for example:

WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
FROM dataset

First use json_extract to get the projects entry, which is an array, then use json_array_get to fetch an element by its zero-based index. Result:

+---------------------------------------+
| item                                  |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
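
The index lookup can be sketched in Python as well. Again this is only an illustrative analogue with a hypothetical helper name. It covers arrays of objects as in the example, and it returns None for an index outside the array.

```python
import json

def array_get(json_array: str, index: int):
    """Rough analogue of json_array_get: element at a zero-based index, as JSON text."""
    items = json.loads(json_array)
    if not isinstance(items, list) or not -len(items) <= index < len(items):
        return None  # not an array, or index out of range
    return json.dumps(items[index], separators=(",", ":"))

# The projects array as json_extract would hand it over.
projects = '[{"name":"project1","completed":false},{"name":"project2","completed":true}]'
first = array_get(projects, 0)
```

Index 0 selects the project1 object, the same item the query returns; index 1 would select project2.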

 

