Differences Between Hive Internal and External Tables, and When to Use Each


Preface
PS: plenty of practical material ahead…
As we all know, Hive tables basically come in two types: internal (managed) tables and external tables. In interviews, questions like these come up all the time:
1. What is the difference between Hive internal and external tables?
2. When should you use an internal table, and when an external table?

Definitions from the official documentation:

Managed tables
A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.

Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

External tables
An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.

When creating an external table, you generally specify the data file storage path yourself at the end of the statement, e.g. LOCATION '/AUTO/PATH'.
An internal table needs no explicit path; it defaults to /user/hive/warehouse, which is configurable in hive-site.xml:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/hive/warehouse</value>
</property>
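
To check where a table's data actually lives and whether Hive treats it as managed or external, DESCRIBE FORMATTED prints both the Location and the Table Type (a minimal sketch, using the test table created later in this post):

DESCRIBE FORMATTED test;
-- In the output, look for:
--   Location:     hdfs://.../user/hive/warehouse/test
--   Table Type:   MANAGED_TABLE   (EXTERNAL_TABLE for an external table)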

Internal table data is managed by Hive itself; external table data is managed by HDFS.
DROP TABLE:
Internal table: both the metadata and the data files are deleted.
External table: only the metadata is deleted; the data files remain. You can simply recreate the table and query the data again right away.
LOAD DATA:
Loading data that is already on HDFS moves the files into the table's directory, much like the mv command.
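
A minimal sketch of the DROP behavior above, assuming the external_test table and location defined later in this post:

-- Dropping the external table removes only its metadata in the metastore.
DROP TABLE external_test;
-- The files under /warehouse/db01.db are still on HDFS, so re-running the
-- CREATE EXTERNAL TABLE statement (see below) makes the data queryable again:
SELECT * FROM external_test LIMIT 10;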

2. Use Cases
Again, from the official documentation:

ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
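
The TRUNCATE restriction in the list above is easy to observe directly (a minimal sketch, using the table names defined later in this post; behavior may vary slightly across Hive versions):

TRUNCATE TABLE test;            -- works: managed table, its data files are wiped
TRUNCATE TABLE external_test;   -- fails: Hive refuses to truncate a non-managed table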

In short:

For daily ng (nginx) logs and event-tracking logs, external tables are recommended. The log data is ingested continuously by collection programs, and recovering it after an accidental drop would be very painful; external tables also make it easier to share the data.

For business data extracted from upstream systems, either type works: even if the table is accidentally dropped, re-extracting the data is quick. If you want Hive to manage the data and the metadata together tightly, an internal table is still the better choice.

Intermediate tables and result tables produced during statistical analysis can be internal tables, since that data does not need to be shared. Moreover, for result partition tables we often keep only the most recent 3 days of data, and with an external table, dropping a partition does not delete the underlying files (see the sketch below).
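
A minimal sketch of that retention scenario, assuming a managed partition table like the partition_test table defined below (the partition value is hypothetical):

-- On a managed table, dropping an expired partition also removes its data files.
ALTER TABLE partition_test DROP IF EXISTS PARTITION (time='2020-12-17');
-- On an external table the same statement removes only the partition metadata;
-- the files would have to be cleaned up separately (e.g. hdfs dfs -rm -r).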

Internal table

create table test (
user_id string,
user_name string,
hobby array<string>,
scores map<string,int>
)
row format delimited
fields terminated by '|'             -- column delimiter
collection items terminated by ','   -- delimiter between array / map entries
map keys terminated by ':'           -- delimiter between a map key and its value
lines terminated by '\n';
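
For reference, a line in the data file matching the delimiters above might look like this (the values are made up for illustration):

1|zhangsan|reading,coding|math:90,english:80

Here '|' separates the four columns, ',' separates the array/map entries, and ':' separates each map key from its value.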

External table

create external table external_test (
user_id string,
user_name string,
hobby array<string>,
scores map<string,int>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
location '/warehouse/db01.db';   -- data files live (and stay) under this HDFS path
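
A minimal usage sketch: any correctly delimited file placed under that location becomes visible to queries (the file name is hypothetical):

-- hdfs dfs -put user_data.txt /warehouse/db01.db/
SELECT user_id, user_name FROM external_test LIMIT 10;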

Partitioned table

create table partition_test (
user_id string,
user_name string,
hobby array<string>,
scores map<string,int>
)
partitioned by (time string)   -- each value of time gets its own subdirectory, e.g. .../time=2020-12-20/
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
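
A minimal sketch of loading one day of data into the partitioned table (the local file path and partition value are hypothetical):

-- Creates the directory .../partition_test/time=2020-12-20/ and copies the file there
LOAD DATA LOCAL INPATH '/tmp/user_data_20201220.txt'
INTO TABLE partition_test PARTITION (time='2020-12-20');

-- List the partitions that now exist
SHOW PARTITIONS partition_test;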

Bucketed table

create table bucket_test (
user_id string,
user_name string,
hobby array<string>,
scores map<string,int>
)
clustered by (user_name) sorted by (user_name) into 3 buckets   -- rows hashed on user_name into 3 bucket files
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
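
Note that a plain LOAD DATA only copies files and does not hash rows into buckets, so a bucketed table is usually populated with INSERT ... SELECT (a minimal sketch; on Hive versions before 2.x you may also need SET hive.enforce.bucketing=true):

-- Let Hive hash the rows from the plain table into the 3 buckets
INSERT OVERWRITE TABLE bucket_test
SELECT user_id, user_name, hobby, scores FROM test;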

Loading data

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
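
Two concrete examples of the syntax above (file paths are hypothetical):

-- Copy a file from the local filesystem into the managed table test
LOAD DATA LOCAL INPATH '/tmp/user_data.txt' INTO TABLE test;

-- Move a file already on HDFS (behaves like mv, as noted earlier) and replace existing data
LOAD DATA INPATH '/data/input/user_data.txt' OVERWRITE INTO TABLE test;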

Original article: https://blog.csdn.net/liuge36/article/details/111425996

