Hive external tables: automatically reading files added to a directory


When creating a table, we can specify the external keyword to create an external table. The files backing an external table live in the directory given by the location clause. When new files are added to that directory, the table reads them as well (provided, of course, that their format matches the table definition), and dropping the external table does not delete the files under the location directory.
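
For contrast, a table created without the external keyword is a managed (internal) table: dropping it removes the data files too, because Hive owns the table's directory under its warehouse path (by default /user/hive/warehouse). A minimal sketch, using a hypothetical table name:

hive (hxl)> create table tb_managed(id int)
          > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
hive (hxl)> drop table tb_managed;

Here the drop removes both the metadata and the table's warehouse directory on HDFS, which is exactly the behavior the external keyword avoids.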

1. List the files under the HDFS directory /user/hadoop1/myfile

[hadoop1@node1]$ hadoop fs -ls /user/hadoop1/myfile/
Found 1 items
-rw-r--r--   3 hadoop1 supergroup     567839 2014-10-29 16:50 /user/hadoop1/myfile/tb_class.txt

2. Create an external table pointing at the files under the myfile directory

hive (hxl)> create external table tb_class_info_external
          > (id int,
          > class_name string,
          > createtime timestamp,
          > modifytime timestamp)
          > ROW FORMAT DELIMITED
          > FIELDS TERMINATED BY '|'
          > location '/user/hadoop1/myfile';
OK
Time taken: 0.083 seconds 
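
To confirm that the table really was created as an external table, you can inspect its metadata with describe formatted (standard Hive syntax; output not reproduced here):

hive (hxl)> describe formatted tb_class_info_external;

In its output, Table Type should read EXTERNAL_TABLE, and Location should show the HDFS path /user/hadoop1/myfile given above.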

Note that the path given to location is a path on HDFS, not on the local machine. The table tb_class_info_external will read all files under the myfile directory.
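
Each line of those files must therefore contain four '|'-separated fields matching the column list. A hypothetical row (the actual contents of tb_class.txt are not reproduced in this example) would look like:

10001|some_class_name|2014-10-29 16:50:00|2014-10-29 16:50:00

Hive expects timestamps in yyyy-MM-dd HH:mm:ss format; fields that fail to parse are read as NULL rather than raising an error.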

3. Query the external table

hive (hxl)> select count(1) from tb_class_info_external;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201410300915_0009, Tracking URL = http://node1:50030/jobdetails.jsp?jobid=job_201410300915_0009
Kill Command = /usr1/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=http://192.168.56.101:9001 -kill job_201410300915_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-10-30 15:25:10,652 Stage-1 map = 0%,  reduce = 0%
2014-10-30 15:25:12,664 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:13,671 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:14,682 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:15,690 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:16,697 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:17,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:18,710 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:19,718 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:25:20,725 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
2014-10-30 15:25:21,730 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
2014-10-30 15:25:22,737 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
MapReduce Total cumulative CPU time: 1 seconds 210 msec
Ended Job = job_201410300915_0009
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.21 sec   HDFS Read: 568052 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 210 msec
OK
10001
Time taken: 14.742 seconds 

The table currently holds 10001 rows. Next, we add another file, tb_class_bak.txt, to the myfile directory.

4. Add another file to the myfile directory

$ hadoop fs -cp /user/hadoop1/myfile/tb_class.txt /user/hadoop1/myfile/tb_class_bak.txt
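
Copying within HDFS is just one way to add data; uploading a new file from the local filesystem works equally well (the local path below is hypothetical):

$ hadoop fs -put /tmp/tb_class_new.txt /user/hadoop1/myfile/

No ALTER TABLE or metadata refresh is needed in either case; the next query simply scans whatever files are present in the directory.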

5. Query the row count again

hive (hxl)> select count(1) from tb_class_info_external;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201410300915_0010, Tracking URL = http://node1:50030/jobdetails.jsp?jobid=job_201410300915_0010
Kill Command = /usr1/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=http://192.168.56.101:9001 -kill job_201410300915_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-10-30 15:32:02,275 Stage-1 map = 0%,  reduce = 0%
2014-10-30 15:32:04,286 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:05,292 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:06,300 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:07,306 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:08,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:09,319 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:10,327 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:11,331 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
2014-10-30 15:32:12,338 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
2014-10-30 15:32:13,343 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
2014-10-30 15:32:14,350 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
MapReduce Total cumulative CPU time: 1 seconds 160 msec
Ended Job = job_201410300915_0010
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.16 sec   HDFS Read: 1135971 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 160 msec
OK
20002
Time taken: 14.665 seconds 

The row count has doubled, which confirms that the table picked up the newly added file.
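
If you want to verify which file each row came from, Hive (0.8 and later) exposes the virtual column INPUT__FILE__NAME, which holds the HDFS path of the file a row was read from. A query along these lines should show 10001 rows from each of the two files:

hive (hxl)> select INPUT__FILE__NAME, count(1)
          > from tb_class_info_external
          > group by INPUT__FILE__NAME;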

6. Drop the table

hive (hxl)> drop table tb_class_info_external;
OK
Time taken: 1.7 seconds

The files backing the table are not deleted:

[hadoop1@node1]$ hadoop fs -ls /user/hadoop1/myfile/
Found 2 items
-rw-r--r--   3 hadoop1 supergroup     567839 2014-10-29 16:50 /user/hadoop1/myfile/tb_class.txt
-rw-r--r--   3 hadoop1 supergroup     567839 2014-10-30 15:28 /user/hadoop1/myfile/tb_class_bak.txt
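
Since the data files survive the drop, the table can be restored at any time by re-running the same DDL from step 2 against the same location; a count(1) afterwards should again return 20002:

hive (hxl)> create external table tb_class_info_external
          > (id int,
          > class_name string,
          > createtime timestamp,
          > modifytime timestamp)
          > ROW FORMAT DELIMITED
          > FIELDS TERMINATED BY '|'
          > location '/user/hadoop1/myfile';

This is what makes external tables a good fit for data that is shared with other tools or produced by external processes.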

 

