1. First, download the test data (you can also create your own data instead)
http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
2. Data files and field names
movies.csv (movie metadata)
movieId,title,genres
ratings.csv (user rating data)
userId,movieId,rating,timestamp
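Before loading these files into Hive, it can help to check two things about their layout: each file begins with a header line, and some movie titles are quoted because they contain commas. The sketch below counts fields the way a plain ',' delimiter would. The two sample lines are illustrative stand-ins in the movies.csv layout, not a copy of the real file.

```shell
# Illustrative sample in the movies.csv layout: a header line and one
# row whose quoted title contains a comma (assumed sample data).
sample='movieId,title,genres
11,"American President, The (1995)",Comedy|Drama|Romance'

# Count fields per line when splitting on every comma,
# which is what a plain ',' field delimiter does.
field_counts=$(printf '%s\n' "$sample" | awk -F',' '{ print NF }')
echo "$field_counts"
```

The header line yields three fields, but the quoted row yields four, so a table that splits only on ',' would put part of such a title into the genres column.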
3. Upload the data to HDFS
hdfs dfs -mkdir /hive_operate
hdfs dfs -mkdir /hive_operate/movie_table
hdfs dfs -mkdir /hive_operate/rating_table
hdfs dfs -put movies.csv /hive_operate/movie_table
hdfs dfs -put ratings.csv /hive_operate/rating_table
4. Create movie_table and rating_table
]$ cat create_movie_table.sql
create external table movie_table (
  movieId STRING,
  title STRING,
  genres STRING
)
row format delimited fields terminated by ','
stored as textfile
location '/hive_operate/movie_table';

]$ cat create_rating_table.sql
create external table rating_table (
  userId STRING,
  movieId STRING,
  rating STRING,
  ts STRING
)
row format delimited fields terminated by ','
stored as textfile
location '/hive_operate/rating_table';

The original field name timestamp is a reserved word in Hive and causes an error when the statement is executed. You can either quote it with backticks or rename the field; here I renamed it to ts.
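Because row format delimited splits on every comma and does not skip header lines, a quoted title containing a comma would spill into the genres column, and the CSV header would show up as a data row. One possible alternative, sketched here, is Hive's OpenCSVSerde together with the skip.header.line.count table property. The script name create_movie_table_csv.sql and the table name movie_table_csv are hypothetical. OpenCSVSerde exposes every column as STRING, which matches the column types used above.

```shell
# Write a hypothetical alternative DDL script (file and table names are assumptions).
sql_file=create_movie_table_csv.sql
cat > "$sql_file" <<'EOF'
-- OpenCSVSerde defaults: separator ',' and quote character '"'.
create external table movie_table_csv (
  movieId STRING,
  title   STRING,
  genres  STRING
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile
location '/hive_operate/movie_table'
tblproperties ("skip.header.line.count" = "1");
EOF

# Run the script only where the hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$sql_file"
fi
```

Since both tables are external and point at the same location, the original movie_table stays usable for comparison.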
5. Execute
You can paste the statements into the hive prompt and run them there, or create the tables from a script file with hive -f movie_table_e
6. Check the result
hive> show tables;
OK
movie_table
rating_table
hive> select * from rating_table limit 10;
OK
1 31 2.5 1260759144
1 1029 3.0 1260759179
1 1061 3.0 1260759182
1 1129 2.0 1260759185
1 1172 4.0 1260759205
1 1263 2.0 1260759151
1 1287 2.0 1260759187
1 1293 2.0 1260759148
1 1339 3.5 1260759125
1 1343 2.0 1260759131
7. Build a new table (the behavior table)
create table behavior_table as
select B.userid, A.movieid, B.rating, A.title
from movie_table A
join rating_table B
on A.movieid = B.movieid;
8. Export Hive table data to the local filesystem
table -> local file
insert overwrite local directory '/root/hive_test/1.txt'
select * from behavior_table;
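Note that the path given to insert overwrite local directory is treated as a directory, and without a row format clause Hive separates the exported fields with the Ctrl-A character (\001). The sketch below simulates one such output file in a temporary directory and converts it to tab-separated text. The file name 000000_0 is typical of Hive output but is an assumption here, as is the sample row.

```shell
# Simulate one exported part file in a scratch directory
# (the file name and row contents are assumed sample data).
export_dir=$(mktemp -d)
sep=$(printf '\001')
printf '%s\n' "1${sep}31${sep}2.5${sep}Dangerous Minds (1995)" > "$export_dir/000000_0"

# Concatenate every part file and turn the Ctrl-A separators into tabs.
cat "$export_dir"/* | tr '\001' '\t' > "$export_dir.tsv"
cat "$export_dir.tsv"
```

For a real export, point export_dir at the directory named in the insert statement instead of a temporary one.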
9. Export Hive table data to HDFS
table -> hdfs file
insert overwrite directory '/root/hive_test/1.txt'
select * from behavior_table;
10. Load local data into a Hive table
local file -> table
LOAD DATA LOCAL INPATH '/root/hive_test/a.txt' OVERWRITE INTO TABLE behavior_table;
11. Load data from HDFS into a Hive table
hdfs file -> table
LOAD DATA INPATH '/a.txt' OVERWRITE INTO TABLE behavior_table;
