從MySQL到Hive，數據遷移就這么簡單

本文轉載自查看原文 2017-07-05 16:43 6568 Hive

使用Sqoop能夠極大簡化MySQL數據遷移至Hive之流程，並降低Hadoop處理分析任務時的難度。

先決條件:安裝並運行有Sqoop與Hive的Hadoop環境。為了加快處理速度，我們還將使用Cloudera Quickstart VM(要求至少4 GB內存)，不過大家也可以使用Hortonworks Data Platform(至少要求8 GB內存)。由於我的筆記本電腦只有8 GB內存，因此我在這里使用CLoudera VM鏡像。

如果大家需要使用Virtualbox運行Cloudera/HDP VM，則可以輕松使用其它多種Hadoop生態系統預裝軟件包(包括MySQL、Oozie、Hadoop、Hive、Zookeeper、Storm、Kafka以及Spark等等)。

在MySQL中創建表

在Cloudera VM中，打開命令界面並確保MySQL已經安裝完畢。

  
  
  
          
   
   
   
           shell> mysql --version 
   
   
   
           mysql  Ver 14.14 Distrib 5.1.66, for redhat-linux-gnu (x86_64) using readline 5.

示例當中自然要使用自己的數據庫，因此使用以下命令在MySQL中創建一套數據庫：

  
  
  
          
   
   
   
           mysql> create database sqoop;

接下來：

  
  
  
          
   
   
   
           mysql> use sqoop; 
   
   
   
           mysql> create table customer(id varchar(3), name varchar(20), age varchar(3), salary integer(10)); 
   
   
   
           Query OK, 0 rows affected (0.09 sec) 
   
   
   
           mysql> desc customer; 
   
   
   
           +--------+-------------+------+-----+---------+-------+ 
   
   
   
           | Field  | Type        | Null | Key | Default | Extra | 
   
   
   
           +--------+-------------+------+-----+---------+-------+ 
   
   
   
           | id     | varchar(3)  | YES  |     | NULL    |       | 
   
   
   
           | name   | varchar(20) | YES  |     | NULL    |       | 
   
   
   
           | age    | varchar(3)  | YES  |     | NULL    |       | 
   
   
   
           | salary | int(10)     | YES  |     | NULL    |       | 
   
   
   
           +--------+-------------+------+-----+---------+-------+

  
  
  
          
   
   
   
           mysql> select * from customer; 
   
   
   
           +------+--------+------+--------+ 
   
   
   
           | id   | name   | age  | salary | 
   
   
   
           +------+--------+------+--------+ 
   
   
   
           | 1    | John   | 30   |  80000 | 
   
   
   
           | 2    | Kevin  | 33   |  84000 | 
   
   
   
           | 3    | Mark   | 28   |  90000 | 
   
   
   
           | 4    | Jenna  | 34   |  93000 | 
   
   
   
           | 5    | Robert | 32   | 100000 | 
   
   
   
           | 6    | Zoya   | 40   |  60000 | 
   
   
   
           | 7    | Sam    | 37   |  75000 | 
   
   
   
           | 8    | George | 31   |  67000 | 
   
   
   
           | 9    | Peter  | 23   |  70000 | 
   
   
   
           | 19   | Alex   | 26   |  74000 | 
   
   
   
           +------+--------+------+-----

開始Sqoop之旅

如大家所見，其中customer表中並不包含主鍵。我在該表中並未添加多少記錄。默認情況下，Sqoop能夠識別出表中的主鍵列(如果有的話)，並將其作為划分列。該划分列的低值與高值檢索自該數據庫，而映射任務則指向符合區間要求的均勻部分。

如果主鍵並未均勻分布在該區間當中，那么任務將出現不平衡狀況。這時，大家應當明確選定一個與--split-by參數不同的列，例如--split-by id。

由於我們希望將此表直接導入至Hive中，因此需要在Sqoop命令中添加–hive-import：

  
  
  
          
   
   
   
           sqoop import --connect jdbc:mysql://localhost:3306/sqoop  
   
   
   
           --username root  
   
   
   
           -P  
   
   
   
           --split-by id  
   
   
   
           --columns id,name  
   
   
   
           --table customer   
   
   
   
           --target-dir /user/cloudera/ingest/raw/customers  
   
   
   
           --fields-terminated-by ","  
   
   
   
           --hive-import  
   
   
   
           --create-hive-table  
   
   
   
           --hive-table sqoop_workspace.customers

下面來看Sqoop命令各選項的具體作用：

connect – 提供jdbc字符串

username – 數據庫用戶名

-P – 將在控制台中詢問密碼。大家也可以使用-passwaord，但並不推薦這種作法，因為其會顯示在任務執行日志中並可能導致問題。解決辦法之一在於將數據庫密碼存儲在HDFS中的文件內，並將其向運行時交付。

table – 告知計算機我們希望導入哪個MySQL表。在這里，表名稱為customer。
split-by – 指定划分列。在這里我們指定id列。
target-dir – HDFS目標目錄。
fields-terminated-by – 我已經指定了逗號作為分隔值(默認情況下，導入HDFS的數據以逗號作為分隔值)。
hive-import – 將表導入Hive(如果不加設置，則使用Hive的默認分隔符)。
create-hive-table – 檢查如果已經存在一個Hive表，任務設置是否會因此失敗。
hive-table – 指定.。本示例中為sqoop_workspace.customers，其中sqoop_workspace為數據庫名稱，而customers則為表名稱。

如下所示，Sqoop為一項map-reduce任務。請注意，這里我使用-P作為密碼選項。除了這種方式，我們也可以使用-password實現參數化，並從文件中讀取密碼內容。

  
  
  
          
   
   
   
           sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer  --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers 
   
   
   
           Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail. 
   
   
   
           Please set $ACCUMULO_HOME to the root of your Accumulo installation. 
   
   
   
           16/03/01 12:59:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0 
   
   
   
           Enter password: 
   
   
   
           16/03/01 12:59:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 
   
   
   
           16/03/01 12:59:54 INFO tool.CodeGenTool: Beginning code generation 
   
   
   
           16/03/01 12:59:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1 
   
   
   
           16/03/01 12:59:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1 
   
   
   
           16/03/01 12:59:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce 
   
   
   
           Note: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.java uses or overrides a deprecated API. 
   
   
   
           Note: Recompile with -Xlint:deprecation for details. 
   
   
   
           16/03/01 13:00:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.jar 
   
   
   
           16/03/01 13:00:01 WARN manager.MySQLManager: It looks like you are importing from mysql. 
   
   
   
           16/03/01 13:00:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 
   
   
   
           16/03/01 13:00:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 
   
   
   
           16/03/01 13:00:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 
   
   
   
           16/03/01 13:00:01 INFO mapreduce.ImportJobBase: Beginning import of customer 
   
   
   
           16/03/01 13:00:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 
   
   
   
           16/03/01 13:00:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 
   
   
   
           16/03/01 13:00:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 
   
   
   
           16/03/01 13:00:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 
   
   
   
           16/03/01 13:00:11 INFO db.DBInputFormat: Using read commited transaction isolation 
   
   
   
           16/03/01 13:00:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `customer` 
   
   
   
           16/03/01 13:00:11 WARN db.TextSplitter: Generating splits for a textual index column. 
   
   
   
           16/03/01 13:00:11 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records. 
   
   
   
           16/03/01 13:00:11 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column. 
   
   
   
           16/03/01 13:00:11 INFO mapreduce.JobSubmitter: number of splits:4 
   
   
   
           16/03/01 13:00:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456782715090_0004 
   
   
   
           16/03/01 13:00:13 INFO impl.YarnClientImpl: Submitted application application_1456782715090_0004 
   
   
   
           16/03/01 13:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1456782715090_0004/ 
   
   
   
           16/03/01 13:00:13 INFO mapreduce.Job: Running job: job_1456782715090_0004 
   
   
   
           16/03/01 13:00:47 INFO mapreduce.Job: Job job_1456782715090_0004 running in uber mode : false 
   
   
   
           16/03/01 13:00:48 INFO mapreduce.Job:  map 0% reduce 0% 
   
   
   
           16/03/01 13:01:43 INFO mapreduce.Job:  map 25% reduce 0% 
   
   
   
           16/03/01 13:01:46 INFO mapreduce.Job:  map 50% reduce 0% 
   
   
   
           16/03/01 13:01:48 INFO mapreduce.Job:  map 100% reduce 0% 
   
   
   
           16/03/01 13:01:48 INFO mapreduce.Job: Job job_1456782715090_0004 completed successfully 
   
   
   
           16/03/01 13:01:48 INFO mapreduce.Job: Counters: 30 
   
   
   
               File System Counters 
   
   
   
                   FILE: Number of bytes read=0 
   
   
   
                   FILE: Number of bytes written=548096 
   
   
   
                   FILE: Number of read operations=0 
   
   
   
                   FILE: Number of large read operations=0 
   
   
   
                   FILE: Number of write operations=0 
   
   
   
                   HDFS: Number of bytes read=409 
   
   
   
                   HDFS: Number of bytes written=77 
   
   
   
                   HDFS: Number of read operations=16 
   
   
   
                   HDFS: Number of large read operations=0 
   
   
   
                   HDFS: Number of write operations=8 
   
   
   
               Job Counters  
   
   
   
                   Launched map tasks=4 
   
   
   
                   Other local map tasks=5 
   
   
   
                   Total time spent by all maps in occupied slots (ms)=216810 
   
   
   
                   Total time spent by all reduces in occupied slots (ms)=0 
   
   
   
                   Total time spent by all map tasks (ms)=216810 
   
   
   
                   Total vcore-seconds taken by all map tasks=216810 
   
   
   
                   Total megabyte-seconds taken by all map tasks=222013440 
   
   
   
               Map-Reduce Framework 
   
   
   
                   Map input records=10 
   
   
   
                   Map output records=10 
   
   
   
                   Input split bytes=409 
   
   
   
                   Spilled Records=0 
   
   
   
                   Failed Shuffles=0 
   
   
   
                   Merged Map outputs=0 
   
   
   
                   GC time elapsed (ms)=2400 
   
   
   
                   CPU time spent (ms)=5200 
   
   
   
                   Physical memory (bytes) snapshot=418557952 
   
   
   
                   Virtual memory (bytes) snapshot=6027804672 
   
   
   
                   Total committed heap usage (bytes)=243007488 
   
   
   
               File Input Format Counters  
   
   
   
                   Bytes Read=0 
   
   
   
               File Output Format Counters  
   
   
   
                   Bytes Written=77 
   
   
   
           16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Transferred 77 bytes in 104.1093 seconds (0.7396 bytes/sec) 
   
   
   
           16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Retrieved 10 records. 
   
   
   
           16/03/01 13:01:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1 
   
   
   
           16/03/01 13:01:49 INFO hive.HiveImport: Loading uploaded data into Hive 
   
   
   
           Logging initialized using configuration in jar:file:/usr/jars/hive-common-1.1.0-cdh5.5.0.jar!/hive-log4j.properties 
   
   
   
           OK 
   
   
   
           Time taken: 2.163 seconds 
   
   
   
           Loading data to table sqoop_workspace.customers 
   
   
   
           chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000': User does not belong to supergroup 
   
   
   
           chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00001': User does not belong to supergroup 
   
   
   
           chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00002': User does not belong to supergroup 
   
   
   
           chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00003': User does not belong to supergroup 
   
   
   
           Table sqoop_workspace.customers stats: [numFiles=4, totalSize=77] 
   
   
   
           OK 
   
   
   
           Time taken: 1.399 seconds

最后，讓我們驗證Hive中的輸出結果：

  
  
  
          
   
   
   
           hive> show databases; 
   
   
   
           OK 
   
   
   
           default 
   
   
   
           sqoop_workspace 
   
   
   
           Time taken: 0.034 seconds, Fetched: 2 row(s) 
   
   
   
           hive> use sqoop_workspace; 
   
   
   
           OK 
   
   
   
           Time taken: 0.063 seconds 
   
   
   
           hive> show tables; 
   
   
   
           OK 
   
   
   
           customers 
   
   
   
           Time taken: 0.036 seconds, Fetched: 1 row(s) 
   
   
   
           hive> show create table customers; 
   
   
   
           OK 
   
   
   
           CREATE TABLE `customers`( 
   
   
   
             `id` string,  
   
   
   
             `name` string) 
   
   
   
           COMMENT 'Imported by sqoop on 2016/03/01 13:01:49' 
   
   
   
           ROW FORMAT DELIMITED  
   
   
   
             FIELDS TERMINATED BY ','  
   
   
   
             LINES TERMINATED BY '\n'  
   
   
   
           STORED AS INPUTFORMAT  
   
   
   
             'org.apache.hadoop.mapred.TextInputFormat'  
   
   
   
           OUTPUTFORMAT  
   
   
   
             'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
   
   
   
           LOCATION 
   
   
   
             'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers' 
   
   
   
           TBLPROPERTIES ( 
   
   
   
             'COLUMN_STATS_ACCURATE'='true',  
   
   
   
             'numFiles'='4',  
   
   
   
             'totalSize'='77',  
   
   
   
             'transient_lastDdlTime'='1456866115') 
   
   
   
           Time taken: 0.26 seconds, Fetched: 18 row(s)

hive> select * from customers;

1 John

2 Kevin

19 Alex

3 Mark

4 Jenna

5 Robert

6 Zoya

7 Sam

8 George

9 Peter

Time taken: 1.123 seconds, Fetched: 10 row(s).

到此完成!從MySQL到Hive，數據遷移工作就是這么簡單。

來源： http://developer.51cto.com/art/201605/510738.htm

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 遷移hive表及hive數據關於hdfs 和hive的數據遷移 MySQL向Hive/HBase的遷移工具 HBase數據遷移至Hive hive數據遷移到clickhouse Druid數據庫連接池就這么簡單數據庫遷移之從oracle 到 MySQL最簡單的方法數據庫遷移之從oracle 到 MySQL最簡單的方法遞歸就這么簡單 Elasticsearch就這么簡單