1. What is Sqoop
Sqoop, short for SQL-to-Hadoop, is a tool for conveniently migrating data between traditional relational databases and Hadoop. It takes full advantage of MapReduce's parallelism to move data in batches, which speeds up the transfer. Over time it has evolved into two major versions, Sqoop1 and Sqoop2.
Sqoop is the bridge between relational databases and the Hadoop ecosystem: it supports moving data in both directions between relational databases and Hive, HDFS, and HBase, and it can do both full-table and incremental imports.
So why choose Sqoop?
- Efficient, controllable use of resources: task parallelism and timeouts can be tuned (see the sketch after this list).
- Data type mapping and conversion are handled automatically and can also be customized by the user.
- Support for the major mainstream databases: MySQL, Oracle, SQL Server, DB2, and so on.
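As a quick illustration of how these knobs appear on the command line, here is a minimal import sketch. The host, database, table, and column names are hypothetical, so adapt them to your own environment; -m controls the number of parallel map tasks and --map-column-java overrides the default type mapping for a single column.
# Hypothetical example: import table "orders" from MySQL into HDFS with 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/shop \
  --username etl -P \
  --table orders \
  --split-by order_id \
  --target-dir /user/sqoop/orders \
  -m 4 \
  --map-column-java price=Double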
2. How Sqoop1 and Sqoop2 compare
- They are two different versions and are completely incompatible with each other.
- They are distinguished by version number. Apache releases: 1.4.x is Sqoop1, 1.99.x is Sqoop2. CDH releases: Sqoop-1.4.3-cdh4 is Sqoop1, Sqoop2-1.99.2-cdh4.5.0 is Sqoop2.
- Improvements of Sqoop2 over Sqoop1:
  - A Sqoop server is introduced to centrally manage connectors and related resources.
  - Multiple access methods: CLI, web UI, and REST API.
  - A role-based security mechanism is introduced.
3. Architecture diagrams of Sqoop1 and Sqoop2
[Figure: Sqoop1 architecture diagram]
[Figure: Sqoop2 architecture diagram]
4. Strengths and weaknesses of Sqoop1 vs. Sqoop2

| Aspect | Sqoop1 | Sqoop2 |
| --- | --- | --- |
| Architecture | Uses only a Sqoop client | Introduces a Sqoop server to centrally manage connectors, adds a REST API and web UI, and introduces a role-based security mechanism |
| Deployment | Simple to deploy; installation requires root privileges; connectors must conform to the JDBC model | Somewhat more complex architecture; configuration and deployment are more involved |
| Usage | The command-line format is error-prone and tightly coupled; not all data types are supported; the security mechanism is weak (e.g. passwords are exposed on the command line) | Multiple interaction modes (command line, web UI, REST API); connectors are centrally managed and all connections live on the Sqoop server; a fuller permission model; connectors are standardized and responsible only for reading and writing data |
5. Installing and deploying Sqoop
5.0 Installation environment
hadoop:hadoop-1.0.4
sqoop:sqoop-1.4.5.bin__hadoop-1.0.0
5.1 Download the package and unpack it
tar -zxvf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
ln -s ./package/sqoop-1.4.5.bin__hadoop-1.0.0/ sqoop
5.2 Configure environment variables and configuration files
cd sqoop/conf/
mv sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Add the following lines to sqoop-env.sh:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hadoop/hadoop
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop
#set the path to where bin/hbase is available
export HBASE_HOME=/home/hadoop/hbase
#Set the path to where bin/hive is available
export HIVE_HOME=/home/hadoop/hive
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/home/hadoop/zookeeper
(If the data transfers do not involve HBase or Hive, the HBase and Hive settings can be omitted. If the cluster has a standalone ZooKeeper ensemble, configure ZOOCFGDIR; otherwise it is not needed.)
5.3 Copy the required library jars into sqoop/lib
Required jars: the hadoop-core jar, the Oracle JDBC jar, and the MySQL JDBC jar. (Since my project only uses Oracle, only the Oracle driver ojdbc6.jar is added here.)
cp ~/hadoop/hadoop-core-1.0.4.jar ~/sqoop/lib/
cp ojdbc6.jar ~/sqoop/lib/
5.4 Add environment variables
vi ~/.bash_profile
Add the following lines:
#Sqoop
export SQOOP_HOME=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source ~/.bash_profile
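At this point the installation can be sanity-checked. Two harmless commands (the exact version string and tool list will vary with your build):
# Confirm the sqoop script is on the PATH and picks up the configuration
sqoop version
# List the available Sqoop tools (import, export, list-databases, ...)
sqoop help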
5.5 Test the connection to an Oracle database
① Connect to the Oracle database and list all databases
[hadoop@eb179 sqoop]$ sqoop list-databases --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq -P
or: sqoop list-databases --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq --password 123456
Warning: /home/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: $HADOOP_HOME is deprecated.
14/08/17 11:59:24 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Enter password:
14/08/17 11:59:27 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
14/08/17 11:59:27 INFO manager.SqlManager: Using default fetchSize of 1000
14/08/17 11:59:51 INFO manager.OracleManager: Time zone has been set to GMT
MRDRP
MKFOW_QH
② Import a table from the Oracle database into HDFS
Notes:
- By default Sqoop runs 4 map tasks; each task writes the data it imports to its own file, and all files end up in the same directory. In this example -m 1 means only a single map task is used.
- Text files cannot hold binary fields and cannot distinguish a null value from the string "null".
- Running the command below also generates a Java source file for the table (here ORD_UV.java, visible with ls). Code generation is a required part of every Sqoop import: before writing the source data to HDFS, Sqoop uses the generated class to deserialize the records it reads from the database.
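As an aside, the class-generation step can also be run on its own with the codegen tool, which is handy when you only want to inspect or reuse the generated class. A sketch reusing the connection parameters from this example:
# Generate ORD_UV.java / ORD_UV.jar without running the import itself
sqoop codegen \
  --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI \
  --username huangq -P \
  --table ORD_UV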
[hadoop@eb179 ~]$ sqoop import --connect jdbc:oracle:thin:@10.1.69.173:1521:ORCLBI --username huangq --password 123456 --table ORD_UV -m 1 --target-dir /user/sqoop/test --direct-split-size 67108864
Warning: /home/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: $HADOOP_HOME is deprecated.
14/08/17 15:21:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
14/08/17 15:21:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/08/17 15:21:34 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
14/08/17 15:21:34 INFO manager.SqlManager: Using default fetchSize of 1000
14/08/17 15:21:34 INFO tool.CodeGenTool: Beginning code generation
14/08/17 15:21:46 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ORD_UV t WHERE 1=0
14/08/17 15:21:46 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop
Note: /tmp/sqoop-hadoop/compile/328657d577512bd2c61e07d66aaa9bb7/ORD_UV.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/08/17 15:21:47 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/328657d577512bd2c61e07d66aaa9bb7/ORD_UV.jar
14/08/17 15:21:47 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:47 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:47 INFO mapreduce.ImportJobBase: Beginning import of ORD_UV
14/08/17 15:21:47 INFO manager.OracleManager: Time zone has been set to GMT
14/08/17 15:21:49 INFO db.DBInputFormat: Using read commited transaction isolation
14/08/17 15:21:49 INFO mapred.JobClient: Running job: job_201408151734_0027
14/08/17 15:21:50 INFO mapred.JobClient: map 0% reduce 0%
14/08/17 15:22:12 INFO mapred.JobClient: map 100% reduce 0%
14/08/17 15:22:17 INFO mapred.JobClient: Job complete: job_201408151734_0027
14/08/17 15:22:17 INFO mapred.JobClient: Counters: 18
14/08/17 15:22:17 INFO mapred.JobClient: Job Counters
14/08/17 15:22:17 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=15862
14/08/17 15:22:17 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/17 15:22:17 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/08/17 15:22:17 INFO mapred.JobClient: Launched map tasks=1
14/08/17 15:22:17 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/08/17 15:22:17 INFO mapred.JobClient: File Output Format Counters
14/08/17 15:22:17 INFO mapred.JobClient: Bytes Written=1472
14/08/17 15:22:17 INFO mapred.JobClient: FileSystemCounters
14/08/17 15:22:17 INFO mapred.JobClient: HDFS_BYTES_READ=87
14/08/17 15:22:17 INFO mapred.JobClient: FILE_BYTES_WRITTEN=33755
14/08/17 15:22:17 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1472
14/08/17 15:22:17 INFO mapred.JobClient: File Input Format Counters
14/08/17 15:22:17 INFO mapred.JobClient: Bytes Read=0
14/08/17 15:22:17 INFO mapred.JobClient: Map-Reduce Framework
14/08/17 15:22:17 INFO mapred.JobClient: Map input records=81
14/08/17 15:22:17 INFO mapred.JobClient: Physical memory (bytes) snapshot=192405504
14/08/17 15:22:17 INFO mapred.JobClient: Spilled Records=0
14/08/17 15:22:17 INFO mapred.JobClient: CPU time spent (ms)=1540
14/08/17 15:22:17 INFO mapred.JobClient: Total committed heap usage (bytes)=503775232
14/08/17 15:22:17 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2699571200
14/08/17 15:22:17 INFO mapred.JobClient: Map output records=81
14/08/17 15:22:17 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
14/08/17 15:22:17 INFO mapreduce.ImportJobBase: Transferred 1.4375 KB in 29.3443 seconds (50.1631 bytes/sec)
14/08/17 15:22:17 INFO mapreduce.ImportJobBase: Retrieved 81 records.
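Once the job completes, the imported records can be inspected directly in HDFS. A minimal check (the part file name below assumes the single-mapper run above):
hadoop fs -ls /user/sqoop/test
hadoop fs -cat /user/sqoop/test/part-m-00000 | head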
③ Exporting data to Oracle and importing into HBase
- The export tool pushes data from HDFS into a remote database, for example:
sqoop export --connect jdbc:oracle:thin:@192.168.**.**:**:** --username ** --password ** -m 1 --table VEHICLE --export-dir /user/root/VEHICLE
- Importing data into HBase:
sqoop import --connect jdbc:oracle:thin:@192.168.**.**:**:** --username ** --password ** -m 1 --table VEHICLE --hbase-create-table --hbase-table VEHICLE --hbase-row-key ID --column-family VEHICLEINFO --split-by ID
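If the import succeeds, the rows should be visible from the HBase shell. A quick, hedged check using the table name above:
hbase shell
# inside the shell:
list
scan 'VEHICLE', {LIMIT => 5}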
5.6 Testing with a MySQL database
Prerequisite: the MySQL JDBC jar has been copied into sqoop/lib.
① Test the database connection
sqoop list-databases --connect jdbc:mysql://192.168.10.63 --username root --password 123456
② Using Sqoop
Note: in the multi-line commands below, each option must be separated from the next by a space; when typing them into a shell, either join them onto one line or end each continued line with a trailing backslash, as shown.
(None of the following six commands has been verified to run successfully.)
<1> hdfs --> mysql
sqoop export --connect \
jdbc:mysql://192.168.10.63/ipj \
--username root \
--password 123456 \
--table ipj_flow_user \
--export-dir hdfs://192.168.10.63:8020/user/flow/part-m-00000
Prerequisites:
(1) The HDFS path /user/flow/part-m-00000 must exist.
(2) If the cluster is configured with LZO compression, LZO must also be installed and correctly configured on the local machine.
(3) Every node in the Hadoop cluster must have access privileges on the MySQL database.
<2> mysql --> hdfs
sqoop import --connect \
jdbc:mysql://192.168.10.63/ipj \
--table ipj_flow_user
<3> mysql --> hbase
sqoop import --connect \
jdbc:mysql://192.168.10.63/ipj \
--table ipj_flow_user \
--hbase-table ipj_statics_test \
--hbase-create-table \
--hbase-row-key id \
--column-family imei
<4> hbase --> mysql
Sqoop does not directly support moving data from HBase into MySQL; three approaches are commonly used:
Approach 1: flatten the HBase data into HDFS files, then export them to MySQL with Sqoop.
Approach 2: load the HBase data into a Hive table, then export from Hive to MySQL (a sketch follows below).
Approach 3: read the table directly with the HBase Java API and write it into MySQL yourself; Sqoop is not needed at all.
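As an illustration of approach 2, here is a rough, untested sketch. It maps the HBase table into Hive through the HBaseStorageHandler, materializes it as a plain Hive table, and then exports the resulting files with Sqoop; all table, column-family, and target names are hypothetical, and the \001 delimiter assumes Hive's default text format.
# Step 1 (Hive): map the HBase table and materialize it as a plain Hive table
hive -e "
CREATE EXTERNAL TABLE ipj_statics_hbase (id STRING, imei STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,imei:imei')
TBLPROPERTIES ('hbase.table.name' = 'ipj_statics_test');
CREATE TABLE ipj_statics_flat AS SELECT * FROM ipj_statics_hbase;
"
# Step 2: export the materialized table's files to MySQL with Sqoop
sqoop export --connect jdbc:mysql://192.168.10.63/ipj \
  --username root --password 123456 \
  --table ipj_statics_mysql \
  --export-dir /user/hive/warehouse/ipj_statics_flat \
  --input-fields-terminated-by '\001'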
<5> mysql --> hive
sqoop import --connect \
jdbc:mysql://192.168.10.63/ipj \
--table hive_table_test \
--hive-import \
--hive-table hive_test_table
(Optionally add the --create-hive-table flag; it is a switch that takes no table name and makes the job fail if the target Hive table already exists.)
<6> hive --> mysql
sqoop export --connect \
jdbc:mysql://192.168.10.63/ipj \
--username hive \
--password 123456 \
--table target_table \
--export-dir /user/hive/warehouse/uv/dt=mytable
Prerequisite: the target table must already exist in MySQL; sqoop export will not create it.
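A hedged example of pre-creating the target table with the mysql client; the column names and types are hypothetical and must match the layout of the exported files:
# Hypothetical: create the target table in the ipj database before running the export
mysql -h 192.168.10.63 -u root -p123456 -e "
CREATE TABLE IF NOT EXISTS ipj.target_table (
  uv BIGINT,
  dt VARCHAR(10)
);"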
③ Other Sqoop operations
<1> List all databases in MySQL
sqoop list-databases --connect jdbc:mysql://192.168.10.63:3306/ --username root --password 123456
<2> List all tables in a given MySQL database
sqoop list-tables --connect jdbc:mysql://192.168.10.63:3306/ipj --username root --password 123456
6. Performance of Sqoop1
Test data:
Table name: tb_keywords
Row count: 11,628,209
Data file size: 1.4 GB
Test results:
| Method | HDFS ---> DB | HDFS <--- DB |
| --- | --- | --- |
| Sqoop | 428 s | 166 s |
| HDFS <-> FILE <-> DB | 209 s | 105 s |
From these results, routing through an intermediate file performs better than Sqoop, for the following reasons:
- Sqoop is ultimately built on JDBC, so it will not be more efficient than MySQL's own import/export tools.
- Taking an import into the DB as an example, Sqoop is designed around staged commits: if a table has 1 K rows, it reads 100 rows (the default), inserts and commits them, then reads the next 100 rows, and so on. A sketch of the knobs that influence this behavior follows below.
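If this behavior matters for a given workload, a few options can be tried. This is a hedged sketch only: whether each flag applies, and how much it helps, depends on the Sqoop version and the connector in use, and the table and path names here are hypothetical.
# Import side: let the MySQL connector shell out to mysqldump instead of plain JDBC
# (requires mysqldump to be installed on the task nodes)
sqoop import --connect jdbc:mysql://192.168.10.63/ipj --table tb_keywords \
  --direct -m 4

# Export side: batch JDBC statements and pack more rows per INSERT / per commit
sqoop export -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://192.168.10.63/ipj --table tb_keywords \
  --export-dir /user/flow/tb_keywords --batch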
Even so, Sqoop has its advantages, such as convenience of use and fault tolerance during task execution. In test environments it is still worth considering as a tool when the need arises.