I. Introduction
Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured, and unstructured data sources. Relational databases are examples of structured data sources, with a well-defined schema for the data they store. Cassandra and HBase are examples of semi-structured data sources, and HDFS is an example of an unstructured data source that Sqoop can support.
Official site:
http://sqoop.apache.org/
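Versions:
Sqoop comes in two lines, Sqoop1 and Sqoop2. At the time of writing, the latest Sqoop1 release is 1.4.6 and the latest Sqoop2 release is 1.99.7. The two are not compatible with each other, and Sqoop2 is not intended for production use; it is still mainly a development line. Each build also targets particular Hadoop versions (Hadoop 0.x, Hadoop 1, or Hadoop 2.x), so check which Hadoop version a package supports before downloading. (I did not find a detailed compatibility description on the Sqoop site and judged compatibility only from the download file names.) Sqoop relies on Hadoop MapReduce jobs to move data, so Hadoop must be present in the environment and reachable by Sqoop.
For the download, choose the precompiled bin package and use it directly. You can also download the source and build it locally; make sure a Java environment is available, because Sqoop is written in Java.
1. sqoop1, stable version: sqoop 1.4.6
http://sqoop.apache.org/docs/1.4.6/index.html
http://mirror.bit.edu.cn/apache/sqoop/1.4.6/
Download file names:
sqoop-1.4.6.bin__hadoop-0.23.tar.gz
sqoop-1.4.6.bin__hadoop-1.0.0.tar.gz
sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Source: sqoop-1.4.6.tar.gz
2. sqoop2, latest version: sqoop 1.99.7
http://sqoop.apache.org/docs/1.99.7/index.html
http://mirror.bit.edu.cn/apache/sqoop/1.99.7/
Download file name:
sqoop-1.99.7-bin-hadoop200.tar.gz
Source: sqoop-1.99.7.tar.gz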
II. Installation and Configuration
Downloaded version:
sqoop-1.99.7-bin-hadoop200.tar.gz
Installation: extract the archive into any directory.
tar -zxvf sqoop-1.99.7-bin-hadoop200.tar.gz
mv sqoop-1.99.7-bin-hadoop200 sqoop-1.99.7
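Sqoop directory layout
bin: executable scripts. Sqoop is normally used through the tools in this directory, which are shell or batch scripts.
conf: configuration files. There are currently only two: sqoop_bootstrap.properties and sqoop.properties
docs: I am not sure exactly what this contains; probably the help documentation, which is generally not needed when using Sqoop.
server: contains only a lib directory holding many jar files; this is the Sqoop2 server package.
shell: contains only a lib directory holding many jar files; this is the Sqoop2 shell package.
tools: contains only a lib directory holding many jar files; this is the Sqoop2 tools package.
Configuration
(1) Install the Java JDK
Version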
[root@hadoop-allinone-200-123 hadoop-2.7.3]# java -version
java version "1.7.0_67"
JAVA_HOME
[root@hadoop-allinone conf]# echo $JAVA_HOME
/wdcloud/app/jdk1u7
(2) Hadoop environment
Version
[root@hadoop-allinone-200-123 bin]# ./hadoop version
Hadoop 2.7.3
HADOOP_HOME
[root@hadoop-allinone-200-123 hadoop-2.7.3]# pwd
/wdcloud/app/hadoop-2.7.3
(3) Configure environment variables
Add a system environment variable, HADOOP_HOME, set in this example to /wdcloud/app/hadoop-2.7.3.
It can go in /etc/profile, in a script created under /etc/profile.d, or in ~/.bashrc; any of these works:
Add the Hadoop environment variable to /etc/profile (global environment variables):
export HADOOP_HOME=/wdcloud/app/hadoop-2.7.3
[root@hadoop-allinone-200-123 hadoop-2.7.3]# source /etc/profile
[root@hadoop-allinone-200-123 hadoop-2.7.3]# echo $HADOOP_HOME
/wdcloud/app/hadoop-2.7.3
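Note: this variable is set mainly so that Sqoop can find the jar files and Hadoop configuration files under the following directories:
$HADOOP_HOME/share/hadoop/common
$HADOOP_HOME/share/hadoop/hdfs
$HADOOP_HOME/share/hadoop/mapreduce
$HADOOP_HOME/share/hadoop/yarn
The official site explains that each component can also be configured individually, using the following variables:
$HADOOP_COMMON_HOME = /wdcloud/app/hadoop-2.7.3/share/hadoop/common
$HADOOP_HDFS_HOME = /wdcloud/app/hadoop-2.7.3/share/hadoop/hdfs
$HADOOP_MAPRED_HOME = /wdcloud/app/hadoop-2.7.3/share/hadoop/mapreduce
$HADOOP_YARN_HOME = /wdcloud/app/hadoop-2.7.3/share/hadoop/yarn
If $HADOOP_HOME is already set, it is best not to set these four variables as well, since doing so may cause obscure errors.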
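Since Sqoop depends on those four directories, it can help to confirm they exist before starting the server. The following is only a minimal sketch: the function name check_hadoop_dirs is hypothetical (it is not part of Sqoop or Hadoop), and it assumes the share/hadoop layout listed above.

```shell
# Check that the four library directories exist under a Hadoop root.
# $1 = Hadoop root directory (for example "$HADOOP_HOME").
# Prints one line per missing directory; returns 1 if any is missing.
# check_hadoop_dirs is a hypothetical helper name used for illustration.
check_hadoop_dirs() {
  chd_missing=0
  for chd_sub in common hdfs mapreduce yarn; do
    if [ ! -d "$1/share/hadoop/$chd_sub" ]; then
      echo "missing: $1/share/hadoop/$chd_sub"
      chd_missing=1
    fi
  done
  return "$chd_missing"
}

# Example call (commented out so the block has no side effects):
# check_hadoop_dirs "$HADOOP_HOME" && echo "Hadoop layout looks complete"
```

The function only tests for directories; it does not verify that the jar files inside them match the Hadoop version Sqoop was built against.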
Configure the Sqoop root directory and the path for third-party jars
[root@hadoop-allinone-200-123 hadoop-2.7.3]# vim /etc/profile
export SQOOP_HOME=/wdcloud/app/sqoop-1.99.7
export SQOOP_SERVER_EXTRA_LIB=/wdcloud/app/sqoop-1.99.7/extra
[root@hadoop-allinone-200-123 sqoop-1.99.7]# mkdir extra
Copy the MySQL driver jar file into this directory.
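(4) Configure Hadoop proxy access
Sqoop accesses Hadoop MapReduce through a proxy user, so Hadoop must be configured with the proxy users and groups it accepts.
Locate Hadoop's core-site.xml configuration file (in this example, $HADOOP_HOME/etc/hadoop/core-site.xml) and add:
<property>
  <name>hadoop.proxyuser.$SERVER_USER.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.$SERVER_USER.groups</name>
  <value>*</value>
</property>
$SERVER_USER is the system user that runs the Sqoop2 server. In this example the server runs as the hadoop user, so $SERVER_USER is replaced with hadoop.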
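To make the substitution concrete, the two properties can be generated for any user name. This is a minimal sketch; render_proxyuser is a hypothetical helper name, and the output still has to be pasted by hand inside the configuration element of core-site.xml.

```shell
# Print the two proxyuser properties for a given Sqoop2 server user.
# $1 = system user that runs the Sqoop2 server (for example: hadoop).
# render_proxyuser is a hypothetical helper, not part of Hadoop or Sqoop.
render_proxyuser() {
  cat <<EOF
<property>
  <name>hadoop.proxyuser.$1.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.$1.groups</name>
  <value>*</value>
</property>
EOF
}

# Example call (commented out): print the block for the hadoop user.
# render_proxyuser hadoop
```

The wildcard values accept every host and group, as in the example above; a stricter setup would list specific hosts and groups instead.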
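Note: make sure the server user's id is greater than 1000 (check with the id command). Otherwise the user is treated as a system user at run time and additional configuration may be required; see the official documentation.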
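(5) Sqoop core configuration files
sqoop_bootstrap.properties
This file sets the configuration provider class; the default value is normally fine:
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider
sqoop.properties
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/wdcloud/app/hadoop-2.7.3/etc/hadoop
org.apache.sqoop.security.authentication.type=SIMPLE
org.apache.sqoop.security.authentication.handler=org.apache.sqoop.security.authentication.SimpleAuthenticationHandler
org.apache.sqoop.security.authentication.anonymous=true
Note: the official documentation only mentions the first item above, the path to the MapReduce configuration files. Running with only that setting later produced an authentication exception. The security section of the Sqoop documentation shows that Sqoop2 supports Hadoop's two authentication mechanisms, simple and Kerberos, so I configured simple authentication, and the exception went away.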
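Because a missing or commented-out entry in sqoop.properties is easy to overlook, a small lookup can confirm what is actually set. This is a sketch under assumptions: prop_value is a hypothetical helper name, and it expects plain key=value lines with no spaces around the equals sign.

```shell
# Print the value of one key from a Java-style .properties file.
# $1 = properties file, $2 = exact key name.
# Commented lines (starting with '#') never match, because the text
# before the first '=' must equal the key exactly.
# prop_value is a hypothetical helper, not a Sqoop tool.
prop_value() {
  awk -F= -v key="$2" '$1 == key { sub(/^[^=]*=/, ""); print; exit }' "$1"
}

# Example call (commented out):
# prop_value conf/sqoop.properties org.apache.sqoop.security.authentication.type
```

An empty result means the key is absent or commented out, which for the authentication entries above would be consistent with the exception described.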
III. Running
Verify that the configuration is valid
bin/sqoop2-tool verify
[root@hadoop-allinone-200-123 sqoop-1.99.7]# bin/sqoop2-tool verify
Setting conf dir: /wdcloud/app/sqoop-1.99.7/bin/../conf
Sqoop home directory: /wdcloud/app/sqoop-1.99.7
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
20 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
Verification was successful.
Tool class org.apache.sqoop.tools.tool.VerifyTool has finished correctly.
Start the server
bin/sqoop2-server start
[root@hadoop-allinone-200-123 sqoop-1.99.7]# bin/sqoop2-server start
Setting conf dir: /wdcloud/app/sqoop-1.99.7/bin/../conf
Sqoop home directory: /wdcloud/app/sqoop-1.99.7
Starting the Sqoop2 server...
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
22 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
Sqoop2 server started.
# Starting the server created two directories (they are created in whichever directory the command is run from)
[root@hadoop-allinone-200-123 sqoop-1.99.7]# ll | grep @
drwxr-xr-x 3 root root 23 Dec 18 22:19 @BASEDIR@
drwxr-xr-x 2 root root 58 Dec 18 22:23 @LOGDIR@
# View the Sqoop run logs:
[root@hadoop-allinone-200-123 sqoop-1.99.7]# ll \@LOGDIR\@/
total 136
-rw-r--r-- 1 root root 165 Dec 18 22:22 audit.log
-rw-r--r-- 1 root root 670 Dec 18 22:21 derbyrepo.log
-rw-r--r-- 1 root root 78957 Dec 18 22:22 sqoop.log
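The names @BASEDIR@ and @LOGDIR@ look like unreplaced placeholders rather than intended directory names. Assuming they come from entries in conf/sqoop.properties (an assumption worth confirming first with grep '@' conf/sqoop.properties), a minimal sketch for substituting real paths might look like this. The helper name fill_placeholders and the paths in the example call are hypothetical.

```shell
# Replace the @LOGDIR@ and @BASEDIR@ placeholders in a properties file.
# $1 = input file, $2 = log directory, $3 = base directory.
# Writes the result to stdout and leaves the input file untouched.
# The paths must not contain '|', '&' or backslashes, which sed would
# interpret specially in this substitution.
# fill_placeholders is a hypothetical helper, not part of Sqoop.
fill_placeholders() {
  sed -e "s|@LOGDIR@|$2|g" -e "s|@BASEDIR@|$3|g" "$1"
}

# Example call (commented out; the target paths are only examples):
# fill_placeholders conf/sqoop.properties /wdcloud/app/sqoop-1.99.7/logs \
#   /wdcloud/app/sqoop-1.99.7 > /tmp/sqoop.properties
```

Writing to a separate file first makes it possible to review the result with diff before replacing the original configuration.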
Stop the server
bin/sqoop2-server stop
[root@hadoop-allinone-200-123 sqoop-1.99.7]# bin/sqoop2-server stop
Setting conf dir: /wdcloud/app/sqoop-1.99.7/bin/../conf
Sqoop home directory: /wdcloud/app/sqoop-1.99.7
Stopping the Sqoop2 server...
Sqoop2 server stopped.
Start the client
bin/sqoop2-shell
[root@hadoop-allinone-200-123 sqoop-1.99.7]# bin/sqoop2-shell
Setting conf dir: /wdcloud/app/sqoop-1.99.7/bin/../conf
Sqoop home directory: /wdcloud/app/sqoop-1.99.7
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000>
On success, the Sqoop shell prompt appears: sqoop:000>
At this point, Sqoop 1.99.7 is configured and running.
IV. Common Sqoop Client Commands
Before using Sqoop, make sure that both the Hadoop services and the Sqoop2 server are running. For Hadoop this means not only HDFS (NameNode, DataNode) but also YARN (NodeManager, ResourceManager). There is usually a SecondaryNameNode as well, which periodically checkpoints the NameNode's metadata (it is a helper process, not a hot standby).
[root@hadoop-allinone-200-123 sqoop-1.99.7]# jps
4352 ResourceManager
4195 SecondaryNameNode
2835 QuorumPeerMain
21167 HMaster
4451 NodeManager
2986 QuorumPeerMain
2803 QuorumPeerMain
4030 DataNode
21256 HRegionServer
3905 NameNode
5024 SqoopJettyServer
5186 Jps
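The same check can be scripted by scanning the jps output for the processes named above. This is a minimal sketch: missing_daemons is a hypothetical helper name, and the list of required process names is taken from the listing above, so it assumes the same single-node setup.

```shell
# Read `jps` output on stdin and print each required process that is absent.
# The process name is compared exactly against the second column, so
# SecondaryNameNode is not mistaken for NameNode.
# missing_daemons is a hypothetical helper name used for illustration.
missing_daemons() {
  md_out=$(cat)
  for md_name in NameNode DataNode ResourceManager NodeManager SqoopJettyServer; do
    printf '%s\n' "$md_out" \
      | awk -v n="$md_name" '$2 == n { found = 1 } END { exit !found }' \
      || echo "$md_name"
  done
}

# Example call (commented out): prints nothing when everything is running.
# jps | missing_daemons
```

An empty result only shows that the processes exist; it does not prove that they are healthy or reachable from the Sqoop server.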
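The Sqoop2 client provides an interactive command-line interface. The client first connects to the Sqoop server and passes its parameters to it; the server then invokes MapReduce to run the data import and export jobs.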
Configure the Sqoop server parameters
[root@hadoop-allinone-200-123 sqoop-1.99.7]# bin/sqoop2-shell
Setting conf dir: /wdcloud/app/sqoop-1.99.7/bin/../conf
Sqoop home directory: /wdcloud/app/sqoop-1.99.7
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000> set server --host 192.168.200.123 --port 12000 --webapp sqoop
Server is set successfully
Note: when host, port, and webapp are set, --url can be omitted.
To use --url instead, the syntax is:
set server --url http://sqoop2.company.net:80/sqoop
The port used with --port above, 12000, is the default. According to the official documentation, the final option, --webapp, specifies the name of the Sqoop Jetty server web application.
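The two forms of set server appear to describe the same address. Assuming the client simply joins the three values as http://host:port/webapp (an assumption inferred from the --url example, not confirmed from the Sqoop source), the relationship can be sketched as follows. server_url is a hypothetical helper name.

```shell
# Compose a server URL from host, port and webapp name.
# $1 = host, $2 = port, $3 = webapp name.
# Assumes plain http; a TLS-enabled server would use https instead.
# server_url is a hypothetical helper used only for illustration.
server_url() {
  printf 'http://%s:%s/%s\n' "$1" "$2" "$3"
}

# Example call (commented out):
# server_url 192.168.200.123 12000 sqoop
# prints http://192.168.200.123:12000/sqoop
```

Under that assumption, passing the printed value to set server --url would be equivalent to the --host, --port, and --webapp form used in this example.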
After configuring, verify that the client connects to the server correctly:
sqoop:000> show version --all
client version:
Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
0 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
server version:
Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
API versions:
[v1]
If the server version information is displayed correctly, the connection is working.
View help (type help or \h):
Available commands:
:exit     (:x ) Exit the shell
:history  (:H ) Display, manage and recall edit-line history
help      (\h ) Display this help message
set       (\st ) Configure various client options and settings
show      (\sh ) Display various objects and configuration options
create    (\cr ) Create new object in Sqoop repository
delete    (\d ) Delete existing object in Sqoop repository
update    (\up ) Update objects in Sqoop repository
clone     (\cl ) Create new object based on existing one
start     (\sta) Start job
stop      (\stp) Stop job
status    (\stu) Display status of a job
enable    (\en ) Enable object in Sqoop repository
disable   (\di ) Disable object in Sqoop repository
grant     (\g ) Grant access to roles and assign privileges
revoke    (\r ) Revoke access from roles and remove privileges
For help on a specific command type: help command
View the usage of each command:
sqoop:000> \st
Usage: set [server|option|truststore]
sqoop:000> \sh
Usage: show [server|version|connector|driver|link|job|submission|option|role|principal|privilege]
sqoop:000> \cr
Usage: create [link|job|role]
sqoop:000> \d
Usage: delete [link|job|role]
sqoop:000> \up
Usage: update [link|job]
sqoop:000> \cl
Usage: clone [link|job]
sqoop:000> \sta
Usage: start [job]
sqoop:000> \stp
Usage: stop [job]
sqoop:000> \stu
Usage: status [job]
sqoop:000> \en
Usage: enable [link|job]
sqoop:000> \di
Usage: disable [link|job]
sqoop:000> \g
Usage: grant [role|privilege]
sqoop:000> \r
Usage: revoke [role|privilege]
For example, to exit the interactive command-line tool, enter the :x command:
sqoop:000> :x
[root@hadoop-allinone-200-123 sqoop-1.99.7]#