[Sqoop 1.99.7] A Sqoop Example: Data ETL


 

1. Creating a MySQL link

The MySQL link uses JDBC, so the matching driver jar must be available and the account must have the right access privileges; make sure the Sqoop server host can reach MySQL. Also make sure the MySQL JDBC jar has been copied into ${SQOOP_HOME}/server/lib/.
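For example, assuming the connector jar is named mysql-connector-java-5.1.40-bin.jar and Sqoop lives under /wdcloud/app/sqoop-1.99.7 (both names are assumptions based on this walkthrough's environment), copying the driver in and restarting the server looks roughly like this:

```shell
# Assumed locations -- adjust to your installation.
export SQOOP_HOME=/wdcloud/app/sqoop-1.99.7
cp mysql-connector-java-5.1.40-bin.jar ${SQOOP_HOME}/server/lib/

# Restart the Sqoop 2 server so it picks up the new driver jar
${SQOOP_HOME}/bin/sqoop.sh server stop
${SQOOP_HOME}/bin/sqoop.sh server start
```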

 create link -c generic-jdbc-connector  
An interactive session then starts and prompts you for each parameter:
[Link configuration]
Name: a string identifying this link, e.g. mysql-link-1.
Driver Class: the JDBC driver class to load: com.mysql.jdbc.Driver.
Connection String: jdbc:mysql://localhost/test_db, where test_db is the database name in this example.
Username: the database user, i.e. what you would pass to the mysql client with -u. Here: test.
Password: that user's password.
Fetch Size: not described in the official docs; it is the JDBC fetch size (the number of rows fetched per round trip). I just pressed Enter to take the default.

After these fields, you are offered a map of JDBC properties at an entry# prompt, where you can set arbitrary JDBC connection properties by hand. This example sets only one, protocol=tcp. Press Enter again and you move on to the SQL dialect settings: database vendors implement the SQL standard with small differences, and the following properties describe those differences.
That completes the link, and the prompt returns to sqoop:000>

 

Example:

sqoop:000> create link -connector generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql-link

Database connection

Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://192.168.200.250:3306/testdb
Username: root
Password: *******
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry# protocol=tcp
There are currently 1 values in the map:
protocol = tcp
entry#

SQL Dialect

Identifier enclose:  NOTE: do not just press Enter here, type a single space! If you don't, queries will wrap the MySQL table name in double quotes ("") and fail!
New link was successfully created with validation status OK and name mysql-link
sqoop:000> show link
+------------+------------------------+---------+
|    Name    |     Connector Name     | Enabled |
+------------+------------------------+---------+
| mysql-link | generic-jdbc-connector | true    |
+------------+------------------------+---------+

 

2. Creating an HDFS link

create link -connector hdfs-connector  

The HDFS link takes only two parameters: a Name and an HDFS URI. Name is as above; the URI is the value of the fs.defaultFS property in Hadoop's core-site.xml. Here it is hdfs://localhost:9000. Press Enter and, if nothing went wrong, a success message is printed.
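If you are unsure of the URI, you can read fs.defaultFS from the cluster configuration; a quick sketch (HADOOP_HOME is assumed from this walkthrough's environment):

```shell
# Print the configured default filesystem URI
hdfs getconf -confKey fs.defaultFS

# Or look it up directly in core-site.xml
grep -A1 fs.defaultFS ${HADOOP_HOME}/etc/hadoop/core-site.xml
```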

Example:

sqoop:000> create link -connector hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs-link

HDFS cluster

URI: hdfs://hadoop-allinone-200-123.wdcloud.locl:9000
Conf directory:
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name hdfs-link

 

View the created links:

sqoop:000> show link
+--------------+------------------------+---------+
|     Name     |     Connector Name     | Enabled |
+--------------+------------------------+---------+
| hdfs-link-1  | hdfs-connector         | true    |
| mysql-link-1 | generic-jdbc-connector | true    |
+--------------+------------------------+---------+

 

 

3. Creating the transfer job

create job -f "mysql-link" -t "hdfs-link"

-f specifies the "from" side, i.e. the data source, and -t the "to" side, i.e. the destination. This example moves data from MySQL to HDFS, so it is from mysql-link to hdfs-link. The values are the Names given when the links were created.

Here are the settings:
Name: an identifier of your choosing.
Schema Name: the Database/Schema name. In MySQL a schema is much the same thing as a database (I haven't dug into the exact difference, but the official docs treat them almost identically). Here it is testdb, this example's database.
Table Name: the table to export, table001 in this example. For the multi-table case, see the official documentation.
SQL Statement: a SQL query to use instead of a table. The docs say it must contain a ${CONDITIONS} placeholder; I never managed to get one to work, it appears to be a substituted condition clause.
After those fields an element# prompt appears, asking for a list of values; I entered nothing and just pressed Enter. The next few settings I also skipped with Enter, taking the defaults; they are database-side partitioning parameters:
Partition column:
Partition column nullable:
Boundary query
Last value
Then come the settings for the destination:
Null value: what to write in place of NULL values.
File format: the format of the data files in HDFS; TEXT_FILE here, i.e. plain text.
Compression codec: the compression algorithm for the exported files. I chose NONE; CUSTOM lets you plug in your own algorithm by implementing the corresponding Java interface.
Custom codec: the custom codec class from the previous setting; since I chose NONE, just press Enter.
Output directory: the target path in HDFS. It seems the path must either not exist yet, or exist but be empty, for the job to succeed.
Append mode: whether new data should be appended when export files already exist.
Extractors: I wasn't sure what this meant, so I used 1 (it sets the extract-side parallelism, i.e. the number of map tasks).
Loaders: likewise; this sets the load-side parallelism.
Finally another element# prompt appears, for the extra mapper jars property; you can leave it empty and press Enter.

If a success message appears at this point, the job has been created.

 

Example:

First create a database named testdb in MySQL, and a table table001 inside it.

 

Create the job:

sqoop:000> create job -f "mysql-link" -t "hdfs-link"  
Creating job for links with from name mysql-link and to name hdfs-link
Please fill following values to create new job object
Name: job1

Database source

Schema name: testdb
Table name: table001
SQL statement: 
Column names: 
There are currently 0 values in the list:
element# 
Partition column: 
Partition column nullable: 
Boundary query: 

Incremental read

Check column: 
Last value: 

Target configuration

Override null value: 
Null value: 
File format: 
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec: 
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom codec: 
Output directory: /wdcloud/app/sqoop-1.99.7/import_data
Append mode: 

Throttling resources

Extractors: 2
Loaders: 2

Classpath configuration

Extra mapper jars: 
There are currently 0 values in the list:
element# 
New job was successfully created with validation status OK and name job1

 

View the job:

sqoop:000> show job
+----+------+-------------------------------------+----------------------------+---------+
| Id | Name |           From Connector            |        To Connector        | Enabled |
+----+------+-------------------------------------+----------------------------+---------+
| 2  | job1 | mysql-link (generic-jdbc-connector) | hdfs-link (hdfs-connector) | true    |
+----+------+-------------------------------------+----------------------------+---------+

 

 

4. Running the job and checking its status

start job -n jobname  
status job -n jobname  
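start job returns immediately by default; as far as I can tell, the 1.99.7 shell also accepts a -s flag to run synchronously and report progress until the job finishes, which is handy for quick tests:

```
sqoop:000> start job -n job1 -s    # block until the job completes
sqoop:000> status job -n job1      # poll an asynchronously started job
```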

 

Starting the job fails:

sqoop:000> start job -n job1 
Exception has occurred during processing command 
Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0001:Unable to get a connection - 

 

Enable verbose output to see the detailed error:

sqoop:000> set option --name verbose --value true
Verbose option was changed to true

 

The full error: it looks like the MySQL host cannot be found:

sqoop:000> start job -n job1 
Exception has occurred during processing command 
Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0001:Unable to get a connection - 
Stack trace:
     at  org.apache.sqoop.client.request.ResourceRequest (ResourceRequest.java:137)  
     at  org.apache.sqoop.client.request.ResourceRequest (ResourceRequest.java:187)  
     at  org.apache.sqoop.client.request.JobResourceRequest (JobResourceRequest.java:113)  
...
The key line: Caused by: Exception: java.lang.Throwable Message: Communications link failure Caused by: Exception: java.net.UnknownHostException Message: mysql.server: Name or service not known

 

It turned out the URL in the MySQL link was wrong. Change it to: jdbc:mysql://192.168.200.250:3306/testdb

 

The next failure

The job kept dying with the error below, and I couldn't see why the schema and table names were being wrapped in double quotes:

Exception has occurred during processing command 
Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0016:Can't fetch schema - 

Caused by: Exception: java.lang.Throwable Message: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"testdb"."table001"' at line 1

The fix: when creating the MySQL link, Identifier enclose sets the delimiter placed around SQL identifiers. Some databases quote identifiers with double quotes, as in select * from "table_name", but that syntax is an error in MySQL. The property defaults to a double quote, so you cannot simply press Enter; you must override it, and I overrode it with a single space. (This one error took me a whole day to figure out; the official docs were no help.) So go back and update the MySQL link.
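Rather than deleting and recreating the link, you can edit it in place with the shell's update command, which replays the same prompts (link name taken from the example above):

```
sqoop:000> update link -n mysql-link
# walk through the same prompts again;
# at "Identifier enclose:" type a single space, then press Enter
```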

 

Yet another error

2016-12-19 03:16:43 EST: FAILURE_ON_SUBMIT 
Exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/tmp/hadoop-yarn/staging/root/.staging":hadoop:supergroup:drwxr-xr-x
The error says writing to HDFS is not allowed: user root has no write permission on the staging directory. Switch to the hadoop user and open up the /tmp directory in HDFS:
[root@hadoop-allinone-200-123 bin]# su hadoop
[hadoop@hadoop-allinone-200-123 bin]$ ./hadoop fs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 21:50 /hbase

# Create the /tmp directory
[hadoop@hadoop-allinone-200-123 bin]$ ./hadoop fs -mkdir /tmp
[hadoop@hadoop-allinone-200-123 bin]$ ./hadoop fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 21:50 /hbase
drwxr-xr-x   - hadoop supergroup          0 2016-12-19 03:34 /tmp

# Grant mode 777 on /tmp
[hadoop@hadoop-allinone-200-123 bin]$ ./hadoop fs -chmod 777 /tmp
[hadoop@hadoop-allinone-200-123 bin]$ ./hadoop fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2016-11-24 21:50 /hbase
drwxrwxrwx   - hadoop supergroup          0 2016-12-19 03:34 /tmp

 

Finally, no more errors on start:

sqoop:000> start job -n job1
Submission details
Job Name: job1
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-12-19 03:35:15 EST
Lastly updated by: root
External ID: job_1479957438728_0001
    http://hadoop-allinone-200-123.wdcloud.locl:8088/proxy/application_1479957438728_0001/
Source Connector schema: Schema{name= testdb . table001 ,columns=[
    FixedPoint{name=id,nullable=true,type=FIXED_POINT,byteSize=4,signed=true},
    Text{name=name,nullable=true,type=TEXT,charSize=null},
    Text{name=address,nullable=true,type=TEXT,charSize=null}]}
2016-12-19 03:35:15 EST: BOOTING  - Progress is not available

 

But checking the job's status fails again:

sqoop:000> status job -n job1
Exception has occurred during processing command 
Exception: org.apache.sqoop.common.SqoopException Message: MAPREDUCE_0003:Can't get RunningJob instance - 

Caused by: Exception: java.io.IOException Message: java.net.ConnectException: Call From hadoop-allinone-200-123.wdcloud.locl/192.168.200.123 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Caused by: Exception: java.net.ConnectException Message: Connection refused

In other words: the MapReduce job cannot connect to 0.0.0.0:10020.

This happens when the JobHistory Server has not been started. The fix: add the following to mapred-site.xml:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop-allinone-200-123.wdcloud.locl:10020</value>
</property>
Then run this on the namenode: mr-jobhistory-daemon.sh start historyserver
This starts the JobHistoryServer service on the namenode; its log shows how jobs ran.
[hadoop@hadoop-allinone-200-123 sbin]$ pwd
/wdcloud/app/hadoop-2.7.3/sbin

[hadoop@hadoop-allinone-200-123 sbin]$ ll | grep jobhistory
-rwxr-xr-x 1 hadoop hadoop 4080 Aug 17 21:49 mr-jobhistory-daemon.sh

[hadoop@hadoop-allinone-200-123 sbin]$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /wdcloud/app/hadoop-2.7.3/logs/mapred-hadoop-historyserver-hadoop-allinone-200-123.wdcloud.locl.out

[hadoop@hadoop-allinone-200-123 sbin]$ jps | grep JobHistoryServer
16818 JobHistoryServer

 

The job finally runs, but then fails:

sqoop:000> status job -n job1
Submission details
Job Name: job1
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-12-19 04:01:25 EST
Lastly updated by: root
External ID: job_1479957438728_0004
    http://hadoop-allinone-200-123.wdcloud.locl:8088/proxy/application_1479957438728_0004/
2016-12-19 04:02:51 EST: RUNNING  - 0.00 %

 

The failure (out of memory, apparently?):

sqoop:000> status job -n job1
Submission details
Job Name: job1
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2016-12-19 04:01:25 EST
Lastly updated by: root
External ID: job_1479957438728_0004
    http://hadoop-allinone-200-123.wdcloud.locl:8088/proxy/application_1479957438728_0004/
2016-12-19 04:04:48 EST: FAILED 
Exception: Job Failed with status:3

 

 

Maps work, reduces don't, wonderful! The map tasks were running out of memory (the error screenshots from the original post are not reproduced here).

 

Set this in mapred-site.xml (the default heap is 200 MB):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
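mapred.child.java.opts is the older umbrella setting covering both task types; on Hadoop 2.x you can also size the map and reduce JVMs separately (a sketch, with illustrative values):

```xml
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2048m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

Keep the container sizes (mapreduce.map.memory.mb / mapreduce.reduce.memory.mb) at least as large as the heap plus some overhead, or YARN will kill the containers.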

 

Reduce error details (shown as a screenshot in the original post; not reproduced here).

 

