I wanted to see for myself how this elephant dances. Below is the complete process I went through to install the JDK and Hadoop on Ubuntu 12.10.
Note: at first I searched all over the web for a proper way to install Hadoop, and the process was rather frustrating; only later did I find that the official documentation already gives a reliable installation procedure: Hadoop Single Node Setup
I. Install the Java development environment (Ubuntu ships with OpenJDK: run java -version to check the version, or sudo apt-get install java will report that OpenJDK is already installed)
1. Download jdk-6u37-linux-i586.bin with Firefox; the downloaded file ends up in /home/baron/Downloads/
2. Create a java directory under /usr/: sudo mkdir /usr/java
3. Copy the file into the new directory: sudo cp /home/baron/Downloads/jdk-6u37-linux-i586.bin /usr/java
4. In /usr/java, change the file permissions so it can be executed: sudo chmod u+x jdk-6u37-linux-i586.bin
5. Run the file: sudo ./jdk-6u37-linux-i586.bin . After this, /usr/java/ contains the .bin package and the unpacked jdk1.6.0_37 folder.
6. Configure the JDK environment variables in the profile: sudo vi /etc/profile and append the following lines at the end (be very careful not to mistype them, otherwise you will not be able to get into the desktop; if that happens, press Ctrl+Alt+F1 to reach a console, log in with your username and password, and run vi /etc/profile to correct the file):
export JAVA_HOME=/usr/java/jdk1.6.0_37
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
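To have these take effect in the current shell and to confirm that the new JDK is being picked up, something like the following should work (a quick check, assuming the paths above):
$ source /etc/profile
$ java -version
If the variables are correct, the reported version should be 1.6.0_37 rather than the OpenJDK default.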
II. Install ssh (Hadoop uses ssh for login authentication between the nodes of a cluster; passphrase-less ssh setup is covered later)
sudo apt-get install ssh
III. Install rsync (this version of Ubuntu already ships with rsync)
sudo apt-get install rsync
IV. Install Hadoop
1. Create the hadoop user group and user:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
A new hadoop folder will appear under /home/; at this point it is best to log in to Ubuntu as the newly created hadoop user.
2. Copy the downloaded Hadoop archive into that new folder: sudo cp /home/baron/Downloads/hadoop-1.0.4-bin.tar.gz /home/hadoop/
3. Enter the directory (cd /home/hadoop/) and unpack the file: sudo tar xzf hadoop-1.0.4-bin.tar.gz
4. Go to the directory containing hadoop-env.sh (/home/hadoop/hadoop-1.0.4/conf/) and edit that file so that it contains: export JAVA_HOME=/usr/java/jdk1.6.0_37
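If you prefer to make the change from the command line, a hedged one-liner (assuming the JAVA_HOME line in conf/hadoop-env.sh is still the commented-out template from a fresh unpack):
$ cd /home/hadoop/hadoop-1.0.4/conf
$ sudo sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.6.0_37|' hadoop-env.sh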
5. By default Hadoop runs in Standalone Operation mode. You can test it by following the official documentation:
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
6. Alternatively, use the Pseudo-Distributed Operation mode, following the official documentation:
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration. Use the following:
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
7. Test whether you can ssh into localhost (I forgot to copy the prompt that appeared on screen; if one appears, type yes):
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot log in, generate a key explicitly:
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
The screen output (some data replaced with *):
Generating public/private dsa key pair.
Your identification has been saved in /home/hadoop/.ssh/id_dsa.
Your public key has been saved in /home/hadoop/.ssh/id_dsa.pub.
The key fingerprint is:
b3:5d:c4:*** hadoop@Baron-SR25E
The key's randomart image is:
+--[ DSA 1024]----+
| ...o E... |
| . ...= .. |
| o .. + |
| . * |
| S + o |
| = = . |
| . o o o |
| . o . |
| ... . |
+-----------------+
Log in over ssh without entering a passphrase:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Appending the ssh public key to authorized_keys is what enables passphrase-less login.
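To confirm the setup, ssh into localhost once more; it should no longer ask for a passphrase. If it still does, overly permissive permissions on ~/.ssh are a common cause, and tightening them usually helps:
$ chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
$ ssh localhost
$ exit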
8. Format the namenode:
Format a new distributed-filesystem:
$ bin/hadoop namenode -format
12/11/10 16:25:48 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Baron-SR25E/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
12/11/10 16:25:49 INFO util.GSet: VM type = 32-bit
12/11/10 16:25:49 INFO util.GSet: 2% max memory = 17.77875 MB
12/11/10 16:25:49 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/11/10 16:25:49 INFO util.GSet: recommended=4194304, actual=4194304
12/11/10 16:25:49 INFO namenode.FSNamesystem: fsOwner=root
12/11/10 16:25:49 INFO namenode.FSNamesystem: supergroup=supergroup
12/11/10 16:25:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/11/10 16:25:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/11/10 16:25:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/11/10 16:25:49 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/11/10 16:25:50 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/11/10 16:25:50 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
12/11/10 16:25:50 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Baron-SR25E/127.0.1.1
************************************************************/
9. Run Hadoop following the example given in the official documentation (remember: ssh into localhost first):
Start the hadoop daemons:
$ bin/start-all.sh
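Once the daemons are up, the JDK's jps tool gives a quick sanity check; in this pseudo-distributed setup the five Hadoop processes (NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker) should all appear in its listing:
$ jps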
If you get errors such as failure to create directories, you can prefix the command with sudo, but then you will be told that this user is not allowed to use sudo; in that case fix it as follows:
1) Switch to superuser mode, i.e. enter "su -":
su -
The system asks for the superuser password; after entering it you are in superuser (root) mode. Note the "-": this is different from plain su. "su" only switches to root but keeps the current user's environment variables, whereas "su -" brings over root's environment as well, just as if you had logged in as root.
2) Add write permission to the file by entering:
chmod u+w /etc/sudoers
3) Edit the /etc/sudoers file by entering:
vi /etc/sudoers
In the editor, find this line:
root ALL=(ALL:ALL) ALL
and add below it:
hadoop ALL=(ALL:ALL) ALL
where hadoop is your username; then save and exit.
4) Remove the write permission again by entering:
chmod u-w /etc/sudoers
Then run the start command above again; Hadoop should now start without problems.
10. Continue with the commands from the example in the official documentation:
Browse the web interface for the NameNode and the JobTracker; by default they are available at:
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesytem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
Seeing the result shown in the figure, I was fairly satisfied, even though I do not yet fully understand what each of the values means; that is something to dig into another day.
Note: the various error messages reported while running Hadoop commands turned out to be caused mainly by the current user lacking read/write permission on the files. Once the user has read/write access to /home/hadoop/hadoop-1.0.4, these problems no longer appear; that is the approach in step IV-9, which I found online and have verified myself.
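A minimal sketch of granting that access up front (assuming the hadoop user and group created in step IV-1 and the unpack location used above):
$ sudo chown -R hadoop:hadoop /home/hadoop/hadoop-1.0.4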
Additional note:
If you want to run the hadoop command directly from a terminal, you also need to extend the PATH environment variable in /etc/profile, for example:
export HADOOP=/home/hadoop/hadoop-1.0.4
export PATH=$HADOOP/bin:$PATH
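After reloading the profile, the hadoop command should then resolve from any directory; a quick check (assuming the paths above):
$ source /etc/profile
$ hadoop version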