Environment
- VirtualBox 6.1
- IntelliJ IDEA 2020.1.1
- Ubuntu-18.04.4-live-server-amd64
- jdk-8u251-linux-x64
- hadoop-2.7.7
Installing Pseudo-Distributed Hadoop
For the pseudo-distributed installation, follow the tutorial "Hadoop安裝教程_單機/偽分布式配置_Hadoop2.6.0(2.7.1)/Ubuntu14.04(16.04)" (linked in the references at the end).
I will not repeat those steps here; just note that YARN also needs to be set up.
Also, I use host-only networking for the VirtualBox VM.
After everything starts successfully, run jps; the listing should include the following processes:
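For the pseudo-distributed setup used here (HDFS plus YARN, with the history server running), that is typically:
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
JobHistoryServer
Jps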
Modifying the Configuration
First, run ifconfig to find the server's IP address. Mine is 192.168.56.101, which is used in all the examples below.
Edit core-site.xml and change localhost to the server's IP:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.56.101:9000</value>
</property>
Edit mapred-site.xml and add mapreduce.jobhistory.address:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>192.168.56.101:10020</value>
</property>
Without this property, the job fails with:
[main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Edit yarn-site.xml and add the following properties:
<property>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.56.101:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>192.168.56.101:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>192.168.56.101:8031</value>
</property>
Without these properties, the job fails with:
INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
After changing the configuration, restart HDFS, YARN, and the JobHistory server, for example with the commands below.
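On the server that means, for example (assuming the scripts from $HADOOP_HOME/sbin are on the PATH):
stop-dfs.sh && start-dfs.sh
stop-yarn.sh && start-yarn.sh
mr-jobhistory-daemon.sh stop historyserver
mr-jobhistory-daemon.sh start historyserver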
Setting Up the Hadoop Runtime Environment on Windows
First, extract the same hadoop-2.7.7.tar.gz used on Linux into a directory on Windows; in this article it is D:\ProgramData\hadoop
Then set the environment variables:
HADOOP_HOME=D:\ProgramData\hadoop
HADOOP_BIN_PATH=%HADOOP_HOME%\bin
HADOOP_PREFIX=D:\ProgramData\hadoop
Also, append ;%HADOOP_HOME%\bin to the end of the PATH variable.
Next, download winutils from https://github.com/cdarlint/winutils; pick the build matching your Hadoop version, here 2.7.7.
Copy winutils.exe into %HADOOP_HOME%\bin and copy hadoop.dll into C:\Windows\System32.
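As a quick sanity check (this assumes JAVA_HOME is also configured on Windows), you can open a new cmd window and run:
hadoop version
It should print the Hadoop 2.7.7 version information.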
Writing WordCount
First, create the data file wc.txt:
hello world
dog fish
hadoop
spark
hello world
dog fish
hadoop
spark
hello world
dog fish
hadoop
spark
Then move it to the Linux machine and put it into HDFS with
hdfs dfs -put /path/wc.txt ./input
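If the input directory does not exist yet under the hadoop user's HDFS home, create it first:
hdfs dfs -mkdir -p input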
Then create a new Maven project in IDEA and edit the pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <repositories>
        <repository>
            <id>aliyun</id>
            <name>aliyun</name>
            <url>https://maven.aliyun.com/repository/central/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.1.1</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>${project.artifactId}</finalName>
    </build>
</project>
Next comes the WordCount program itself. I based it on
https://www.cnblogs.com/frankdeng/p/9256254.html
and then modified WordcountDriver as follows.
package cabbage;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Acts as the client of the YARN cluster: it wraps the runtime parameters
 * of our MR program, points at the jar to ship, and finally submits the
 * job to YARN.
 */
public class WordcountDriver {
    /**
     * Delete the given directory if it exists.
     *
     * @param conf
     * @param dirPath
     * @throws IOException
     */
    private static void deleteDir(Configuration conf, String dirPath) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path targetPath = new Path(dirPath);
        if (fs.exists(targetPath)) {
            boolean delResult = fs.delete(targetPath, true);
            if (delResult) {
                System.out.println(targetPath + " has been deleted successfully.");
            } else {
                System.out.println(targetPath + " deletion failed.");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        // 1. Get the configuration / job instance
        Configuration configuration = new Configuration();
        System.setProperty("hadoop.home.dir", "D:\\ProgramData\\hadoop");
        configuration.set("mapreduce.framework.name", "yarn");
        configuration.set("fs.default.name", "hdfs://192.168.56.101:9000");
        configuration.set("mapreduce.app-submission.cross-platform", "true"); // cross-platform submission
        configuration.set("mapred.jar", "D:\\Work\\Study\\Hadoop\\WordCount\\target\\WordCount.jar");
        // 8. Settings for submitting to YARN; Windows and Linux differ here
        // configuration.set("mapreduce.framework.name", "yarn");
        // configuration.set("yarn.resourcemanager.hostname", "node22");
        // Delete the output directory first
        deleteDir(configuration, args[args.length - 1]);
        Job job = Job.getInstance(configuration);
        // 6. Local path of this program's jar
        // job.setJar("/home/admin/wc.jar");
        job.setJarByClass(WordcountDriver.class);
        // 2. Mapper/Reducer classes used by this job
        job.setMapperClass(WordcountMapper.class);
        job.setCombinerClass(WordcountReducer.class);
        job.setReducerClass(WordcountReducer.class);
        // 3. Key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 4. Key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 5. Input and output directories of the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job configuration and the jar containing the job classes to YARN and wait
        // job.submit();
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
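The driver references WordcountMapper and WordcountReducer, which come from the article linked above and are not reproduced there in this post. A minimal sketch of what they look like (package and class names are assumed to match the driver; each class goes in its own source file):
package cabbage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits each input line into words and emits (word, 1) for every word.
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

package cabbage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the per-word counts and emits (word, total); also usable as the combiner.
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}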
Key Code
System.setProperty("HADOOP_USER_NAME", "hadoop");
Without this line, the job fails with a permission error:
org.apache.hadoop.ipc.RemoteException: Permission denied: user=administration, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
If it still fails after that change, consider setting the HDFS permissions of the target directory to 777, for example as shown below.
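An illustrative command (the path is only an example; point it at whatever directory the job actually writes to):
hdfs dfs -chmod -R 777 /user/hadoop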
Here I mainly referred to these articles:
https://www.cnblogs.com/acmy/archive/2011/10/28/2227901.html
https://blog.csdn.net/jzy3711/article/details/85003606
System.setProperty("hadoop.home.dir", "D:\\ProgramData\\hadoop");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("fs.default.name", "hdfs://192.168.56.101:9000");
configuration.set("mapreduce.app-submission.cross-platform", "true");//跨平台提交
configuration.set("mapred.jar","D:\\Work\\Study\\Hadoop\\WordCount\\target\\WordCount.jar");
Without the mapred.jar line, the job fails with:
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class cabbage.WordcountMapper not found
Here I mainly referred to:
https://blog.csdn.net/u011654631/article/details/70037219
// Delete the output directory first
deleteDir(configuration, args[args.length - 1]);
The output directory is not overwritten between runs, so if it already exists and is not deleted the job fails; this should be familiar.
Adding Dependencies
Next, add the library references: right-click the project -> Open Module Settings (or press F12) to open the module properties.
Then click Dependencies -> the + button on the right -> Library.
Import the relevant jars from under $HADOOP_HOME, and also import everything under $HADOOP_HOME\share\hadoop\tools\lib.
Then build the jar with Maven's package goal.
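From a terminal in the project root that is simply:
mvn clean package
Because of the finalName setting in pom.xml, the jar ends up at target\WordCount.jar, which is exactly the path that mapred.jar points to in the driver.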
Adding Resources
Create log4j.properties under resources with the following content:
log4j.rootLogger=INFO, stdout
#log4j.logger.org.springframework=INFO
#log4j.logger.org.apache.activemq=INFO
#log4j.logger.org.apache.activemq.spring=WARN
#log4j.logger.org.apache.activemq.store.journal=INFO
#log4j.logger.org.activeio.journal=INFO
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
Then also copy core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml over from the Linux machine. The final project structure is shown in the figure below.
Configuring IDEA
With all of the above in place, you can set up the run configuration. Two settings need attention:
- Program arguments: the input file and the output directory, both given as full HDFS URIs of the form hdfs://ip:9000/user/hadoop/xxx (see the example below)
- Working Directory: set this to the directory that $HADOOP_HOME points to
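With the paths used in this article (assuming wc.txt was uploaded into the hadoop user's HDFS home as above), the Program arguments would be:
hdfs://192.168.56.101:9000/user/hadoop/input hdfs://192.168.56.101:9000/user/hadoop/output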
Running
Click Run. If it reports a missing dependency, for example in my case the slf4j-log package, just add that jar to the dependencies yourself.
When the run finishes, IDEA shows output like the figure below:
Then check the results in the output directory: on Linux, run hdfs dfs -cat ./output/*
and if the output looks like the following, everything worked.
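For the wc.txt above, the counts should come out as (word and count separated by a tab):
dog	3
fish	3
hadoop	3
hello	3
spark	3
world	3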
If you run into any problems, feel free to raise them in the comments and we can discuss ;)
References
- http://dblab.xmu.edu.cn/blog/install-hadoop/
- https://blog.csdn.net/u011654631/article/details/70037219
- https://www.cnblogs.com/yjmyzz/p/how-to-remote-debug-hadoop-with-eclipse-and-intellij-idea.html
- https://www.cnblogs.com/frankdeng/p/9256254.html
- https://www.cnblogs.com/acmy/archive/2011/10/28/2227901.html
- https://blog.csdn.net/djw745917/article/details/88703888
- https://www.jianshu.com/p/7a1f131469f5