The usual way to run a Hadoop job is to write the program, package it into a jar, upload the jar to the server, and run it there. An alternative is to submit the jar to the remote cluster directly from IDEA on your local machine.

The code

A simple example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;

public class WordCount {

    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable result = new IntWritable(1);
        private final static Text Key = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            for (String word : words) {
                Key.set(word);
                context.write(Key, result);
            }
        }
    }

    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Set the Hadoop user name
        System.setProperty("HADOOP_USER_NAME", "611");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.1.237:9000");
        conf.set("mapreduce.app-submission.cross-platform", "true");
        // Path of the jar to submit
        conf.set("mapred.jar", "D:\\MyFile\\實驗室項目\\2021大數據項目\\out\\artifacts\\wordCount\\wordCount.jar");
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        // Run the job on YARN (the default is local)
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "master");
        // Let the client reach datanodes by hostname; fixes
        // "could only be replicated to 0 nodes instead of minReplication (=1).
        // There are 1 datanode(s) running and 1 node(s) are excluded in this operation"
        conf.set("dfs.client.use.datanode.hostname", "true");
        Job job = Job.getInstance(conf, "wordCount"); // job name
        job.setJarByClass(WordCount.class);
        String inputFile = "hdfs://192.168.1.237:9000/test/wordCount.txt";
        String outFile = "hdfs://192.168.1.237:9000/test/output";
        // Delete the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.237:9000"), conf);
        if (fs.exists(new Path(outFile))) {
            fs.delete(new Path(outFile), true);
        }
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(inputFile));
        FileOutputFormat.setOutputPath(job, new Path(outFile));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
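Stripped of the Hadoop plumbing, the Mapper and Reducer above amount to splitting each line on spaces and summing a count of 1 per word. A plain-Java sketch of that logic (no cluster or Hadoop jars needed), handy for sanity-checking the expected output before submitting:

```java
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
    // Same computation as WordCountMapper + WordCountReduce, but in-memory:
    // the mapper's split(" ") per line, then the reducer's per-key summation.
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum); // reduce step: sum counts per word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(new String[]{"hello world", "hello hadoop"});
        System.out.println(c.get("hello")); // 2
        System.out.println(c.get("world")); // 1
    }
}
```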
A few of these settings deserve attention:
- yarn.resourcemanager.hostname: the hostname of the ResourceManager, normally whatever you configured in yarn-site.xml
- mapreduce.app-submission.cross-platform: enables cross-platform application submission; set it to true when submitting from a different OS (e.g. Windows to a Linux cluster)
- mapred.jar: the jar built from this program. You must build the jar before you can submit the job remotely, and if the jar is built incorrectly the job will fail!! In this example the main class is WordCount
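Since a wrong or missing mapred.jar path only fails after the job has been submitted, a quick local existence check before calling conf.set("mapred.jar", ...) can save a round trip to the cluster. A minimal sketch (the path shown is hypothetical; substitute your own build output):

```java
import java.io.File;

public class JarCheck {
    // Returns true only if the path points at an existing file with a .jar suffix.
    // This does not validate the jar's contents, only that a build artifact is there.
    public static boolean jarExists(String path) {
        File jar = new File(path);
        return jar.isFile() && jar.getName().endsWith(".jar");
    }

    public static void main(String[] args) {
        String jarPath = "out/artifacts/wordCount/wordCount.jar"; // hypothetical path
        if (!jarExists(jarPath)) {
            System.err.println("jar not found: " + jarPath + " -- rebuild the artifact first");
        }
    }
}
```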
Dependencies

The program needs the basic Hadoop and YARN dependency packages. It is not obvious exactly which ones are required, so I added every package that might be needed; some of them may not be strictly necessary, and you can try trimming the list yourself.
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
        <version>2.7.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn-client</artifactId>
        <version>2.7.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-api -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn-api</artifactId>
        <version>2.7.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn-common</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-jobclient -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.7.7</version>
    </dependency>
</dependencies>
Configuration files

For remote submission to work, a few settings in yarn-site.xml and mapred-site.xml on the cluster must be configured properly.

Relevant settings in mapred-site.xml:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Address of the job history service -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <!-- Address of the job history web UI -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
Relevant settings in yarn-site.xml:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Hostname of the YARN ResourceManager; this one is mandatory -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Aggregate the logs produced while jobs run -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Where to view the aggregated history logs -->
<property>
    <name>yarn.log.server.url</name>
    <value>master:19888/jobhistory</value>
</property>
The Hadoop job history service is not started together with the rest of the cluster; it must be started separately:
${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh start historyserver
Viewing running jobs

You can view the jobs running on YARN and their status at master:8088.
To view job history, the history server must be running; start it with the command shown above.
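Besides the web UI at master:8088, the ResourceManager exposes the same application list over its REST API at /ws/v1/cluster/apps (on the port set by yarn.resourcemanager.webapp.address, 8088 by default). A small sketch that only builds the request URL, since actually fetching it requires the cluster to be reachable:

```java
public class RmApps {
    // Builds the YARN ResourceManager REST endpoint that lists applications.
    // The host/port must match yarn.resourcemanager.webapp.address on the cluster.
    public static String appsUrl(String host, int port) {
        return "http://" + host + ":" + port + "/ws/v1/cluster/apps";
    }

    public static void main(String[] args) {
        System.out.println(appsUrl("master", 8088));
        // When the cluster is reachable, fetch this URL with any HTTP client;
        // the response is JSON describing each running/finished application.
    }
}
```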