Submitting jobs from IDEA to run on YARN


The traditional way to run a Hadoop job is to write the program, package it as a jar, upload the jar to a server in the cluster, and run it there.

An alternative is to submit the jar to the remote cluster directly from a local IDEA instance.

Sample code

A simple example:


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;

public class WordCount {

    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final static Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Split on runs of whitespace; a plain split(" ") would emit
            // empty tokens when words are separated by multiple spaces.
            String[] words = line.split("\\s+");
            for (String word : words) {
                outKey.set(word);
                context.write(outKey, one);
            }
        }
    }


    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable>{
        private static final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value: values){
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Run the job as this Hadoop user
        System.setProperty("HADOOP_USER_NAME", "611");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.1.237:9000");
        conf.set("mapreduce.app-submission.cross-platform", "true");
        // Local path of the job jar that will be shipped to the cluster
        conf.set("mapred.jar", "D:\\MyFile\\實驗室項目\\2021大數據項目\\out\\artifacts\\wordCount\\wordCount.jar");
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        // Run the job on YARN instead of the default local runner
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "master");
        // Let the client reach datanodes by hostname; this works around
        // "could only be replicated to 0 nodes instead of minReplication (=1).
        // There are 1 datanode(s) running and 1 node(s) are excluded in this operation"
        conf.set("dfs.client.use.datanode.hostname", "true");

        Job job = Job.getInstance(conf, "wordCount"); // job name
        job.setJarByClass(WordCount.class);

        String inputFile = "hdfs://192.168.1.237:9000/test/wordCount.txt";
        String outFile = "hdfs://192.168.1.237:9000/test/output";

        // Delete the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.237:9000"), conf);
        if (fs.exists(new Path(outFile))) {
            fs.delete(new Path(outFile), true);
        }

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(inputFile));
        FileOutputFormat.setOutputPath(job, new Path(outFile));

        System.exit(job.waitForCompletion(true) ? 0: 1);
    }
}
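The mapper and reducer logic above can be sanity-checked locally without a cluster. The sketch below mirrors it in plain Java, with no Hadoop types (the class and method names here are illustrative, not part of any Hadoop API); splitting on `\s+` rather than a single space keeps runs of whitespace from producing empty tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalWordCount {
    // Mirrors WordCountMapper (tokenize) + WordCountReduce (sum per word).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            // \s+ collapses runs of whitespace; split(" ") would yield
            // empty tokens on double spaces.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // the reduce step
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello world\nhello  yarn"));
        // prints {hello=2, world=1, yarn=1}
    }
}
```

Note the double space before "yarn": with `split(" ")` it would have produced an empty-string "word" with a count of one.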

A few of these settings are key:

  • yarn.resourcemanager.hostname: the hostname of the YARN ResourceManager, normally the value you configured in yarn-site.xml
  • mapreduce.app-submission.cross-platform: enables cross-platform submission (e.g. from a Windows client to a Linux cluster); set it to true
  • mapred.jar: the jar built from this project. A jar must be built before the job can be submitted remotely, and a stale or incorrect jar will make the job fail. In this example the main class of the jar is WordCount.

依賴

The program needs the basic Hadoop and YARN dependencies. I am not sure exactly which ones are strictly required, so I added every dependency that might be needed; some are probably unnecessary, and you can trim them yourself.

<dependencies>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.7</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.7</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
            <version>2.7.7</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-client</artifactId>
            <version>2.7.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-api -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-api</artifactId>
            <version>2.7.7</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>2.7.7</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.7</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-jobclient -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.7</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.7</version>
        </dependency>
    </dependencies>

Configuring the XML files

For remote submission to work, yarn-site.xml and mapred-site.xml on the cluster must be configured properly.

Relevant settings in mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Address of the job history service -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <!-- Web UI address of the job history service -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>

Relevant settings in yarn-site.xml:

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Hostname of the YARN ResourceManager; this one is mandatory -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <!-- Disable the physical/virtual memory checks so containers are not killed -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Aggregate the logs produced while jobs run on YARN -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- URL for viewing the aggregated history logs -->
    <property>
        <name>yarn.log.server.url</name>
        <value>master:19888/jobhistory</value>
    </property>

Hadoop's job history service is not started with the rest of the cluster; if you need it, start it separately:

${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh start historyserver

Viewing running jobs

The jobs running on YARN and their status can be viewed at master:8088.

(Screenshot: YARN monitoring UI)

To view job history, the history server must be running; start it with the command shown above.

