Connecting IDEA to a Local VM Hadoop Cluster to Run WordCount


For setting up the Hadoop cluster in virtual machines, see:

 https://www.cnblogs.com/HusterX/p/14125543.html

Environment:

1. Hadoop 2.7.0

2. Java 1.8.0

3. Windows 10

4. VMware Workstation Pro 16

5. CentOS 7

Installing Hadoop on Windows:

1. Unpack the hadoop.tar.gz archive.

2. Add "<Hadoop install path>"\bin to the PATH environment variable.

3. Copy hadoop.dll into C:\Windows\System32 and winutils.exe into "<Hadoop install path>"\bin.

    PS: Download the hadoop.dll and winutils.exe that match your Hadoop version (if there is no exact match, prefer files built for a version newer than yours); they can be found on GitHub. If IDEA still reports that winutils cannot be found, see the sketch below.
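If IDEA was started before the PATH change became visible, Hadoop may still complain that it cannot locate winutils.exe. A minimal workaround sketch (class name chosen here for illustration, and D:\hadoop-2.7.0 is a hypothetical path -- adjust it to wherever the archive was unpacked) is to point Hadoop at the install directory from code before any Hadoop class is used:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WindowsEnvCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical install path -- replace with your own unpacked Hadoop directory.
        // Setting hadoop.home.dir in code is an alternative to the HADOOP_HOME/PATH setup above.
        System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.0");

        // If winutils.exe and hadoop.dll are wired up correctly, creating a FileSystem
        // should no longer log the "Failed to locate the winutils binary" warning.
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println("default file system: " + fs.getUri());
        fs.close();
    }
}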

 

Edit the hosts file on Windows

Path: C:\Windows\System32\drivers\etc\hosts

# Copyright (c) 1993-2009 Microsoft Corp.
#
# This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to host names. Each
# entry should be kept on an individual line. The IP address should
# be placed in the first column followed by the corresponding host name.
# The IP address and the host name should be separated by at least one
# space.
#
# Additionally, comments (such as these) may be inserted on individual
# lines or following the machine name denoted by a '#' symbol.
#
# For example:
#
#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host

# localhost name resolution is handled within DNS itself.
#    127.0.0.1       localhost
#    ::1             localhost
127.0.0.1       activate.navicat.com
# The three entries below are the IP addresses and hostnames of the virtual machines
192.168.47.131  master
192.168.47.132  slave1
192.168.47.130  slave2
hosts
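A quick way to confirm the new entries are picked up is to resolve the three hostnames from Java; each one should print the VM address listed above. A small sketch (class name chosen here for illustration):

import java.net.InetAddress;

public class HostsCheck {

    public static void main(String[] args) throws Exception {
        // The three cluster hostnames mapped in the hosts file above.
        for (String host : new String[]{"master", "slave1", "slave2"}) {
            // Windows consults C:\Windows\System32\drivers\etc\hosts before DNS,
            // so each name should resolve to the VM address configured above.
            System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
        }
    }
}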

 

Create a new Maven project in IDEA, with the following pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hadoop</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <hadoop.version>2.7.0</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

    </dependencies>
</project>
pom.xml

Copy core-site.xml and hdfs-site.xml from the cluster into the project's resources directory (src/main/resources).

PS: both files below contain site-specific changes; keep the copies on Windows identical to the files in the virtual machines, or errors will follow.
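Before writing any job code it is worth confirming the copied files actually end up on the classpath. A minimal sketch, assuming the files sit in src/main/resources (class name chosen here for illustration): a plain new Configuration() reads core-site.xml from the classpath, so it should report the fs.defaultFS value configured below.

import org.apache.hadoop.conf.Configuration;

public class ConfCheck {

    public static void main(String[] args) {
        // new Configuration() reads core-default.xml and core-site.xml from the
        // classpath, i.e. from src/main/resources once the project is compiled.
        // (hdfs-site.xml is registered later, when the HDFS client classes load.)
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}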

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <!-- It is recommended to use the master's IP address as this value; it is also the path prefix of the file system the program accesses at run time. -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.47.131:9000</value>
    </property>

    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>

</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Also set this to the master's IP address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.47.131:50090</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>192.168.47.131:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/hdfs/name</value>
    </property>
   <!-- Disable permission checking. Running from Windows the author ran into permission checks, and this is the convenient way around them; there are other approaches as well (one alternative is sketched after this file). -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>

</configuration>
hdfs-site.xml
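As an alternative to setting dfs.permissions.enabled to false, the client can simply present itself to HDFS as the cluster user. A small sketch of that approach, assuming the cluster runs as root (class name chosen here for illustration): the HADOOP_USER_NAME property, also honoured as an environment variable, is read by the Hadoop client when it determines the current user, so it has to be set before any Hadoop class logs in.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SubmitAsRoot {

    public static void main(String[] args) throws Exception {
        // Act as the cluster user instead of the local Windows account, so HDFS
        // permission checks pass without being disabled cluster-wide.
        System.setProperty("HADOOP_USER_NAME", "root");

        FileSystem fs = FileSystem.get(new Configuration());
        // The home directory is derived from the effective user, so this should
        // print hdfs://192.168.47.131:9000/user/root if the property took effect.
        System.out.println(fs.getHomeDirectory());
        fs.close();
    }
}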

For reference, here are the configuration files from the author's cluster:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.47.131:9000</value>
    </property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/hdfs/data</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.47.131</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>192.168.47.131:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>192.168.47.131:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>192.168.47.131:8031</value>
    </property>
</configuration>
yarn-site.xml

 

Running WordCount

1. Test code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;


public class HdfsTest {

    public static void main(String[] args) {
        // Quick default Log4j setup.
        BasicConfigurator.configure();
        try {
            String filename = "hdfs://192.168.47.131:9000/words.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path(filename))){
                System.out.println("the file exists");
            }else{
                System.out.println("the file does not exist");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
HdfsTest.java
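The test above (and the WordCount job below) expects a words.txt file at the root of HDFS. It can be uploaded either with hadoop fs -put words.txt / on the master, or from Windows through the FileSystem API. A small sketch, where D:\data\words.txt is a hypothetical local sample file and the class name is only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWords {

    public static void main(String[] args) throws Exception {
        // Uses fs.defaultFS from the core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical local file -- any small text file with a few words will do.
        Path local = new Path("D:\\data\\words.txt");
        Path remote = new Path("/words.txt");
        fs.copyFromLocalFile(local, remote);
        System.out.println("uploaded: " + fs.exists(remote));
        fs.close();
    }
}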

2. WordCount example code

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.log4j.BasicConfigurator;

/**
 * Word-count MapReduce job.
 */
public class  WordCount {

    /**
     * Mapper class.
     */
    public static class WordCountMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        /**
         * The map method reads the input, emits each word as a key with the
         * value 1, and this (word, 1) pair becomes the map output and thus
         * the reduce input.
         */
        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            /**
             * StringTokenizer (java.util.StringTokenizer) splits a string into
             * tokens -- a surprisingly handy utility class:
             * 1. StringTokenizer(String str):
             *    parses str with the default delimiters: space, tab ('\t'),
             *    newline ('\n') and carriage return ('\r').
             * 2. StringTokenizer(String str, String delim):
             *    parses str using the given delimiter characters.
             * 3. StringTokenizer(String str, String delim, boolean returnDelims):
             *    parses str using the given delimiter characters and optionally
             *    returns the delimiters as tokens as well.
             */
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    /**
     * The reduce input is the map output: for each key (word) the values are
     * summed to get that word's total count, and the (word, count) pair is
     * written to the configured output file.
     */
    public static class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            output.collect(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Quick default Log4j setup.
        BasicConfigurator.configure();
        // Input path -- change this to point at your own Hadoop master.
        String input = "hdfs://192.168.47.131:9000/words.txt";
        /**
         * The output path is the /out directory under the HDFS root.
         * Note: this directory must not already exist, or the job fails.
         */
        String output = "hdfs://192.168.47.131:9000/out";

        JobConf conf = new JobConf(WordCount.class);
        // Set the submitting user.
        conf.setUser("root");
        /**
         * The map-reduce tasks need our custom map and reduce classes, so the
         * project must be exported as a jar; point this at the exported jar file.
         */
        conf.setJar("D:\\ejar\\hadoop.jar");
        // Set the job name.
        conf.setJobName("wordcount");
        /**
         * Declare cross-platform submission (submitting from Windows to a Linux cluster).
         */
        conf.set("mapreduce.app-submission.cross-platform","true");
        // Important: lets Hadoop locate the job jar from this class.
        conf.setJarByClass(WordCount.class);
        // Output key type: the word (Text).
        conf.setOutputKeyClass(Text.class);
        // Output value type: the word count (IntWritable).
        conf.setOutputValueClass(IntWritable.class);
        // Set the Mapper class.
        conf.setMapperClass(WordCountMapper.class);
        /**
         * Set the combiner. The combiner's output becomes the Reducer's input,
         * which improves performance by reducing the amount of data transferred
         * between map and reduce. It must not be used blindly, though: it is only
         * safe when it does not change the result. For word counting, combining
         * partial sums does not affect the final counts, so the Reducer can be
         * reused as the combiner; for a job that, say, averages values, a combiner
         * like this would change the result and must not be used.
         */
        conf.setCombinerClass(WordCountReducer.class);
        // Set the Reducer class.
        conf.setReducerClass(WordCountReducer.class);
        /**
         * Set the input format. TextInputFormat is the default, so this line is
         * optional. It produces keys of type LongWritable (the byte offset of
         * each line within the file) and values of type Text (the line itself).
         */
        conf.setInputFormat(TextInputFormat.class);
        /**
         * Set the output format. TextOutputFormat is the default: each record is
         * written as a line of text, keys and values may be of any type and are
         * converted with toString(), and key and value are separated by a tab.
         */
        conf.setOutputFormat(TextOutputFormat.class);
        // Set the input path.
        FileInputFormat.setInputPaths(conf, new Path(input));
        // Set the output path (must not already exist, or the job throws an exception).
        FileOutputFormat.setOutputPath(conf, new Path(output));
        // Run the MapReduce job.
        JobClient.runJob(conf);
        System.exit(0);
    }

}
WordCount.java

3. Export the project as a jar

   See: https://www.cnblogs.com/ffaiss/p/10908483.html

4. Run WordCount

5. On the Hadoop cluster, run the following commands to check the result

List the files under /
hadoop fs -ls /

List the contents of the /out directory.
The author's WordCount program writes its output to /out, which is why that directory is checked here; adjust it to your own output path.
hadoop fs -ls /out

A successful run normally produces the following two files:
[root@master ~]# hadoop fs -ls /out
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-12-19 22:45 /out/_SUCCESS
-rw-r--r--   2 root supergroup         85 2020-12-19 22:45 /out/part-00000

View the contents of part-00000 (i.e. the program's result):
[root@master ~]# hadoop fs -cat /out/part-00000
CAJViewer    1
a    1
free    1
function    1
is    3
it    1
main    1
paper.And    1
software.Its    1
view    1
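The same result can also be read back from the Windows side through the FileSystem API instead of logging into the master. A small sketch (class name chosen here for illustration) that prints each "word<TAB>count" line of the output file:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadResult {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Same output path the job wrote to; each line is "word<TAB>count".
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/out/part-00000")), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}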