Hadoop In Depth (04-1) - Setting up a Windows 10 local development environment based on Hadoop 3.1.3
Environment preparation
- Install a JDK
- Install IDEA
- Configure Maven
- A working Hadoop cluster
Configure Hadoop
- Extract Hadoop
Extract the Hadoop archive hadoop-3.1.3.tar.gz to any local directory.
- Copy the Windows dependencies to a local directory
About Hadoop's Windows dependencies
Hadoop is written primarily for Linux; to run it on Windows you need winutils.exe, hadoop.dll and related files, which emulate the Linux directory environment. If these two files are missing, debugging MR programs locally fails with errors such as:
Missing winutils.exe:
Could not locate executable null \bin\winutils.exe in the hadoop binaries
Missing hadoop.dll:
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
The Windows dependency files are not provided officially and have to be downloaded separately, for example from GitHub (not every version is available): https://github.com/4ttty/winutils
- Configure the environment variables
Add HADOOP_HOME and edit the value of Path.
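For example, assuming Hadoop was extracted to D:\hadoop-3.1.3 (a hypothetical path; use your own extraction directory), the two entries are:
HADOOP_HOME = D:\hadoop-3.1.3
Path = <existing entries>;%HADOOP_HOME%\bin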
- Check the Hadoop version
Confirm that the Windows Hadoop environment variables are configured correctly by checking the version:
C:\Users\Administrator> hadoop version
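If the environment variables are configured correctly, the first line of the output reports the version; the remaining lines list build details:
Hadoop 3.1.3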
Create the project
- Create a Maven project named HadoopDemo
- Add the dependency coordinates
Add the dependency coordinates to pom.xml.
A Hadoop development environment only needs hadoop-client; its transitive dependencies already pull in the client, common, hdfs, mapreduce and yarn modules.
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.12.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
- Add logging
In the project's src/main/resources directory, create a new file named "log4j2.xml" with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="error" strict="true" name="XMLConfig">
<Appenders>
<!-- Appender type is Console; the name attribute is required -->
<Appender type="Console" name="STDOUT">
<!-- PatternLayout, producing output like [INFO] [2018-01-22 17:34:01][org.test.Console]I'm here -->
<Layout type="PatternLayout"
pattern="[%p] [%d{yyyy-MM-dd HH:mm:ss}][%c{10}]%m%n" />
</Appender>
</Appenders>
<Loggers>
<!-- additivity set to false -->
<Logger name="test" level="info" additivity="false">
<AppenderRef ref="STDOUT" />
</Logger>
<!-- root logger configuration -->
<Root level="info">
<AppenderRef ref="STDOUT" />
</Root>
</Loggers>
</Configuration>
Local HDFS test
- Requirement
Create the directory /1128/daxian/banzhang in HDFS.
- Create the package com.zhangjk.hdfs
- Create the HdfsClient class and write the code
package com.zhangjk.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * @author : 張京坤
 * mail: zhangjingkun88@126.com
 * date: 2021/11/28
 * project name: HdfsClientDemo
 * package name: com.zhangjk.hdfs
 * content:
 * @version : 1.0
 */
public class HdfsClient {

    @Test
    public void testMkdirs() throws IOException, InterruptedException, URISyntaxException {
        // 1. Get the file system
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9820"), configuration, "hadoop");

        // 2. Create the directory
        fs.mkdirs(new Path("/1128/daxian/banzhang"));

        // 3. Close the resource
        fs.close();
    }
}
- Configure the user name
When a client operates on HDFS it does so under a user identity. By default the HDFS client API takes a JVM parameter as its user identity: set -DHADOOP_USER_NAME=hadoop in VM options, where hadoop is the user name. (The test above also passes the user explicitly as the third argument of FileSystem.get.)
- Run the program
Run the program and check the result.
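To verify that the directory was actually created, it can be listed from a cluster node (or checked in the NameNode web UI); a minimal check, assuming the same cluster naming as above:
[hadoop@hadoop102 ~]$ hadoop fs -ls /1128/daxian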
Local test of the MR program WordCount
- Requirement
Count the total number of occurrences of each word in the given text file hello.txt.
Contents of hello.txt:
hadoop hadoop
ss ss
cls cls
jiao
banzhang
xue
hadoop
Expected output:
banzhang 1
cls 2
hadoop 3
jiao 1
ss 2
xue 1
- Requirement analysis
Following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
Input data
hadoop hadoop
ss ss
cls cls
jiao
banzhang
xue
hadoop
Output data
banzhang 1
cls 2
hadoop 3
jiao 1
ss 2
xue 1
Mapper stage
1 Convert the text passed in by the MapTask to a String
hadoop hadoop
2 Split the line into words on spaces
hadoop
hadoop
3 Emit each word as <word, 1>
hadoop, 1
hadoop, 1
Reducer stage
1 Sum the counts for each key
hadoop, 1
hadoop, 1
2 Emit the total count for that key
hadoop, 2
Driver stage
1 Get the configuration and a Job instance
2 Specify the local path of this program's jar
3 Associate the Mapper/Reducer business classes
4 Specify the kv types of the Mapper output
5 Specify the kv types of the final output
6 Specify the directory of the job's input files
7 Specify the directory of the job's output
8 Submit the job
- Create the package com.zhangjk.mapreduce
Create the WordcountMapper, WordcountReducer and WordcountDriver classes and write the code.
Mapper
package com.zhangjk.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author : 張京坤
 * mail: zhangjingkun88@126.com
 * date: 2021/12/2
 * project name: HdfsClientDemo
 * package name: com.zhangjk.mapreduce
 * content:
 * @version : 1.0
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Create the kv pair variables, reused for every output record
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line
        String line = value.toString();
        // Split it into words on spaces
        String[] words = line.split(" ");
        // Emit <word, 1> for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
Reducer

package com.zhangjk.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author : 張京坤
 * mail: zhangjingkun88@126.com
 * date: 2021/12/2
 * project name: HdfsClientDemo
 * package name: com.zhangjk.mapreduce
 * content:
 * @version : 1.0
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Accumulate the counts
        sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the total
        v.set(sum);
        context.write(key, v);
    }
}
Driver class

package com.zhangjk.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author : 張京坤
 * mail: zhangjingkun88@126.com
 * date: 2021/12/2
 * project name: HdfsClientDemo
 * package name: com.zhangjk.mapreduce
 * content:
 * @version : 1.0
 */
public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Associate this Driver's jar
        job.setJarByClass(WordcountDriver.class);
        // 3. Associate the Mapper and Reducer
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        // 4. Set the kv types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the kv types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result);
    }
}
- Run the test
Configure the args parameters
In the run configuration's Program arguments, set the input and output paths, separated by a space, for example as shown below.
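A hypothetical example (adjust the paths to wherever hello.txt is stored locally and to where the output should go; note that the output directory must not exist yet, otherwise the job fails):
d:/input/hello.txt d:/output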
Fixing the error:
Starting the WordcountDriver class fails with the following error:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
Full error log:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:640)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1223)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:160)
    at org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:100)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:77)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:315)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:378)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:152)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:133)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:117)
    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:124)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:172)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:788)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:251)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
    at com.zhangjk.mapreduce.WordcountDriver.main(WordcountDriver.java:47)
Cause of the error:
On this Windows setup the native Windows I/O support (the access0 method implemented in hadoop.dll) cannot be loaded or is incompatible, so the local NativeIO access check fails.
Workaround: add your own NativeIO class to the project so that it shadows the one shipped with Hadoop.
Steps:
- In the project's java directory, create a new package org.apache.hadoop.io.nativeio with a NativeIO class in it
- Press Shift twice and search for the NativeIO class
- Select the org.apache.hadoop.io.nativeio.NativeIO class from the hadoop-common jar to open its source file; if the sources have not been downloaded yet, click Download Sources
- In the source org.apache.hadoop.io.nativeio.NativeIO class, press Ctrl+A to select all and Ctrl+C to copy all the code
- Paste the copied code over the NativeIO class created in step 1 (Ctrl+A to select all, Ctrl+V to paste)
- Press Ctrl+F and search for return access0
- Change that line to return true; as sketched below
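After the edit, the access method of the inner Windows class simply bypasses the native check. A sketch of what the modified method looks like in the copied class, assuming the Hadoop 3.1.3 sources (everything else stays exactly as copied):

public static boolean access(String path, AccessRight desiredAccess)
        throws IOException {
    // Originally: return access0(path, desiredAccess.accessRight());
    // Returning true skips the native Windows permission check so local MR runs can proceed.
    return true;
}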
Run it again and check the result
Start the WordcountDriver class again: the error is gone and the logs print normally. Open the output directory to inspect the result.
Testing the MR program WordCount on the cluster
- Add the packaging plugin dependencies needed for building the jar with Maven to pom.xml
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
Note: if the project shows a red cross, right-click the project -> Maven -> Reimport.
- Package the program into a jar
Package the project with the Maven plugin.
When the build finishes, the jars appear in the project's target folder.
If they are not visible, right-click the project -> Refresh.
Among them:
HdfsClientDemo-1.0-SNAPSHOT.jar is the jar without dependencies
HdfsClientDemo-1.0-SNAPSHOT-jar-with-dependencies.jar is the jar with dependencies bundled
Rename the jar without dependencies to wc.jar and copy it to the Hadoop cluster.
The Hadoop cluster already provides the dependencies needed to run MR programs, so use the jar without dependencies when running on the cluster.
Upload hello.txt to HDFS
[hadoop@hadoop102 ~]$ hadoop fs -put /home/hadoop/hello.txt /user/hadoop
Submit the job
[hadoop@hadoop102 ~]$ hadoop jar wc.jar com.zhangjk.mapreduce.WordcountDriver /user/hadoop/hello.txt /user/hadoop/output
Watch the running job on the YARN web UI
Check the result in HDFS
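The output directory contains a _SUCCESS marker plus one part file per reducer; with the default single reducer the result can be printed directly (the part file name below follows the default naming scheme):
[hadoop@hadoop102 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000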
Submitting a job to the cluster from Windows
Copy the WordcountDriver class into the same package and name the copy WordcountDriverWin (the original WordcountDriver class can also be modified directly).
- Add the necessary configuration: the four configuration.set calls in the code below, and replace job.setJarByClass with job.setJar
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    // 1. Get the configuration and the job object
    Configuration configuration = new Configuration();

    // Set the address of the HDFS NameNode
    configuration.set("fs.defaultFS", "hdfs://hadoop102:9820");
    // Run MapReduce on YARN
    configuration.set("mapreduce.framework.name", "yarn");
    // Allow MapReduce jobs to be submitted to a remote cluster from another platform
    configuration.set("mapreduce.app-submission.cross-platform", "true");
    // Location of the YARN ResourceManager
    configuration.set("yarn.resourcemanager.hostname", "hadoop102");

    Job job = Job.getInstance(configuration);
    // 2. Associate this Driver's jar
    // job.setJarByClass(WordcountDriverWin.class);
    job.setJar("D:\\projects\\code02\\HdfsClientDemo\\target\\HdfsClientDemo-1.0-SNAPSHOT.jar");
    // 3. Associate the Mapper and Reducer
    job.setMapperClass(WordcountMapper.class);
    job.setReducerClass(WordcountReducer.class);
    // 4. Set the kv types of the Mapper output
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // 5. Set the kv types of the final output
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // 6. Set the input and output paths
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // 7. Submit the job
    boolean result = job.waitForCompletion(true);
    System.out.println(result);
}
- Edit the run configuration
Check that the first field, Main class, is the fully qualified name of the class to run; change it if not.
Append -DHADOOP_USER_NAME=hadoop to VM options.
Add two Program arguments for the input and output paths, separated by a space, e.g. hdfs://hadoop102:9820/user/hadoop/hello.txt hdfs://hadoop102:9820/user/hadoop/output1
- Repackage and point the Driver at the jar
Maven Projects --> Lifecycle --> install, then make sure the path passed to job.setJar(...) in the code above points at the freshly built jar in the target directory.
- Submit and check the result
Watch the running job on the YARN web UI
The output can also be viewed in HDFS