1. Create the hadoop user and the hadoopgroup group
groupadd -g 102 hadoopgroup                     # create the group
useradd -d /opt/hadoop -u 10201 -g 102 hadoop   # create the user
passwd hadoop                                   # set the user's password
2. Install an FTP tool
yum -y install vsftpd

systemctl start vsftpd.service      # start ftp
systemctl stop vsftpd.service       # stop ftp
systemctl restart vsftpd.service    # restart ftp

systemctl start vsftpd.service      # start it; no output means success
ps -ef | grep vsft                  # the process exists, so connect directly with an FTP client
root       1257      1  0 09:41 ?        00:00:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
root       1266   1125  0 09:42 pts/0    00:00:00 grep --color=auto vsft
systemctl restart vsftpd.service
3. Install the JDK and Hadoop
- Copy the downloaded JDK and Hadoop archives to the server, extract them, and rename the directories.
- The directories are renamed only to keep the paths short and easy to type (see the sketch below).
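A minimal sketch of this step, assuming the downloaded archives have been copied to /opt/hadoop; the archive and extracted directory names are examples and should match the versions actually downloaded, while the target names jdk1.8 and hadoop3 are the ones used throughout the rest of this guide:

cd /opt/hadoop
tar -zxvf jdk-8u231-linux-x64.tar.gz   # example archive name
tar -zxvf hadoop-3.1.3.tar.gz          # example archive name
mv jdk1.8.0_231 jdk1.8                 # short name referenced by JAVA_HOME below
mv hadoop-3.1.3 hadoop3                # short name referenced by HADOOP_HOME below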

4. Configure the Java and Hadoop environment variables
Append the Java and Hadoop environment variables at the end of .bashrc; just make sure the paths are written correctly.
vim .bashrc
more .bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# User specific aliases and functions

# jdk
export JAVA_HOME=/opt/hadoop/jdk1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASS_PATH=${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

# hadoop
export HADOOP_HOME=/opt/hadoop/hadoop3
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
5. Switch to the root user and edit /etc/hosts on every machine
vim /etc/hosts
more /etc/hosts
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.80.5 venn05
192.168.80.6 venn06
192.168.80.7 venn07
The other machines get the same change.
6. Create SSH keys
mkdir .ssh              # create the .ssh directory
cd .ssh/
ls
pwd
/opt/hadoop/.ssh
ssh-keygen -t rsa -P '' # generate the ssh key pair; just press Enter through the prompts
Run the steps above on every machine so that each one has its own SSH key pair.
Edit /etc/ssh/sshd_config:
The file is the stock CentOS 7 sshd_config; the lines that matter for this setup are the uncommented directives below, in particular AuthorizedKeysFile and PasswordAuthentication (everything that stays commented keeps its shipped default):

HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
SyslogFacility AUTHPRIV
AuthorizedKeysFile .ssh/authorized_keys
PasswordAuthentication yes
ChallengeResponseAuthentication no
GSSAPIAuthentication yes
GSSAPICleanupCredentials no
UsePAM yes
X11Forwarding yes
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL LANGUAGE
AcceptEnv XMODIFIERS
Subsystem sftp /usr/libexec/openssh/sftp-server
Restart the service: systemctl restart sshd
7. Merge the public keys and distribute the result to every machine
venn05:
    cat id_rsa.pub >> authorized_keys                          # append venn05's public key
    scp authorized_keys hadoop@venn06:~/.ssh/authorized_keys   # copy to venn06
venn06:
    cat id_rsa.pub >> authorized_keys                          # append venn06's public key
    scp authorized_keys hadoop@venn07:~/.ssh/authorized_keys   # copy to venn07
venn07:
    cat id_rsa.pub >> authorized_keys                          # append venn07's public key
    scp authorized_keys hadoop@venn05:~/.ssh/authorized_keys   # copy back to venn05
    scp authorized_keys hadoop@venn06:~/.ssh/authorized_keys   # copy back to venn06
With more machines, continue the same pattern.
At this point the configuration is done: the hadoop user on each machine can now log in to the others without a password.
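A quick optional check of the passwordless login, run from venn05 (the hostnames are the ones defined in /etc/hosts above); each command should print the remote hostname without asking for a password:

ssh hadoop@venn06 hostname
ssh hadoop@venn07 hostname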
8. Edit the Hadoop environment script: hadoop-env.sh
Go to /opt/hadoop/hadoop3/etc/hadoop, open hadoop-env.sh, and set:
export JAVA_HOME=/opt/hadoop/jdk1.8   # the JDK Hadoop will run with
9. Edit the Hadoop core configuration file: core-site.xml
Add the following:
<configuration>
    <!-- HDFS temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/hadoop3/tmp</value>
    </property>
    <!-- default HDFS address and port -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://venn05:8020</value>
    </property>
</configuration>
10. Edit yarn-site.xml and add the following
<configuration>
    <!-- Site specific YARN configuration properties -->
    <!-- cluster master -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>venn05</value>
    </property>
    <!-- auxiliary service that runs on the NodeManagers -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- environment variables containers may inherit instead of using the NodeManager defaults -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ</value>
    </property>
    <!-- disable virtual-memory checking; needed on virtual machines, otherwise jobs fail -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
11. Edit mapred-site.xml and add the following
<configuration>
    <!-- local = run locally, classic = the classic MapReduce framework, yarn = the new framework -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Must keep the original value if map/reduce tasks use native libraries (compression, etc.).
         When empty, the command that sets the execution environment depends on the OS:
         Linux:   LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native
         Windows: PATH=%PATH%;%HADOOP_COMMON_HOME%\bin -->
    <property>
        <name>mapreduce.admin.user.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop3</value>
    </property>
    <!-- environment for the ApplicationMaster; without this, MapReduce jobs may fail -->
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop3</value>
    </property>
</configuration>
12. Edit hdfs-site.xml and add the following
<configuration>
    <!-- HDFS web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>venn05:50070</value>
    </property>
    <!-- replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- whether HDFS permission checking is enabled; false disables it -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <!-- block size, in bytes by default; the suffixes k, m, g, t, p, e are accepted -->
    <property>
        <name>dfs.blocksize</name>
        <!-- 128 MB -->
        <value>134217728</value>
    </property>
</configuration>
13. Edit the workers file
[hadoop@venn05 hadoop]$ more workers
venn05    # the first entry is the master
venn06
venn07
At this point the Hadoop master is fully configured.
14. scp .bashrc, the JDK, and Hadoop to every node
From the hadoop user's home directory:
cd ~
scp -r .bashrc jdk1.8 hadoop3 hadoop@192.168.80.6:/opt/hadoop/
scp -r .bashrc jdk1.8 hadoop3 hadoop@192.168.80.7:/opt/hadoop/
The Hadoop cluster is now set up.
15. Start Hadoop
Format the namespace:
hdfs namenode -format

Start the cluster:
start-all.sh

Output:
WARNING: Attempting to start all Apache Hadoop daemons as hadoop in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [venn05]
Starting datanodes
Starting secondary namenodes [venn05]
Starting resourcemanager
Starting nodemanagers

Problem: permission errors were reported (time to resolve: 2h)

Fix: sudo chmod -R a+w /opt/hadoop
Problem: passwordless access did not work for the non-root user (time to resolve: 3 days)
Fix:
chmod 700 hadoop
chmod 700 hadoop/.ssh
chmod 644 hadoop/.ssh/authorized_keys
chmod 600 hadoop/.ssh/id_rsa
Problem: Cannot write namenode pid /tmp/hadoop-hadoop-namenode.pid.
Fix: sudo chmod -R 777 /tmp

Check the processes with jps:
[hadoop@venn05 ~]$ jps
5904 Jps
5733 NodeManager
4871 NameNode
5431 ResourceManager
5211 SecondaryNameNode
[hadoop@venn05 ~]$

Check the other nodes:
[hadoop@venn06 hadoop]$ jps
3093 NodeManager
3226 Jps
2973 DataNode
Hadoop started successfully.
Check the YARN web console:

Problem: the HDFS web UI could not be reached (time to resolve: 4h)
Fix: before Hadoop 3.0 the web UI port was 50070; from Hadoop 3.0 on it is 9870. Change the port in hdfs-site.xml to 9870, then rerun hdfs namenode -format and start-all.sh. If the page is still unreachable, disable the Linux firewall.
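For reference, a sketch of that change and restart; the property is the dfs.namenode.http-address entry added to hdfs-site.xml earlier, and reformatting wipes any existing HDFS data:

# in /opt/hadoop/hadoop3/etc/hadoop/hdfs-site.xml, change the dfs.namenode.http-address value:
#   <value>venn05:50070</value>  ->  <value>venn05:9870</value>
stop-all.sh               # stop the cluster before reformatting
hdfs namenode -format     # reformat the namespace (destroys existing HDFS data)
start-all.sh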

16. Hive installation
1) Download the Hive package
wget http://archive.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz
2) Extract it into the hadoop home directory
tar -zxvf apache-hive-2.3.3-bin.tar.gz    # extract
mv apache-hive-2.3.3-bin hive2.3.3        # rename the directory for convenience
3) Configure the Hive environment variables
[hadoop@venn05 ~]$ more .bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# User specific aliases and functions

# jdk
export JAVA_HOME=/opt/hadoop/jdk1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASS_PATH=${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

# hadoop
export HADOOP_HOME=/opt/hadoop/hadoop3
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

# hive
export HIVE_HOME=/opt/hadoop/hive2.3.3
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
4) Edit hive-env.sh
[hadoop@venn05 ~]$ cd hive2.3.3/conf
[hadoop@venn05 conf]$ cp hive-env.sh.template hive-env.sh
[hadoop@venn05 conf]$ vim hive-env.sh

# HADOOP_HOME=${bin}/../../hadoop      uncomment and change to:
HADOOP_HOME=/opt/hadoop/hadoop3
# export HIVE_CONF_DIR=                uncomment and change to:
HIVE_CONF_DIR=/opt/hadoop/hive2.3.3/conf
5) Edit hive-log4j2.properties
[hadoop@venn05 conf]$ mv hive-log4j2.properties.template hive-log4j2.properties
[hadoop@venn05 conf]$ vim hive-log4j2.properties

Find:
    property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name}
and change it to:
    property.hive.log.dir = /opt/hadoop/hive2.3.3/logs
6) To be safe, hive-site.xml can be edited as well
[hadoop@venn05 conf]$ cp hive-default.xml.template hive-site.xml
[hadoop@venn05 conf]$ vim hive-site.xml

Change 1: replace every occurrence of "${system:java.io.tmpdir}" in hive-site.xml with the concrete directory /opt/hadoop/hive2.3.3/tmp (4 occurrences).
Change 2: replace every occurrence of "${system:user.name}" with the concrete value root (3 occurrences).

<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/opt/hadoop/hive2.3.3/tmp/root</value>
    <description>Local scratch space for Hive jobs</description>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/opt/hadoop/hive2.3.3/tmp/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/opt/hadoop/hive2.3.3/tmp/root</value>
    <description>Location of Hive run time structured log file</description>
</property>
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/opt/hadoop/hive2.3.3/tmp/root/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
7) Create the Hive directories on HDFS
hadoop fs -mkdir -p /user/hive/warehouse    # hive warehouse location
hadoop fs -mkdir -p /tmp/hive/              # hive scratch directory
# grant permissions, otherwise hive reports errors:
hadoop fs -chmod -R 777 /user/hive/warehouse
hadoop fs -chmod -R 777 /tmp/hive
Then add the corresponding settings to hive-site.xml:
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>
<property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
8) Configure the metastore database
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';    # create the hive user
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL ON *.* TO 'hive'@'%';                 # grant privileges
Query OK, 0 rows affected (0.00 sec)
mysql> FLUSH PRIVILEGES;
mysql> quit;
9) Configure the metastore connection:
vim hive-site.xml
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<!-- connection URL -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://venn05:3306/hive?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
</property>
<!-- username -->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
</property>
<!-- password -->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
</property>
10) Edit hive-env.sh
export HADOOP_HOME=/opt/hadoop/hadoop3
export HIVE_CONF_DIR=/opt/hadoop/hive2.3.3/conf
export HIVE_AUX_JARS_PATH=/opt/hadoop/hive2.3.3/lib
11) Upload the MySQL JDBC driver
Download address: https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.19.zip

Upload it to hive2.3.3/lib.
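A minimal sketch of that step, assuming the zip has been copied to the server; the jar name and the unpacked directory layout inside the archive are assumptions based on the download link above:

unzip mysql-connector-java-8.0.19.zip
cp mysql-connector-java-8.0.19/mysql-connector-java-8.0.19.jar /opt/hadoop/hive2.3.3/lib/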
12) Initialize Hive
schematool -initSchema -dbType mysql
Problem (time to resolve: 3 days):

Fix: the error message is a java.lang.NoSuchMethodError for com.google.common.base.Preconditions.checkArgument.
Possible causes:
1. the relevant jar cannot be found on the classpath;
2. two different versions of the same jar are present, and the JVM cannot decide which one to use.
A quick search (via Baidu) shows that this class comes from guava.jar.
Check the version of this jar shipped with Hadoop and with Hive:
Hadoop (path: /opt/hadoop/hadoop3/share/hadoop/common/lib) ships guava-27.0-jre.jar.
Hive (path: /opt/hadoop/hive2.3.3/lib) ships guava-14.0.1.jar.
Delete the older guava-14.0.1.jar from Hive and copy Hadoop's guava-27.0-jre.jar into Hive's lib directory.
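A short sketch of that swap, using the paths listed above:

rm /opt/hadoop/hive2.3.3/lib/guava-14.0.1.jar
cp /opt/hadoop/hadoop3/share/hadoop/common/lib/guava-27.0-jre.jar /opt/hadoop/hive2.3.3/lib/
schematool -initSchema -dbType mysql    # rerun the initialization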

Success!!!

Processing the weather data
1 Runtime environment
1.1 Hardware and software
- Host OS: 64-bit Windows, dual-core / 4 threads, 2.2 GHz, 8 GB RAM
- Virtualization software: VMware® Workstation 15
- Guest OS: 64-bit CentOS, single core, 1 GB RAM
- JDK: 1.8
- Hadoop: 3.1.3
1.2 Network layout
The cluster consists of three nodes, one namenode and two datanodes, all of which can ping each other. The IP addresses and hostnames are:
| No. | IP address | Hostname | Role | User |
| --- | ---------- | -------- | ---- | ---- |
| 1 | 192.168.80.5 | venn05 | namenode | hadoop |
| 2 | 192.168.80.6 | venn06 | datanode | hadoop |
| 3 | 192.168.80.7 | venn07 | datanode | hadoop |
All nodes run 64-bit CentOS 7 with the firewall disabled, and each node has a hadoop user whose home directory is /opt/hadoop.
2 Task description
Find the daily maximum temperature.
2.1 Download the dataset
Since the dataset provided by the instructor could not be downloaded, a similar weather dataset (NCDC) was used instead: ftp://ftp.ncdc.noaa.gov/pub/data/noaa.
wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2017/5*
2.2 Decompress the dataset and concatenate it into a single text file
zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2017/5*.gz > data.txt

Consulting the station information document of the 1951–2007 China surface climate daily dataset (《1951—2007年中國地面氣候資料日值數據集台站信息》), the record format can be read as follows:
1-4     0169
5-10    501360      # USAF weather station identifier
11-15   99999       # WBAN weather station identifier
16-23   20170101    # observation date
24-27   0000        # observation time
28      4
29-34   +52130      # latitude (degrees x 1000)
35-41   +122520     # longitude (degrees x 1000)
42-46   FM-12
47-51   +0433       # elevation (m)
52-56   99999
57-60   V020
61-63   220         # wind direction
64      1           # quality code
65      N
66-69   0010
70      1
71-75   02600       # cloud ceiling height (m)
76      1
77      9
78      9
79-84   003700      # visibility distance (m)
85      1
86      9
87      9
88-92   -0327       # air temperature (degrees Celsius x 10)
93      1
94-98   -0363       # dew point temperature (degrees Celsius x 10)
99      1
100-104 10264       # atmospheric pressure
105     1
2.3 Write the MapReduce program to clean the data
Mapper
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String data = line.substring(15, 21);   // date portion of the record (positions 16-21), used as the key
        int airTemperature;
        if (line.charAt(87) == '+') {           // strip a leading plus sign before parsing
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {  // keep only readings with acceptable quality codes
            context.write(new Text(data), new IntWritable(airTemperature));
        }
    }
}
Reducer
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());   // keep the largest temperature seen for this key
        }
        context.write(key, new IntWritable(maxValue));
    }
}
Driver: MaxTemperature
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class MaxTemperature extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "MaxTemperature.jar");

        Job job = Job.getInstance(conf);
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        // TODO Auto-generated method stub
        return 0;
    }
}
2.4 Compile the Java files and package them into a jar
Adjust the jar versions on the classpath to match your Hadoop version.
[root@venn05 hadoop]# javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-3.1.3.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.1.3.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar *.java
[root@venn05 hadoop]# jar cvf MaxTemperature.jar *.class

2.5 Upload the data to HDFS
[root@venn05 hadoop]# hadoop fs -put data.txt /data.txt

2.6 Run the program
hadoop jar MaxTemperature.jar MaxTemperature /data.txt /out
Problem: java.net.NoRouteToHostException: no route to host (time to resolve: 10 min)


Fix: only venn05's firewall had been disabled; the firewalls on venn06 and venn07 were still running and also had to be turned off.
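On CentOS 7 the firewall is firewalld; the usual commands to turn it off, run on venn06 and venn07 as well, are:

systemctl stop firewalld
systemctl disable firewalld    # keep it off across reboots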
Problem: /bin/java not found (time to resolve: 2h)

Fix: create a symlink: ln -s /opt/hadoop/jdk1.8/bin/java /bin/java
Problem:

Fix: caused by the history server not running; add the following to mapred-site.xml:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>venn05:10020</value>
</property>
Success!!!
2020-06-01 00:01:31,548 INFO client.RMProxy: Connecting to ResourceManager at venn05/192.168.80.5:8032
2020-06-01 00:01:33,190 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2020-06-01 00:01:33,274 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1590940656612_0001
2020-06-01 00:01:33,616 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-01 00:01:35,332 INFO input.FileInputFormat: Total input files to process : 1
2020-06-01 00:01:35,543 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-01 00:01:35,632 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-01 00:01:35,671 INFO mapreduce.JobSubmitter: number of splits:3
2020-06-01 00:01:35,744 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2020-06-01 00:01:36,020 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-01 00:01:36,171 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1590940656612_0001
2020-06-01 00:01:36,171 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-06-01 00:01:36,550 INFO conf.Configuration: resource-types.xml not found
2020-06-01 00:01:36,551 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-06-01 00:01:37,251 INFO impl.YarnClientImpl: Submitted application application_1590940656612_0001
2020-06-01 00:01:37,486 INFO mapreduce.Job: The url to track the job: http://venn05:8088/proxy/application_1590940656612_0001/
2020-06-01 00:01:37,487 INFO mapreduce.Job: Running job: job_1590940656612_0001
2020-06-01 00:02:26,088 INFO mapreduce.Job: Job job_1590940656612_0001 running in uber mode : false
2020-06-01 00:02:26,094 INFO mapreduce.Job: map 0% reduce 0%
2020-06-01 00:03:24,492 INFO mapreduce.Job: map 6% reduce 0%
2020-06-01 00:03:30,433 INFO mapreduce.Job: map 26% reduce 0%
2020-06-01 00:03:31,480 INFO mapreduce.Job: map 31% reduce 0%
2020-06-01 00:03:36,616 INFO mapreduce.Job: map 33% reduce 0%
2020-06-01 00:03:43,030 INFO mapreduce.Job: map 39% reduce 0%
2020-06-01 00:03:58,376 INFO mapreduce.Job: map 100% reduce 0%
2020-06-01 00:04:19,353 INFO mapreduce.Job: map 100% reduce 100%
2020-06-01 00:04:29,534 INFO mapreduce.Job: Job job_1590940656612_0001 completed successfully
2020-06-01 00:04:35,358 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=18428403
        FILE: Number of bytes written=37725951
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=346873253
        HDFS: Number of bytes written=132
        HDFS: Number of read operations=14
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Killed map tasks=1
        Launched map tasks=3
        Launched reduce tasks=1
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=223517
        Total time spent by all reduces in occupied slots (ms)=26522
        Total time spent by all map tasks (ms)=223517
        Total time spent by all reduce tasks (ms)=26522
        Total vcore-milliseconds taken by all map tasks=223517
        Total vcore-milliseconds taken by all reduce tasks=26522
        Total megabyte-milliseconds taken by all map tasks=228881408
        Total megabyte-milliseconds taken by all reduce tasks=27158528
    Map-Reduce Framework
        Map input records=1423111
        Map output records=1417569
        Map output bytes=15593259
        Map output materialized bytes=18428415
        Input split bytes=276
        Combine input records=0
        Combine output records=0
        Reduce input groups=12
        Reduce shuffle bytes=18428415
        Reduce input records=1417569
        Reduce output records=12
        Spilled Records=2835138
        Shuffled Maps =3
        Failed Shuffles=0
        Merged Map outputs=3
        GC time elapsed (ms)=6797
        CPU time spent (ms)=19450
        Physical memory (bytes) snapshot=529707008
        Virtual memory (bytes) snapshot=10921758720
        Total committed heap usage (bytes)=429592576
        Peak Map Physical memory (bytes)=157679616
        Peak Map Virtual memory (bytes)=2728787968
        Peak Reduce Physical memory (bytes)=137711616
        Peak Reduce Virtual memory (bytes)=2735394816
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=346872977
    File Output Format Counters
        Bytes Written=132
2020-06-01 00:04:37,460 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server


2.7 View the results
The temperatures here are ten times the value in degrees Celsius, which is why they look so large. Save the records locally.
hadoop fs -cat /out/part-r-00000
hadoop fs -copyToLocal /out/part-r-00000 result.txt

2.8 Import into Hive
Before importing, convert the temperatures back to normal degrees Celsius (divide by 10).
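A minimal sketch of that conversion, assuming the tab-separated key/temperature file saved above as result.txt; the output name converted.txt is just an example:

# divide the second column (temperature x 10) by 10, keep the key unchanged
awk -F '\t' '{ printf "%s\t%.1f\n", $1, $2 / 10 }' result.txt > converted.txt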
1) Log in to Hive
2) Create the table
3) Load the data
create external table if not exists MaxTemperature (
    tid INT,
    mTemp STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '`'
STORED AS TEXTFILE;

hive> load data local inpath '/opt/hadoop/data.txt' into table MaxTemperature;


Problem: running a select statement failed (time to resolve: 2h)

Fix 1: a permissions problem, so adjust the ownership. Switch to the hadoop user and run the command below; the HDFS superuser is the hadoop user (the account that started HDFS), so the change cannot be made as root.
[hadoop@venn05 ~]$ hadoop fs -ls /tmp
Found 2 items
drwx------   - hadoop supergroup          0 2020-06-01 00:01 /tmp/hadoop-yarn
drwx-wx-wx   - root   supergroup          0 2020-06-01 00:23 /tmp/hive
[hadoop@venn05 ~]$ hadoop fs -chown -R root:root /tmp
Fix 2: the error persisted after the step above. (Unresolved; from what I could find, it is caused by a version mismatch between Hive and Hadoop.)
Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
        at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
        at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
2.9 Import into MySQL
[hadoop@venn05 ~]$ mysql -u root -p
mysql> CREATE DATABASE Temperature;
mysql> use Temperature;
mysql> CREATE TABLE MaxTemperature(
    > tid VARCHAR(20),
    > mTemp VARCHAR(20));
mysql> LOAD DATA LOCAL INFILE '/opt/hadoop/result.txt' INTO TABLE MaxTemperature;
2.10 Present the data in Excel



