I. Word Frequency Count
1. Write mapper.py
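A minimal sketch of a Streaming word-count mapper (the original steps do not show the file's contents; tokenizing on whitespace is an assumption):
#!/usr/bin/env python
# Word-count mapper sketch: emit "word<TAB>1" for every token on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))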
2. Write reduce.py
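A sketch that sums the counts per word, relying on Hadoop Streaming delivering mapper output sorted by key (the same pattern as the weather reducer in Part II):
#!/usr/bin/env python
# Word-count reducer sketch: input arrives sorted by word,
# so all lines for one word are adjacent
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if word == current_word:
        current_count += count
    else:
        if current_word:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count

# flush the last word
if current_word:
    print('%s\t%d' % (current_word, current_count))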
3. Update the environment variables
Reload the configuration so the change takes effect:
source ~/.bashrc
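The variable in question is presumably $STREAM, the path of the Hadoop Streaming jar that run.sh and baiyun.sh reference below. An example of the line to add to ~/.bashrc (the exact jar path depends on your Hadoop version and install location):
# assumed location of the streaming jar in a standard Hadoop install
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar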
4. Download the input file
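The original does not say which file to use; any plain-text file works as word-count input. An illustrative command (the URL is a placeholder):
wget -O wcinput.txt http://example.com/sometext.txt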
5. Upload the file to HDFS
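The directory name wcinput is an assumption; it only has to match the -input path given to the streaming job:
hdfs dfs -mkdir -p wcinput
hdfs dfs -put wcinput.txt wcinput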
6. Write run.sh
gedit run.sh
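The contents of run.sh are not reproduced here. A sketch mirroring the streaming invocation used for the weather job in Part II, with the output directory wcoutput taken from step 8 (the input path and script locations are assumptions):
hadoop jar $STREAM \
-file ./mapper.py \
-mapper ./mapper.py \
-file ./reduce.py \
-reducer ./reduce.py \
-input wcinput \
-output wcoutput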
7. Run run.sh
source run.sh
8. View the results
hdfs dfs -cat wcoutput/part-00000
II. Weather Data Analysis
1. Bulk-download the weather data
Fetch the data:
wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*
2. Decompress the data set and save it to a single local text file
Decompress everything into one file:
zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*.gz > qxdata.txt
Inspect the data (with line numbers):
less -N qxdata.txt
3. Write the map and reduce functions
gedit mapper.py
gedit reduce.py
mapper.py
#!/usr/bin/env python
# Extract (date, temperature) from each fixed-width NOAA ISD record
import sys

for line in sys.stdin:
    line = line.strip()
    # line[15:23]: observation date, YYYYMMDD
    # line[87:92]: signed air temperature in tenths of a degree Celsius
    # (per the ISD format, +9999 in this field marks a missing value,
    # which this simple version does not filter out)
    print('%s\t%d' % (line[15:23], int(line[87:92])))
reduce.py
#!/usr/bin/env python
# For each date, keep the maximum temperature; Hadoop Streaming
# delivers mapper output sorted by key, so equal dates are adjacent
import sys

current_date = None
current_temperature = 0

for line in sys.stdin:
    line = line.strip()
    date, temperature = line.split('\t', 1)
    try:
        temperature = int(temperature)
    except ValueError:
        continue
    if current_date == date:
        if current_temperature < temperature:
            current_temperature = temperature
    else:
        # new date: emit the finished group, then start over
        if current_date:
            print('%s\t%d' % (current_date, current_temperature))
        current_temperature = temperature
        current_date = date

# emit the last group (guarding against empty input)
if current_date:
    print('%s\t%d' % (current_date, current_temperature))
4. Test map and reduce locally
First make the scripts executable:
chmod 777 mapper.py
chmod 777 reduce.py
Create a small temporary test file:
gedit qxdata.txt
0230592870999992020010100004+23220+113480FM-12+007299999V0200451N0018199999999011800199+01471+01011102791ADDAA106999999AA224000091AJ199999999999999AY101121AY201121GA1999+009001999GE19MSL +99999+99999GF107991999999009001999999IA2999+01309KA1240M+02331KA2240N+01351MA1999999101931MD1210131+0161MW1001OC100671OD141200671059REMSYN004BUFR
cat qxdata.txt | ./mapper.py | ./reduce.py
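For the single test record above, the pipeline should print 20200101 followed by 147, i.e. the date and the maximum temperature in tenths of a degree Celsius (14.7 °C).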
5. Upload the weather data to HDFS
hdfs dfs -put qxdata.txt input
6. Submit the job with Hadoop Streaming
Write a .sh launcher:
gedit baiyun.sh
hadoop jar $STREAM \
-file /home/hadoop/baiyun/mapper.py \
-mapper /home/hadoop/baiyun/mapper.py \
-file /home/hadoop/baiyun/reduce.py \
-reducer /home/hadoop/baiyun/reduce.py \
-input /user/hadoop/input/qxdata.txt \
-output /user/hadoop/baiyunoutput
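Note: on recent Hadoop releases the streaming-specific -file option is deprecated in favor of the generic -files option; the form above still works but prints a warning.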
Run it:
source baiyun.sh
7. View the run results
hdfs dfs -cat baiyunoutput/part-00000
8. Copy the results back to the local machine
hdfs dfs -get baiyunoutput