I. Word Frequency Count
1. Write mapper.py
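The notes omit the script body here; a minimal word-count mapper sketch (assuming plain whitespace-separated text as input) would look like this:

#!/usr/bin/env python
import sys

# emit "word<TAB>1" for every token read from standard input
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))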
2. Write reduce.py
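Likewise, a minimal reducer sketch that sums the per-word counts, relying on Hadoop Streaming delivering the mapper output sorted by key:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# input arrives sorted by key, so all counts for a word are contiguous
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count

# flush the final word
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))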
3. Update the environment variables
Reload the environment so the change takes effect:
source ~/.bashrc
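The variable in question is presumably $STREAM, which the launcher scripts below hand to hadoop jar. A typical ~/.bashrc entry might be (the install location and jar version are assumptions; adjust to your setup):

export HADOOP_HOME=/usr/local/hadoop
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar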
4. Download the input data
5. Upload the data to HDFS
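For example (the file name and the wcinput directory are placeholders for whatever was downloaded in step 4):

hdfs dfs -mkdir -p wcinput
hdfs dfs -put input.txt wcinput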
6. Write run.sh
gedit run.sh
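A run.sh sketch mirroring the streaming invocation used in Part II below; the /home/hadoop script paths and the wcinput directory are assumptions, while wcoutput matches the directory read back in step 8:

hadoop jar $STREAM \
-file /home/hadoop/mapper.py \
-mapper /home/hadoop/mapper.py \
-file /home/hadoop/reduce.py \
-reducer /home/hadoop/reduce.py \
-input /user/hadoop/wcinput/* \
-output /user/hadoop/wcoutput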
7. Run run.sh
source run.sh
8. View the results
hdfs dfs -cat wcoutput/part-00000
II. Weather Data Analysis
1. Batch-download the weather data
wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*   # fetch the data
2. Decompress the dataset and save it into a local text file
zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2020/5*.gz > qxdata.txt   # decompress everything into one file
less -N qxdata.txt   # inspect the data
3. Write the map and reduce functions
gedit mapper.py
gedit reduce.py
mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    # NOAA ISD fixed-width record: characters 15-23 hold the date (YYYYMMDD),
    # characters 87-92 hold the signed air temperature (tenths of a degree Celsius)
    print('%s\t%d' % (line[15:23], int(line[87:92])))
reduce.py
#!/usr/bin/env python
import sys

current_date = None
current_temperature = 0

# input arrives sorted by key, so track the running maximum per date
for line in sys.stdin:
    line = line.strip()
    date, temperature = line.split('\t', 1)
    try:
        temperature = int(temperature)
    except ValueError:
        continue
    if current_date == date:
        if current_temperature < temperature:
            current_temperature = temperature
    else:
        if current_date:
            print('%s\t%d' % (current_date, current_temperature))
        current_temperature = temperature
        current_date = date

# flush the final date group
if current_date:
    print('%s\t%d' % (current_date, current_temperature))
4. Test map and reduce locally
First, make the scripts executable:
chmod 777 mapper.py
chmod 777 reduce.py
Create a small temporary test dataset:
gedit qxdata.txt
0230592870999992020010100004+23220+113480FM-12+007299999V0200451N0018199999999011800199+01471+01011102791ADDAA106999999AA224000091AJ199999999999999AY101121AY201121GA1999+009001999GE19MSL +99999+99999GF107991999999009001999999IA2999+01309KA1240M+02331KA2240N+01351MA1999999101931MD1210131+0161MW1001OC100671OD141200671059REMSYN004BUFR
cat qxdata.txt | ./mapper.py | ./reduce.py
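The result is easy to check by hand: in the sample record above, line[15:23] is 20200101 and line[87:92] is +0147, so the pipeline should print 20200101, a tab, and 147 (ISD stores air temperature in tenths of a degree Celsius).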
5. Upload the weather data to HDFS
hdfs dfs -put qxdata.txt input
6. Submit the job with Hadoop Streaming
Write a .sh launcher script:
gedit baiyun.sh
hadoop jar $STREAM \
-file /home/hadoop/baiyun/mapper.py \
-mapper /home/hadoop/baiyun/mapper.py \
-file /home/hadoop/baiyun/reduce.py \
-reducer /home/hadoop/baiyun/reduce.py \
-input /user/hadoop/input/qxdata.txt \
-output /user/hadoop/baiyunoutput
Run it:
source baiyun.sh
7. View the job results
hdfs dfs -cat baiyunoutput/part-00000
8. Retrieve the results to the local machine
hdfs dfs -get baiyunoutput
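Once retrieved, the output can be inspected with ordinary shell tools, e.g.:

cat baiyunoutput/part-00000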