大數據(2):基於sogou.500w.utf8數據hive的實踐


 

一.環境的搭建

1.安裝配置mysql

rpm –ivh MySQL-server-5.6.14.rpm

rpm –ivh MySQL-client-5.6.14.rpm

啟動mysql

創建hive用戶

grant all on *.* to hadoop@’%’ identified by ‘hadoop’;

grant all on *.* to hadoop@’localhost’ identified by ‘hadoop’;

grant all on *.* to hadoop@’master’ identified by ‘hadoop’;

創建hive數據庫

create database hive_1;

2. hive的安裝

tar –zxvf  apache-hive-0.13-1-bin.tar.gz

3. hive的配置

vi /apache-hive-0.13-1-bin/conf/hive-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>

<property>

<name>hive.metastore.local</name>

<value>true</value>

</property>

<property>

<name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://master:3306/hive_1?characterEncoding=UTF-8</value>

</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value>

</property>

<property>

<name>javax.jdo.option.ConnectionUserName</name> <value>hadoop</value>

</property>

<property>

<name>javax.jdo.option.ConnectionPassword</name> <value>hadoop</value>

 </property>

</configuration>

4. tar -zxvf  mysql-connector-java-5.1.27.tar.gz

5. 將java connector復制到依賴庫中

cp mysql-connector-java-5.1.27-bin.jar ~/apache-hive-0.13-1-bin/lib/

6. 配置環境變量

vi  .bash_profile

HIVE_HOME = /home/gdou/apache-hive-0.13-1-bin

PATH=$PATH:$HIVE_HOME/bin

二.搜狗日志數據分析

1. sogou.500w.utf8預處理

數據格式

 訪問時間\t  用戶ID \t  關鍵詞 \t  排名\t  \頁數 \t  URL

2.查看數據

 less sogou.500w.utf8

 wc -l sogou.500w.utf8

      head -100 sogou.500w.utf8  sogou.tmp

3. 數據擴展

 將時間字段拆分並拼接,添加年 月 日 小時字段

 bash sogou-log-extend.sh sogou.500w.utf8 sogou.500w.utf8.ext

4. 數據過濾

 過濾第二字段UID和第三個字段關鍵字為空的行

bash sogou-log-filter.sh sogou.500w.utf8.ext sogou.500w.utf8  sogou.500w.utf8.flt

5.將文件sogou.500w.utf8和sogou.500w.utf8.flt上傳至HDFS上。

6.HiveQL

—   基本操作

—  hive > show databases;

—  hive > create database sogou;

—  hive > use sogou;

—  hive > show tables;

—  hive > create external table sogou.sogou20111230(timestamp string, uid string, keyword string , rank int, order int, url string)

    > comment 'this is a sogou table'

    > row format delimited

    > fields terminated by '\t'

    > stored as textfile

    > location 'hdfs://master:9000/sogou/20111230';

— hive>show create table sogou.sogou20111230;

— hive>describe sogou.sogou20111230;

— hive> select * from sogou.sogou20111230 limit 3;

— select count(*) from sogou.sogou20111230;

— select count(distinct uid) from sogou.sogou20111230;

7.用hiveQL完成下列查詢(寫出HiveQL語句)

1)統計關鍵字非空查詢的條數;

select count(*) from sogou.sogou20111230 where keyword is not null;

結果為:5000000

2)查詢頻度最高的前五十個關鍵字;

select keyword,count(keyword) as num from sogou.sogou20111230 group by keyword order by num desc limit 50;

 

3)統計每個uid的平均查詢次數

select avg(bb.num) from (select count(b.uid) as num from sogou.sogou20111230 b group by b.uid) as bb;

 

輸出結果:3.69

4)搜索關鍵字內容包含‘仙劍奇俠’超過三次的用戶id

select tt.uid,tt.num from(select t.uid,count(t.uid) as num from (select * from sogou.sogou20111230 where keyword like concat('%','仙劍奇俠','%')) as t group by t.uid order by num desc) tt where tt.num > 3 limit 50;

輸出結果:

 

5)查找直接輸入URL作為關鍵字的條目;

select * from sogou.sogou20111230 where keyword rlike '[a-zA-z]+://[^\s]*' limit 50;

 

 

6)統計不重復的uid的行數;

select count(DISTINCT uid) from sogou.sogou20111230;

 

輸出結果:1352664

 

相關資料:

鏈接:http://pan.baidu.com/s/1dFD7mdr 密碼:xwu8


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM