Hive is a Hadoop-based data management system that serves both as an interactive analysis tool for analysts and as an execution engine for ETL and similar workloads, which makes it very significant for managing, analyzing, and processing today's big data. GeoIP is an IP mapping database that is updated regularly and ships with APIs for many languages, making it well suited as a data source for region-oriented data analysis.
Precondition: obtaining a user's geographic location from an IP address.
That is, take the user's IP and look up the corresponding information in an IP database.
In a typical IP database, each record has the same basic structure: an IP address range (start and end) plus the associated data. That data usually includes country, region (province/state), city, street, latitude/longitude, ISP, and similar fields.
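As an illustration, such a record could be modeled in Java as in the following minimal sketch; the class and field names are assumptions made here, not the format of any particular vendor's database:

    // Hypothetical layout of a single IP database record.
    public class IpRangeRecord {
        long startIp;       // range start, IPv4 stored as an unsigned 32-bit number
        long endIp;         // range end (inclusive)
        String country;
        String region;      // province/state
        String city;
        double latitude;
        double longitude;
        String isp;

        // True if the given numeric IP falls inside this record's range.
        boolean contains(long ip) {
            return ip >= startIp && ip <= endIp;
        }
    }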
Because IP databases change over time (though little within any short period), someone has to maintain and update them regularly. The data can never be fully accurate, nor can it cover every address; MaxMind publishes its city-level accuracy at http://www.maxmind.com/app/city_accuracy . Since there is no authoritative body governing this data and it changes constantly, each vendor gradually ends up with a dataset of its own.
Currently, the best-known database in China is the Chunzhen IP database (純真IP數據庫); abroad, maxmind and ip2location are in common use.
Paid or free: both exist. Actively maintained data is usually paid for, and tends to have somewhat higher accuracy and coverage.
Quality considerations:
- The key concepts are accuracy and coverage.
- Total number of records. Chunzhen currently has about 380,000 entries (as of the 2010-07-30 update).
- Whether the data is actively maintained.
- Update frequency: monthly or weekly. Databases are updated periodically; the free MaxMind edition is updated once a month.
Query modes:
- Local: download the IP database and query it locally. Lookups are efficient and fast, which suits statistical analysis. Local use takes two forms:
  - In-memory lookup: load all the data into memory for high-performance queries. Alternatively, the binary data file itself may already be an optimized index file that can be queried directly.
  - Database lookup: import the data into a database and query it there. This is slower than an in-memory lookup.
- Remote (web service or AJAX): call a remote third-party service. Lookups are naturally slower; this mode is mostly used in web pages.
The essence of the lookup: given an IP, find the IP address range that contains it, which is generally implemented with binary search (see the sketch below).
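To make that concrete, here is a minimal binary-search sketch over ranges sorted by start address, reusing the hypothetical IpRangeRecord above; it assumes the ranges do not overlap:

    import java.util.List;

    public class IpLookup {

        // Convert a dotted-quad IPv4 address to its unsigned 32-bit numeric value.
        static long ipToLong(String ip) {
            long value = 0;
            for (String part : ip.split("\\.")) {
                value = (value << 8) | Long.parseLong(part);
            }
            return value;
        }

        // Binary search over records sorted by startIp; returns the matching
        // record, or null if the IP falls into a gap between ranges.
        static IpRangeRecord find(List<IpRangeRecord> sorted, String ip) {
            long target = ipToLong(ip);
            int lo = 0, hi = sorted.size() - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                IpRangeRecord r = sorted.get(mid);
                if (target < r.startIp) {
                    hi = mid - 1;
                } else if (target > r.endIp) {
                    lo = mid + 1;
                } else {
                    return r;  // startIp <= target <= endIp
                }
            }
            return null;
        }
    }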
Whether an API is provided: some IP databases come with APIs for multiple languages (Java, JavaScript, C#, and so on), so you do not have to parse the data format, organize it, and write the lookup code yourself.
Whether latitude/longitude is provided: the Chunzhen IP database does not include coordinates, while Maxmind does; map applications generally need them.
UDF is the interface Hive provides for user-defined functions; by implementing it you can extend Hive's set of built-in functions. To add an IP mapping function to Hive, we simply call GeoIP's Java API from inside a UDF.
The GeoIP data files can be downloaded from http://www.maxmind.com/download/geoip/database/. Since we need country and city information, the file downloaded here is http://www.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz.
The GeoIP APIs for the various languages can be downloaded from http://www.maxmind.com/download/geoip/api/.
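Before wiring anything into Hive, the lookup can be verified in isolation. The following is a minimal smoke-test sketch of the legacy GeoIP Java API calls that the UDF below relies on; it assumes GeoLiteCity.dat has been unpacked into the working directory:

    import java.io.IOException;

    import com.maxmind.geoip.Location;
    import com.maxmind.geoip.LookupService;

    public class GeoIpSmokeTest {
        public static void main(String[] args) throws IOException {
            // Load the whole database into memory for fast repeated lookups.
            LookupService ls = new LookupService("GeoLiteCity.dat",
                    LookupService.GEOIP_MEMORY_CACHE);
            Location loc = ls.getLocation("221.12.10.218");
            if (loc != null) {
                System.out.println(loc.countryName + " / " + loc.city
                        + " (" + loc.latitude + ", " + loc.longitude + ")");
            }
            ls.close();
        }
    }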
The steps are as follows:
Step 1: the IP-address-resolution UDF to add to Hive is shown below:
    package org.hadoop.hive.additionalUDF;

    import java.io.File;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.hive.ql.exec.UDF;

    import com.maxmind.geoip.Location;
    import com.maxmind.geoip.LookupService;
    import com.maxmind.geoip.regionName;
    import com.maxmind.geoip.timeZone;

    public class IPToCC extends UDF {

        private static LookupService cl = null;
        private static String ipPattern = "\\d+\\.\\d+\\.\\d+\\.\\d+";
        private static String ipNumPattern = "\\d+";

        // Open the GeoIP database once and cache it in memory for reuse
        // across rows.
        static LookupService getLS(String dbfile) throws IOException {
            if (new File(dbfile).exists()) {
                if (cl == null) {
                    cl = new LookupService(dbfile, LookupService.GEOIP_MEMORY_CACHE);
                }
            }
            return cl;
        }

        /**
         * @param str      an IP address, either dotted ("114.43.181.143") or numeric
         * @param ipDBInfo path to the GeoLiteCity.dat data file
         * @return a tab-separated string of location fields, or null on failure
         */
        public String evaluate(String str, String ipDBInfo) {
            try {
                Location l1 = null;
                Matcher mIP = Pattern.compile(ipPattern).matcher(str);
                Matcher mIPNum = Pattern.compile(ipNumPattern).matcher(str);
                if (mIP.matches())
                    l1 = getLS(ipDBInfo).getLocation(str);
                else if (mIPNum.matches())
                    l1 = getLS(ipDBInfo).getLocation(Long.parseLong(str));
                // Guard against IPs that match neither pattern or are absent
                // from the database.
                if (l1 == null)
                    return null;
                return String.format("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s",
                        l1.countryCode, l1.countryName, l1.region,
                        regionName.regionNameByCode(l1.countryCode, l1.region),
                        l1.city, l1.latitude, l1.longitude,
                        timeZone.timeZoneByCountryAndRegion(l1.countryCode, l1.region));
            } catch (Exception e) {
                e.printStackTrace();
                if (cl != null) {
                    cl.close();
                    cl = null;  // allow the service to be reopened on the next call
                }
                return null;
            }
        }

        public static void main(String[] args) {
            String dbfile = "GeoLiteCity.dat";
            IPToCC ipTocc = new IPToCC();
            String ipAdress = "221.12.10.218";
            System.out.println(ipTocc.evaluate(ipAdress, dbfile));
        }
    }

Step 2: Package the program above together with the GeoIP API classes into a JAR, IPToCC.jar, and place it, along with the data file (GeoLiteCity.dat), somewhere on the server that runs Hive. The resources can then be added to Hive in either of the following two ways:

1> Start Hive and run the following statements:

landen@Master:~/UntarFile/hive-0.10.0$ bin/hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/home/landen/UntarFile/hive-0.10.0/lib/hive-common-0.10.0.jar!/hive-log4j.properties
Hive history file=/home/landen/UntarFile/hive-0.10.0/logs/hive_job_log_landen_201312081638_1930432077.txt
hive (default)> use stuchoosecourse;
OK
Time taken: 5.251 seconds
hive (stuchoosecourse)> add file /home/landen/UntarFile/GeoIP/GeoLiteCity.dat;
Added resource: /home/landen/UntarFile/GeoIP/GeoLiteCity.dat
hive (stuchoosecourse)> add jar /home/landen/UntarFile/hive-0.10.0/lib/IPTocc.jar;
Added /home/landen/UntarFile/hive-0.10.0/lib/IPTocc.jar to class path
Added resource: /home/landen/UntarFile/hive-0.10.0/lib/IPTocc.jar
hive (stuchoosecourse)> create temporary function IP4Tocc as 'org.hadoop.hive.additionalUDF.IPToCC';
OK
Time taken: 0.107 seconds

2> Before starting the Hive shell, create a .hiverc file in the $HIVE_HOME/conf directory with the following contents:

add file /home/landen/UntarFile/GeoIP/GeoLiteCity.dat;
add jar /home/landen/UntarFile/hive-0.10.0/lib/IPTocc.jar;
create temporary function IP4Tocc as 'org.hadoop.hive.additionalUDF.IPToCC';

When the Hive shell starts, it loads the contents of .hiverc into the global session, so the client can use the function directly.

Step 3: Testing in Hive:

hive (stuchoosecourse)> select * from ipidentifier;
OK
ipadress
221.12.10.218
60.180.248.201
125.111.251.118
Time taken: 0.099 seconds
hive (stuchoosecourse)> select IP4Tocc(ipadress,'./GeoLiteCity.dat') from ipidentifier;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312042044_0020, Tracking URL = http://Master:50030/jobdetails.jsp?jobid=job_201312042044_0020
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201312042044_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-08 20:54:10,276 Stage-1 map = 0%, reduce = 0%
2013-12-08 20:54:18,308 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2013-12-08 20:54:19,313 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2013-12-08 20:54:20,317 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2013-12-08 20:54:21,322 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2013-12-08 20:54:22,326 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2013-12-08 20:54:23,331 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2013-12-08 20:54:24,402 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.55 sec
MapReduce Total cumulative CPU time: 2 seconds 550 msec
Ended Job = job_201312042044_0020
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.55 sec   HDFS Read: 306   HDFS Write: 188   SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 550 msec
OK
_c0
CN      China   02      Zhejiang        Hangzhou        30.293594       120.16141       Asia/Shanghai
CN      China   02      Zhejiang        Wenzhou         27.999405       120.66681       Asia/Shanghai
CN      China   02      Zhejiang        Ningbo          29.878204       121.5495        Asia/Shanghai
hive (stuchoosecourse)> select split(IP4Tocc(ipadress,'./GeoLiteCity.dat'),'\t') from ipidentifier;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312042044_0021, Tracking URL = http://Master:50030/jobdetails.jsp?jobid=job_201312042044_0021
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201312042044_0021
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-08 21:12:46,717 Stage-1 map = 0%, reduce = 0%
2013-12-08 21:12:56,764 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.28 sec
2013-12-08 21:12:57,768 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.28 sec
2013-12-08 21:12:58,772 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.28 sec
2013-12-08 21:12:59,775 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.28 sec
2013-12-08 21:13:00,778 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.28 sec
2013-12-08 21:13:01,782 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.28 sec
2013-12-08 21:13:02,786 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.28 sec
MapReduce Total cumulative CPU time: 4 seconds 280 msec
Ended Job = job_201312042044_0021
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 4.28 sec   HDFS Read: 306   HDFS Write: 188   SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 280 msec
OK
_c0
["CN","China","02","Zhejiang","Hangzhou","30.293594","120.16141","Asia/Shanghai"]
["CN","China","02","Zhejiang","Wenzhou","27.999405","120.66681","Asia/Shanghai"]
["CN","China","02","Zhejiang","Ningbo","29.878204","121.5495","Asia/Shanghai"]
Time taken: 45.037 seconds
hive (stuchoosecourse)> create table HiddenIPInfo(
                      > IP string,countrycode string,countryname string,region string,regionname string,city string,
                      > latitude string,longitude string,timezone string);
OK
Time taken: 1.828 seconds
hive (stuchoosecourse)> show tables;
OK
tab_name
hbase_stu_course
hiddenipinfo
ipidentifier
Time taken: 0.486 seconds
hive (stuchoosecourse)> describe hiddenipinfo;
OK
col_name        data_type       comment
ip              string
countrycode     string
countryname     string
region          string
regionname      string
city            string
latitude        string
longitude       string
timezone        string
Time taken: 0.33 seconds
hive (stuchoosecourse)> from (select ipadress,split(IP4Tocc(ipadress,'./GeoLiteCity.dat'),'\t') as IPInfo from ipidentifier) e
                      > insert overwrite table hiddenipinfo
                      > select e.ipadress,e.IPInfo[0] as countrycode,e.IPInfo[1] as countryname,e.IPInfo[2] as region,
                      > e.IPInfo[3] as regionname,e.IPInfo[4] as city,e.IPInfo[5] as latitude,e.IPInfo[6] as longitude,
                      > e.IPInfo[7] as timezone;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201312042044_0023, Tracking URL = http://Master:50030/jobdetails.jsp?jobid=job_201312042044_0023
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201312042044_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-08 21:58:12,406 Stage-1 map = 0%, reduce = 0%
2013-12-08 21:58:18,449 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2013-12-08 21:58:19,454 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2013-12-08 21:58:20,458 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2013-12-08 21:58:21,462 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2013-12-08 21:58:22,466 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2013-12-08 21:58:23,470 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2013-12-08 21:58:24,474 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.48 sec
MapReduce Total cumulative CPU time: 1 seconds 480 msec
Ended Job = job_201312042044_0023
Ended Job = 39195028, job is filtered out (removed at runtime).
Ended Job = 1695434910, job is filtered out (removed at runtime).
Moving data to: hdfs://Master:9000/home/landen/UntarFile/hive-0.10.0/warehouse/hive_2013-12-08_21-57-40_106_7083774091282915969/-ext-10000
Loading data to table stuchoosecourse.hiddenipinfo
Deleted hdfs://Master:9000/home/landen/UntarFile/hive-0.10.0/warehouse/stuchoosecourse.db/hiddenipinfo
Table stuchoosecourse.hiddenipinfo stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 233, raw_data_size: 0]
Rows loaded to hiddenipinfo
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.48 sec   HDFS Read: 306   HDFS Write: 233   SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 480 msec
OK
ipadress        countrycode     countryname     region  regionname      city    latitude        longitude       timezone
Time taken: 45.692 seconds
hive (stuchoosecourse)> show tables;
OK
tab_name
hbase_stu_course
hiddenipinfo
ipidentifier
Time taken: 0.053 seconds
hive (stuchoosecourse)> select * from hiddenipinfo;
OK
ip              countrycode     countryname     region  regionname      city    latitude        longitude       timezone
221.12.10.218   CN      China   02      Zhejiang        Hangzhou        30.293594       120.16141       Asia/Shanghai
60.180.248.201  CN      China   02      Zhejiang        Wenzhou         27.999405       120.66681       Asia/Shanghai
125.111.251.118 CN      China   02      Zhejiang        Ningbo          29.878204       121.5495        Asia/Shanghai
Time taken: 0.083 seconds
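Finally, as a quick check outside of Hive, a small hypothetical driver like the one below can exercise the UDF directly against the same sample IPs. It assumes GeoLiteCity.dat sits in the working directory and the IPToCC class is on the classpath, and it only verifies that each lookup yields the eight expected tab-separated fields:

    public class IPToCCTest {
        public static void main(String[] args) {
            IPToCC udf = new IPToCC();
            String[] samples = {"221.12.10.218", "60.180.248.201", "125.111.251.118"};
            for (String ip : samples) {
                String result = udf.evaluate(ip, "GeoLiteCity.dat");
                // A successful lookup returns eight tab-separated fields.
                int fields = (result == null) ? 0 : result.split("\t", -1).length;
                System.out.println(ip + " -> " + fields + " fields: " + result);
            }
        }
    }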