A First Look at Hive UDFs


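1. Introduction

The previous post solved the problem of flattening complex data structures in a Hive table so that they could be imported into Kylin. After flattening, however, the impression PV computed from the ad logs is inflated, because one user corresponds to multiple tags and is therefore counted once per tag. To compute impression PV correctly, we have to create a separate view.

Requirements:

  • impression PV per DSP, and impression PV covered by tags;
  • cumulative impression PV, and cumulative tag-covered impression PV

This amounts to cube(dsp, tag) + measure(pv). The HiveQL is: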

select dsp, tag, count(*) as pv
from ad_view
where view = 'view' and day_time between '2016-04-18' and '2016-04-24'
group by dsp, tag with cube;
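
To make the with cube semantics concrete, here is a minimal plain-Java sketch (no Hive involved) that counts rows for every grouping set of (dsp, tag): (dsp, tag), (dsp), (tag), and the grand total. The class CubeSketch, the method cube, and the sample rows are hypothetical and only illustrate the idea. Hive reports a rolled-up column as NULL, which the sketch mirrors with the string "NULL" in its keys.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CubeSketch {
  // Each row is a hypothetical {dsp, tag} pair standing in for one impression.
  // Counts rows per grouping set, like GROUP BY dsp, tag WITH CUBE.
  static Map<String, Long> cube(List<String[]> rows) {
    Map<String, Long> pv = new LinkedHashMap<>();
    for (String[] r : rows) {
      // Bit 0 keeps dsp, bit 1 keeps tag; 4 masks give the 4 grouping sets.
      for (int mask = 0; mask < 4; mask++) {
        String dsp = (mask & 1) != 0 ? r[0] : "NULL";
        String tag = (mask & 2) != 0 ? r[1] : "NULL";
        pv.merge(dsp + "," + tag, 1L, Long::sum);
      }
    }
    return pv;
  }

  public static void main(String[] args) {
    List<String[]> rows = Arrays.asList(
        new String[] {"dspA", "TAGED"},
        new String[] {"dspA", "EMPTY"},
        new String[] {"dspB", "TAGED"});
    // Prints one line per (dsp, tag) combination with its PV count.
    cube(rows).forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}
```

The row keyed "NULL,NULL" is the cumulative PV, rows such as "dspA,NULL" are the per-DSP PV, and rows such as "NULL,TAGED" are the tag-covered PV across all DSPs.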

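Now the question is: how do we turn the raw table's tags array<struct<tag:string,label:string,src:string>> column into "has tag" (TAGED) or "no tag" (EMPTY)? The obvious approach is to write a UDF for the tags column that checks whether any tag is present.

2. Hands-on

Basics

User-defined functions (UDFs) include:

  • functions that transform a column value, such as round(), abs(), and concat();
  • aggregate functions, i.e. user-defined aggregate functions (UDAFs), such as sum() and avg();
  • table-generating functions, i.e. user-defined table generating functions (UDTFs), which produce multiple columns or multiple rows, such as explode() and inline()

UDTFs are restricted when used in a SELECT statement; for example, they cannot appear together with other columns: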

hive> SELECT name, explode(subordinates) FROM employees;
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions

Hive provides the LATERAL VIEW keyword, which wraps the output of the UDTF so that it can be combined with other columns:

hive> SELECT name, sub
> FROM employees
> LATERAL VIEW explode(subordinates) subView AS sub;
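
As a rough mental model, LATERAL VIEW explode(...) behaves like a flat-map: each input row is paired with every element produced from its array. The plain-Java sketch below mirrors the query above. The Employee class, the lateralExplode method, and the sample data are hypothetical and exist only for illustration. As in Hive without the OUTER keyword, a row whose array is empty produces no output rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LateralViewSketch {
  // Hypothetical stand-in for one row of the employees table.
  static final class Employee {
    final String name;
    final List<String> subordinates;

    Employee(String name, List<String> subordinates) {
      this.name = name;
      this.subordinates = subordinates;
    }
  }

  // Emits one {name, sub} pair per array element, like
  // SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subView AS sub
  static List<String[]> lateralExplode(List<Employee> employees) {
    List<String[]> out = new ArrayList<>();
    for (Employee e : employees) {
      for (String sub : e.subordinates) {
        out.add(new String[] {e.name, sub});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Employee> employees = Arrays.asList(
        new Employee("John", Arrays.asList("Mary", "Todd")),
        new Employee("Bill", Collections.<String>emptyList()));
    // Bill has no subordinates, so only John's two rows are printed.
    for (String[] row : lateralExplode(employees)) {
      System.out.println(row[0] + "\t" + row[1]);
    }
  }
}
```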

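UDF and GenericUDF

org.apache.hadoop.hive.ql.exec.UDF is the base class for column-transform functions and supports transformations on simple data types. To implement one, you provide an evaluate() method in your subclass. Compared with UDF, the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDF supports more complex processing and has three methods to implement:

  • initialize(ObjectInspector[] arguments) checks the types of the input arguments and determines the return type;
  • evaluate(DeferredObject[] arguments) implements the transformation itself; the type of its return value must match the return type declared in initialize;
  • getDisplayString(String[] children) returns the string used to display debug information for the Hadoop task.

The UDF that checks whether tags array<struct<tag:string,label:string,src:string>> holds only the empty tag (EMPTY) is implemented as follows: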

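package com.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@Description(name = "checkTag",
        value = "_FUNC_(array<struct>) - returns TAGED or EMPTY (no tag) "
                + "for the input array of struct.",
        extended = "Example:\n"
                + " > SELECT _FUNC_(tags_array) FROM src;")
public class CheckTag extends GenericUDF {
  private ListObjectInspector listOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 1) {
      throw new UDFArgumentLengthException("takes exactly 1 argument: array<struct>");
    }

    ObjectInspector a = arguments[0];
    if (!(a instanceof ListObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array");
    }
    this.listOI = (ListObjectInspector) a;

    if (!(listOI.getListElementObjectInspector() instanceof StructObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of struct");
    }

    // The function always returns a Java String.
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object list = arguments[0].get();
    // Treat a NULL column value the same as an empty array.
    if (listOI == null || list == null || listOI.getListLength(list) <= 0) {
      return "null_field";
    }

    StructObjectInspector structOI = (StructObjectInspector) listOI.getListElementObjectInspector();
    // Read the "tag" field of the first struct; guard against a NULL field value.
    Object tagData = structOI.getStructFieldData(listOI.getListElement(list, 0),
            structOI.getStructFieldRef("tag"));
    String tag = tagData == null ? null : tagData.toString();

    // A single element whose tag is "EMPTY" means the record has no real tag.
    if (listOI.getListLength(list) == 1 && "EMPTY".equals(tag)) {
      return "EMPTY";
    }
    return "TAGED";
  }

  @Override
  public String getDisplayString(String[] children) {
    return "check whether tag is empty";
  }

}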
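
The decision logic in evaluate() can be exercised without a Hive runtime. The sketch below restates it over plain Java collections, assuming each struct is represented as a Map keyed by field name. The class TagStatusSketch and the method classify are hypothetical and are not part of the UDF. Two further assumptions of this sketch: a null array is treated the same as an empty one, and a null tag value counts as TAGED.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class TagStatusSketch {
  // Null or empty array -> "null_field";
  // exactly one element whose tag is "EMPTY" -> "EMPTY";
  // anything else -> "TAGED".
  static String classify(List<Map<String, String>> tags) {
    if (tags == null || tags.isEmpty()) {
      return "null_field";
    }
    // Only the first struct's "tag" field is inspected, as in the UDF.
    String firstTag = tags.get(0).get("tag");
    if (tags.size() == 1 && "EMPTY".equals(firstTag)) {
      return "EMPTY";
    }
    return "TAGED";
  }

  public static void main(String[] args) {
    Map<String, String> empty = Collections.singletonMap("tag", "EMPTY");
    Map<String, String> sports = Collections.singletonMap("tag", "sports");
    System.out.println(classify(Collections.<Map<String, String>>emptyList()));
    System.out.println(classify(Collections.singletonList(empty)));
    System.out.println(classify(Arrays.asList(sports, empty)));
  }
}
```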

The following dependencies are also required:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.14.0</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.5.0-cdh5.3.2</version>
  <scope>provided</scope>
</dependency>

After compiling, package the code into a jar and put it on HDFS. Then run add jar, and the UDF can be called:

add jar hdfs://path/to/udf-1.0-SNAPSHOT.jar;
create temporary function checktag as 'com.hive.udf.CheckTag';

create view if not exists yooshu_view
partitioned on (day_time)
as
select uid, dsp, view, click, checktag(tags) as tag, day_time
from ad_base;

