1. 引言

在前一篇中，解決了Hive表中復雜數據結構平鋪化以導入Kylin的問題，但是平鋪之后計算廣告日志的曝光PV是翻倍的，因為一個用戶對應於多個標簽。所以，為了計算曝光PV，我們得另外創建視圖。

分析需求：

每個DSP上的曝光PV，標簽覆蓋的曝光PV；
累計曝光PV，累計標簽覆蓋曝光PV

相當於cube(dsp, tag) + measure(pv)，HiveQL如下：

select dsp, tag, count(*) as pv
from ad_view
where view = 'view' and day_time between '2016-04-18' and '2016-04-24'
group by dsp, tag with cube;

現在問題來了：如何將原始表中的tags array<struct<tag:string,label:string,src:string>> 轉換成有標簽（taged）、無標簽（empty）呢？顯而易見的辦法，為字段tags寫一個UDF來判斷是否有標簽。

2. 實戰

基本介紹

user-defined function (UDF)包括：

對於字段進行轉換操作的函數，如round()、abs()、concat()等；
聚集函數user-defined aggregate functions (UDAFs)，比如sum()、avg()等；
表生成函數user-defined table generating functions (UDTFs)，生成多列或多行數據，比如explode()、inline()等

UDTF的使用在與select語句使用時受到了限制，比如，不能與其他的列組合出現：

hive> SELECT name, explode(subordinates) FROM employees;
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions

Hive提供LATERAL VIEW關鍵字，對UDTF的輸入進行包裝（wrap），如此可以達到列組合的效果：

hive> SELECT name, sub
> FROM employees
> LATERAL VIEW explode(subordinates) subView AS sub;

UDF與GenericUDF

org.apache.hadoop.hive.ql.exec.UDF是字段轉換操作的基類，提供對於簡單數據類型進行轉換操作。在實現轉換操作時，需要重寫evaluate()方法。較UDF抽象類，org.apache.hadoop.hive.ql.udf.generic.GenericUDF提供更為復雜的處理方法類，包括三個方法：

initialize(ObjectInspector[] arguments)，檢查輸入參數的類型、確定返回值的類型；
evaluate(DeferredObject[] arguments)，字段轉換操作的實現函數，其返回值的類型與initialize方法中所指定的返回類型保持一致；
getDisplayString(String[] children)，給Hadoop任務展示debug信息的。

判斷tags array<struct<tag:string,label:string,src:string>>是否為空標簽（EMPTY）的UDF實現如下：

@Description(name = "checkTag",
        value = "_FUNC_(array<struct>) - from the input array of struct "+
                "returns the TAGED or EMPTY(no tag).",
        extended = "Example:\n"
                + " > SELECT _FUNC_(tags_array) FROM src;")
public class CheckTag extends GenericUDF {
  private ListObjectInspector listOI;

  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 1) {
      throw new UDFArgumentLengthException("only takes 1 arguments: List<T>");
    }

    ObjectInspector a = arguments[0];
    if (!(a instanceof ListObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array");
    }
    this.listOI = (ListObjectInspector) a;

    if(!(listOI.getListElementObjectInspector() instanceof StructObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of struct");
    }

    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    if(listOI == null || listOI.getListLength(arguments[0].get()) == 0) {
      return "null_field";
    }

    StructObjectInspector structOI = (StructObjectInspector) listOI.getListElementObjectInspector();
    String tag = structOI.getStructFieldData(listOI.getListElement(arguments[0].get(), 0),
            structOI.getStructFieldRef("tag")).toString();

    if (listOI.getListLength(arguments[0].get()) == 1 && tag.equals("EMPTY")) {
      return "EMPTY";
    }
    return "TAGED";
  }

  public String getDisplayString(String[] children) {
    return "check tag whether is empty";
  }

}

還需添加依賴：

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.14.0</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.5.0-cdh5.3.2</version>
  <scope>provided</scope>
</dependency>

編譯后打成jar包，放在HDFS上，然后add jar即可調用該UDF了：

add jar hdfs://path/to/udf-1.0-SNAPSHOT.jar;
create temporary function checktag as 'com.hive.udf.CheckTag';

create view if not exists yooshu_view
partitioned on (day_time)
as
select uid, dsp, view, click, checktag(tags) as tag, day_time
from ad_base;

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 【Hive五】Hive函數UDF Hive—UDF函數編寫 HIVE UDF函數和Transform Hive UDF函數測試 hive UDF添加方式 Hive UDF函數構建 hive UDF函數 Hive UDF開發-簡介 Hive之 Python寫UDF hive UDF函數