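Apache Hive offers two different interfaces for writing user-defined functions (UDFs). One is very simple; the other is, well, rather more involved.

If your function reads and returns primitive types (the basic Hadoop and Hive writable types such as Text, IntWritable, LongWritable, DoubleWritable, and so on), the simple API (org.apache.hadoop.hive.ql.exec.UDF) is all you need.

If, however, you want to write a UDF that manipulates embedded data structures such as Map, List, and Set, you will need to get familiar with the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API.

Simple API: org.apache.hadoop.hive.ql.exec.UDF
Complex API: org.apache.hadoop.hive.ql.udf.generic.GenericUDF

I will walk through one example UDF for each of the two APIs, with code and tests for every example.
If you would like to browse the code, fork it on GitHub: https://github.com/rathboma/hive-extension-examples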

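The Simple API

Building a UDF with the simple API involves nothing more than writing a class that extends UDF and implements a single method, evaluate. Here is an example: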

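import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

class SimpleUDFExample extends UDF {
    // Hive calls evaluate once per row with that row's column value.
    public Text evaluate(Text input) {
        return new Text("Hello " + input.toString());
    }
}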

Because the UDF is just a method on a plain class, you can test it with ordinary testing tools such as JUnit.

import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class SimpleUDFExampleTest {
    @Test
    public void testUDF() {
        SimpleUDFExample example = new SimpleUDFExample();
        Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
    }
}

You can also try the UDF directly in the Hive console, which is especially useful when you are not entirely sure the function handles your data correctly.

%> hive
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION helloworld as 'com.matthewrathbone.example.SimpleUDFExample';
hive> select helloworld(name) from people limit 1000;

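In fact, the UDF above has a bug: it does not check for null arguments. Nulls are very common in large datasets, so it pays to be careful. To fix this, I added a null check to the function: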

class SimpleUDFExample extends UDF {
    public Text evaluate(Text input) {
        if (input == null) return null;
        return new Text("Hello " + input.toString());
    }
}

I then added a test to verify it:

    @Test
    public void testUDFNullCheck() {
        SimpleUDFExample example = new SimpleUDFExample();
        Assert.assertNull(example.evaluate(null));
    }

Run the tests with mvn test to make sure every case passes.
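
One detail of the simple API is worth spelling out. The UDF base class does not declare evaluate itself; Hive looks the method up by reflection, which is why the signature can use whichever supported types you need and why a class may offer several evaluate overloads. The sketch below models that kind of lookup in plain Java without any Hive classes. GreetingFunction and EvaluateLookup are hypothetical names invented for illustration, and Hive's real resolver does more than this exact-type match.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a simple UDF: a class exposing public evaluate overloads.
final class GreetingFunction {
    public String evaluate(String name) {
        return name == null ? null : "Hello " + name;
    }

    public String evaluate(String greeting, String name) {
        return greeting + " " + name;
    }
}

final class EvaluateLookup {
    // Picks the evaluate overload whose parameter types match the runtime
    // argument types exactly, then invokes it. Null arguments are not
    // supported in this sketch because their type cannot be inferred.
    static Object call(Object function, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method method = function.getClass().getMethod("evaluate", types);
            method.setAccessible(true); // the sketch classes are not public
            return method.invoke(function, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        GreetingFunction f = new GreetingFunction();
        System.out.println(call(f, "world"));            // Hello world
        System.out.println(call(f, "Goodbye", "world")); // Goodbye world
    }
}
```

The lookup is by name and argument types only, so adding a second evaluate method is enough to make the same function accept a different argument list.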

The Complex API

The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API provides a way to handle objects that are not writable types, such as struct, map, and array types.

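This API requires you to manage the object inspectors for the function's arguments yourself, and to verify the number and types of the arguments you receive. An object inspector provides a consistent interface over the underlying object types, so that objects with different implementations can all be accessed in the same way inside Hive (for example, you could back a composite object with a Map, as long as you provide a corresponding object inspector).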
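
To make the idea of an object inspector more concrete, here is a minimal plain-Java sketch of the pattern. It does not use Hive's classes: ListReader, ArrayBackedReader, ListBackedReader and InspectorDemo are hypothetical names, and the interface is far smaller than Hive's real ListObjectInspector. The point is only that the caller reads data through the inspector and never depends on how the data is stored.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the idea behind an object inspector:
// callers read a list through the reader and never touch the raw object.
interface ListReader {
    int length(Object data);
    String element(Object data, int index);
}

// Reads lists that are stored as String[].
final class ArrayBackedReader implements ListReader {
    public int length(Object data) {
        return ((String[]) data).length;
    }

    public String element(Object data, int index) {
        return ((String[]) data)[index];
    }
}

// Reads lists that are stored as java.util.List.
final class ListBackedReader implements ListReader {
    public int length(Object data) {
        return ((List<?>) data).size();
    }

    public String element(Object data, int index) {
        return (String) ((List<?>) data).get(index);
    }
}

final class InspectorDemo {
    // Works for any storage format as long as a matching reader is supplied.
    static String join(Object data, ListReader reader) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < reader.length(data); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(reader.element(data, i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"a", "b"}, new ArrayBackedReader())); // a,b
        System.out.println(join(Arrays.asList("a", "b"), new ListBackedReader()));  // a,b
    }
}
```

The join method never casts the data itself, which is the same discipline a GenericUDF follows when it reads its arguments through the inspectors it was given.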

The API requires you to implement the following methods:

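// Like evaluate in the simple API: reads the input values and returns the result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;
// Not critical: we could return anything, but it should be a string that describes the function.
abstract String getDisplayString(String[] children);
// Called once, before any call to evaluate(). It receives an array of object inspectors
// that describe the types of the function's input arguments.
// This is where you verify that the function was given the right number and types of arguments.
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;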

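The interface is probably easiest to understand through an example, so read on.

Example

I will build a UDF called containsString to illustrate the API. It takes two arguments:
a list of Strings
a String

It returns true or false depending on whether the list contains the given string, like so: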

containsString(List("a", "b", "c"), "b"); // true
containsString(List("a", "b", "c"), "d"); // false
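
Before any Hive types get involved, the intended behaviour can be pinned down in plain Java. The sketch below is only a reference model: the class name ContainsStringModel is hypothetical, and returning null when either argument is null is an assumption that mirrors the null handling added to the simple UDF earlier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reference model of containsString; it uses no Hive classes.
final class ContainsStringModel {
    // Returns null when either argument is null (assumed null handling),
    // otherwise whether the list holds an element equal to the target.
    static Boolean containsString(List<String> list, String target) {
        if (list == null || target == null) {
            return null;
        }
        for (String s : list) {
            if (target.equals(s)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }

    public static void main(String[] args) {
        List<String> letters = Arrays.asList("a", "b", "c");
        System.out.println(containsString(letters, "b")); // true
        System.out.println(containsString(letters, "d")); // false
        System.out.println(containsString(null, "b"));    // null
    }
}
```

A model like this is also handy as an oracle in unit tests, since the Hive version should agree with it for every input.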

Unlike the UDF interface, the GenericUDF interface is rather more verbose.

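import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {
    ListObjectInspector listOI;
    StringObjectInspector listElementOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] children) {
        return "arrayContainsExample()"; // this should probably be better
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
        }
        // 1. Check that we received the right argument types
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;
        // 2. Check that the list contains strings
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list of strings");
        }
        this.listElementOI = (StringObjectInspector) listOI.getListElementObjectInspector();
        // The return type is boolean, so we hand back the matching object inspector
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Use the object inspectors to get the list and the string out of the deferred objects
        Object list = arguments[0].get();
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
        // Check for nulls
        if (list == null || arg == null) {
            return null;
        }
        // Read each element through its inspector rather than casting to List<String>,
        // because at query time the elements may be Text objects instead of java.lang.String
        int length = listOI.getListLength(list);
        for (int i = 0; i < length; i++) {
            String s = listElementOI.getPrimitiveJavaObject(listOI.getListElement(list, i));
            if (arg.equals(s)) return Boolean.TRUE;
        }
        return Boolean.FALSE;
    }
}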

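Code walkthrough

The function is called in the following sequence:

1. The UDF is instantiated using its default constructor.

2. udf.initialize() is called with an array of object inspectors for the UDF's arguments (ListObjectInspector, StringObjectInspector).
1) We check that there are two arguments and that they have the right types (see above).
2) We save the object inspectors for use in evaluate() (listOI, elementOI).
3) We return an object inspector so that Hive can read the function's result (BooleanObjectInspector).

3. For each row in the query, evaluate is called with the specified columns of that row (for example, evaluate(List("a", "b", "c"), "c")).
1) We use the object inspectors stored in initialize to extract the right values.
2) We apply our logic and return a value that Hive can read with the object inspector returned from initialize (list.contains(element) ? true : false).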
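
The call order described above can be modelled in a few lines of plain Java. This is a sketch under stated assumptions, not Hive's execution engine: MiniUdf, MiniContains and MiniDriver are hypothetical names, and type names are passed as plain strings where Hive would pass object inspectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of the GenericUDF life cycle; these are not Hive types.
interface MiniUdf {
    // Called once with the argument type names; returns the result type name.
    String initialize(String[] argumentTypes);

    // Called once per row with that row's argument values.
    Object evaluate(Object[] row);
}

final class MiniContains implements MiniUdf {
    private boolean initialized;

    public String initialize(String[] argumentTypes) {
        if (argumentTypes.length != 2) {
            throw new IllegalArgumentException("expected 2 arguments: array<string>, string");
        }
        if (!"array<string>".equals(argumentTypes[0]) || !"string".equals(argumentTypes[1])) {
            throw new IllegalArgumentException("expected (array<string>, string)");
        }
        initialized = true;
        return "boolean";
    }

    public Object evaluate(Object[] row) {
        if (!initialized) {
            throw new IllegalStateException("initialize() was not called");
        }
        List<?> list = (List<?>) row[0];
        Object target = row[1];
        if (list == null || target == null) {
            return null;
        }
        return list.contains(target);
    }
}

final class MiniDriver {
    // Mirrors the order described above: initialize once, then evaluate per row.
    static List<Object> run(MiniUdf udf, String[] types, List<Object[]> rows) {
        udf.initialize(types);
        List<Object> results = new ArrayList<>();
        for (Object[] row : rows) {
            results.add(udf.evaluate(row));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {Arrays.asList("a", "b", "c"), "c"});
        rows.add(new Object[] {Arrays.asList("a", "b", "c"), "d"});
        rows.add(new Object[] {null, "d"});
        String[] types = {"array<string>", "string"};
        System.out.println(run(new MiniContains(), types, rows)); // [true, false, null]
    }
}
```

Because validation happens once in initialize, the per-row evaluate path stays free of type checks, which is the main design benefit of splitting the two steps.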

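Testing

The tricky part of testing this function is the initialization. Once the call order is clear, we know how to set up the object under test, and the rest is straightforward.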

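import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaBooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleTest {
    @Test
    public void testComplexUDFReturnsCorrectValues() throws HiveException {
        // set up the objects we need
        ComplexUDFExample example = new ComplexUDFExample();
        ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        ObjectInspector listOI = ObjectInspectorFactory.getStandardListObjectInspector(stringOI);
        JavaBooleanObjectInspector resultInspector = (JavaBooleanObjectInspector) example.initialize(new ObjectInspector[]{listOI, stringOI});
        // create the actual UDF arguments
        List<String> list = new ArrayList<String>();
        list.add("a");
        list.add("b");
        list.add("c");
        // test the results
        // a value that exists
        Object result = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("a")});
        Assert.assertEquals(true, resultInspector.get(result));
        // a value that does not exist
        Object result2 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("d")});
        Assert.assertEquals(false, resultInspector.get(result2));
        // null arguments
        Object result3 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(null), new DeferredJavaObject(null)});
        Assert.assertNull(result3);
    }
}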

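Closing remarks

Hopefully this article has given you an idea of how to extend Hive with custom functions.
There are things this article does not cover, notably UDAFs and UDTFs. A UDAF can process and aggregate multiple rows within a single function. If you want to learn more, there are resources available that can help.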

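Also worth reading is the Apache Hive book from O'Reilly. It contains concise tutorials on UDFs and UDAFs along with code examples, which make it easier to understand how to build these functions, which exceptions you must declare, and which types you must return.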

Translated from

http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
